summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-07-14sched/deadline: Reset extra_bw to max_bw when clearing root domainsJuri Lelli
dl_clear_root_domain() doesn't take into account the fact that per-rq extra_bw variables retain values computed before root domain changes, resulting in broken accounting. Fix it by resetting extra_bw to max_bw before restoring back dl-servers contributions. Fixes: 2ff899e351643 ("sched/deadline: Rebuild root domain accounting after every update") Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b Link: https://lore.kernel.org/r/20250627115118.438797-3-juri.lelli@redhat.com
2025-07-14sched/deadline: Initialize dl_servers after SMPJuri Lelli
dl-servers are currently initialized too early at boot when CPUs are not fully up (only boot CPU is). This results in miscalculation of per runqueue DEADLINE variables like extra_bw (which needs a stable CPU count). Move initialization of dl-servers later on after SMP has been initialized and CPUs are all online, so that CPU count is stable and DEADLINE variables can be computed correctly. Fixes: d741f297bceaf ("sched/fair: Fair server interface") Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b Link: https://lore.kernel.org/r/20250627115118.438797-2-juri.lelli@redhat.com
2025-07-14sched: Change nr_uninterruptible type to unsigned longAruna Ramakrishna
The commit e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit") changed nr_uninterruptible to an unsigned int. But the nr_uninterruptible values for each of the CPU runqueues can grow to large numbers, sometimes exceeding INT_MAX. This is valid, if, over time, a large number of tasks are migrated off of one CPU after going into an uninterruptible state. Only the sum of all nr_interruptible values across all CPUs yields the correct result, as explained in a comment in kernel/sched/loadavg.c. Change the type of nr_uninterruptible back to unsigned long to prevent overflows, and thus the miscalculation of load average. Fixes: e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit") Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250709173328.606794-1-aruna.ramakrishna@oracle.com
2025-07-13mm/page_alloc: add support for initializing pageblock as isolatedZi Yan
MIGRATE_ISOLATE is a standalone bit, so a pageblock cannot be initialized to just MIGRATE_ISOLATE. Add init_pageblock_migratetype() to enable initialize a pageblock with a migratetype and isolated. Link: https://lkml.kernel.org/r/20250617021115.2331563-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13kernel,cpuset: use node-notifier instead of memory-notifierOscar Salvador
cpuset is only concerned when a numa node changes its memory state, as it needs to know the current numa nodes with memory to keep an updated mems_allowed mask. So stop using the memory notifier and use the new numa node notifer instead. Link: https://lkml.kernel.org/r/20250616135158.450136-9-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13mm, madvise: extract mm code from prctl_set_vma() to mm/madvise.cVlastimil Babka
Setting anon_name is done via madvise_set_anon_name() and behaves a lot of like other madvise operations. However, apparently because madvise() has lacked the 4th argument and prctl() not, the userspace entry point has been implemented via prctl(PR_SET_VMA, ...) and handled first by prctl_set_vma(). Currently prctl_set_vma() lives in kernel/sys.c but setting the vma->anon_name is mm-specific code so extract it to a new set_anon_vma_name() function under mm. mm/madvise.c seems to be the most straightforward place as that's where madvise_set_anon_name() lives. Stop declaring the latter in mm.h and instead declare set_anon_vma_name(). Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-2-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13Merge tag 'perf_urgent_for_v6.16_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Prevent perf_sigtrap() from observing an exiting task and warning about it * tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix WARN in perf_sigtrap()
2025-07-12Merge branch 'mm-hotfixes-stable' into mm-stable to pick up changes whichAndrew Morton
are required for a merge of the series "mm: folio_pte_batch() improvements".
2025-07-12Merge tag 'mm-hotfixes-stable-2025-07-11-16-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "19 hotfixes. A whopping 16 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 14 are for MM. Three gdb-script fixes and a kallsyms build fix" * tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: Revert "sched/numa: add statistics of numa balance task" mm: fix the inaccurate memory statistics issue for users mm/damon: fix divide by zero in damon_get_intervals_score() samples/damon: fix damon sample mtier for start failure samples/damon: fix damon sample wsse for start failure samples/damon: fix damon sample prcl for start failure kasan: remove kasan_find_vm_area() to prevent possible deadlock scripts: gdb: vfs: support external dentry names mm/migrate: fix do_pages_stat in compat mode mm/damon/core: handle damon_call_control as normal under kdmond deactivation mm/rmap: fix potential out-of-bounds page table access during batched unmap mm/hugetlb: don't crash when allocating a folio if there are no resv scripts/gdb: de-reference per-CPU MCE interrupts scripts/gdb: fix interrupts.py after maple tree conversion maple_tree: fix mt_destroy_walk() on root leaf node mm/vmalloc: leave lazy MMU mode on PTE mapping error scripts/gdb: fix interrupts display after MCP on x86 lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users() kallsyms: fix build without execinfo
2025-07-11locking/rwsem: Use OWNER_NONSPINNABLE directly instead of OWNER_SPINNABLEJinliang Zheng
After commit 7d43f1ce9dd0 ("locking/rwsem: Enable time-based spinning on reader-owned rwsem"), OWNER_SPINNABLE contains all possible values except OWNER_NONSPINNABLE, namely OWNER_NULL | OWNER_WRITER | OWNER_READER. Therefore, it is better to use OWNER_NONSPINNABLE directly to determine whether to exit optimistic spin. And, remove useless OWNER_SPINNABLE to simplify the code. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/r/20250610130158.4876-1-alexjlzheng@tencent.com
2025-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc6-2). No conflicts. Adjacent changes: drivers/net/wireless/mediatek/mt76/mt7925/mcu.c c701574c5412 ("wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan") b3a431fe2e39 ("wifi: mt76: mt7925: fix off by one in mt7925_mcu_hw_scan()") drivers/net/wireless/mediatek/mt76/mt7996/mac.c 62da647a2b20 ("wifi: mt76: mt7996: Add MLO support to mt7996_tx_check_aggr()") dc66a129adf1 ("wifi: mt76: add a wrapper for wcid access with validation") drivers/net/wireless/mediatek/mt76/mt7996/main.c 3dd6f67c669c ("wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl()") 8989d8e90f5f ("wifi: mt76: mt7996: Do not set wcid.sta to 1 in mt7996_mac_sta_event()") net/mac80211/cfg.c 58fcb1b4287c ("wifi: mac80211: reject VHT opmode for unsupported channel widths") 037dc18ac3fb ("wifi: mac80211: add support for storing station S1G capabilities") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11panic: Fix up description of vpanic()Nam Cao
The description above vpanic() has the wrong function name. Fix it up. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Gabriele Monaco <gmonaco@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/23a7e8add6546b155371b7e0fbb37bb1def13d6e.1752232374.git.namcao@linutronix.de Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/lkml/20250711183802.2d8c124d@canb.auug.org.au/ Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-11bpf: Remove attach_type in bpf_tracing_linkTao Chen
Use attach_type in bpf_link, and remove it in bpf_tracing_link. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-7-chen.dylane@linux.dev
2025-07-11bpf: Remove attach_type in bpf_netns_linkTao Chen
Use attach_type in bpf_link, and remove it in bpf_netns_link. And move netns_type field to the end to fill the byte hole. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20250710032038.888700-6-chen.dylane@linux.dev
2025-07-11bpf: Remove location field in tcx_linkTao Chen
Use attach_type in bpf_link to replace the location filed, and remove location field in tcx_link. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20250710032038.888700-5-chen.dylane@linux.dev
2025-07-11bpf: Remove attach_type in bpf_cgroup_linkTao Chen
Use attach_type in bpf_link, and remove it in bpf_cgroup_link. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-3-chen.dylane@linux.dev
2025-07-11bpf: Add attach_type field to bpf_linkTao Chen
Attach_type will be set when a link is created by user. It is better to record attach_type in bpf_link generically and have it available universally for all link types. So add the attach_type field in bpf_link and move the sleepable field to avoid unnecessary gap padding. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-2-chen.dylane@linux.dev
2025-07-11bpf: Forget ranges when refining tnum after JSETPaul Chaignon
Syzbot reported a kernel warning due to a range invariant violation on the following BPF program. 0: call bpf_get_netns_cookie 1: if r0 == 0 goto <exit> 2: if r0 & Oxffffffff goto <exit> The issue is on the path where we fall through both jumps. That path is unreachable at runtime: after insn 1, we know r0 != 0, but with the sign extension on the jset, we would only fallthrough insn 2 if r0 == 0. Unfortunately, is_branch_taken() isn't currently able to figure this out, so the verifier walks all branches. The verifier then refines the register bounds using the second condition and we end up with inconsistent bounds on this unreachable path: 1: if r0 == 0 goto <exit> r0: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0xffffffffffffffff) 2: if r0 & 0xffffffff goto <exit> r0 before reg_bounds_sync: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0) r0 after reg_bounds_sync: u64=[0x1, 0] var_off=(0, 0) Improving the range refinement for JSET to cover all cases is tricky. We also don't expect many users to rely on JSET given LLVM doesn't generate those instructions. So instead of improving the range refinement for JSETs, Eduard suggested we forget the ranges whenever we're narrowing tnums after a JSET. This patch implements that approach. Reported-by: syzbot+c711ce17dd78e5d4fdcf@syzkaller.appspotmail.com Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/9d4fd6432a095d281f815770608fdcd16028ce0b.1752171365.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-11bpf/arena: add bpf_arena_reserve_pages kfuncEmil Tsalapatis
Add a new BPF arena kfunc for reserving a range of arena virtual addresses without backing them with pages. This prevents the range from being populated using bpf_arena_alloc_pages(). Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250709191312.29840-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-11cpu: Define attack vectorsDavid Kaplan
Define 4 new attack vectors that are used for controlling CPU speculation mitigations. These may be individually disabled as part of the mitigations= command line. Attack vector controls are combined with global options like 'auto' or 'auto,nosmt' like 'mitigations=auto,no_user_kernel'. The global options come first in the mitigations= string. Cross-thread mitigations can either remain enabled fully, including potentially disabling SMT ('auto,nosmt'), remain enabled except for disabling SMT ('auto'), or entirely disabled through the new 'no_cross_thread' attack vector option. The default settings for these attack vectors are consistent with existing kernel defaults, other than the automatic disabling of VM-based attack vectors if KVM support is not present. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-3-david.kaplan@amd.com
2025-07-11Merge tag 'dma-mapping-6.16-2025-07-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fix from Marek Szyprowski: - small fix relevant to arm64 server and custom CMA configuration (Feng Tang) * tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-contiguous: hornor the cma address limit setup by user
2025-07-11futex: Remove support for IMMUTABLESebastian Andrzej Siewior
The FH_FLAG_IMMUTABLE flag was meant to avoid the reference counting on the private hash and so to avoid the performance regression on big machines. With the switch to per-CPU counter this is no longer needed. That flag was never useable on any released kernel. Remove any support for IMMUTABLE while preserve the flags argument and enforce it to be zero. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250710110011.384614-5-bigeasy@linutronix.de
2025-07-11futex: Make futex_private_hash_get() staticSebastian Andrzej Siewior
futex_private_hash_get() is not used outside if its compilation unit. Make it static. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250710110011.384614-4-bigeasy@linutronix.de
2025-07-11futex: Use RCU-based per-CPU reference counting instead of rcuref_tPeter Zijlstra
The use of rcuref_t for reference counting introduces a performance bottleneck when accessed concurrently by multiple threads during futex operations. Replace rcuref_t with special crafted per-CPU reference counters. The lifetime logic remains the same. The newly allocate private hash starts in FR_PERCPU state. In this state, each futex operation that requires the private hash uses a per-CPU counter (an unsigned int) for incrementing or decrementing the reference count. When the private hash is about to be replaced, the per-CPU counters are migrated to a atomic_t counter mm_struct::futex_atomic. The migration process: - Waiting for one RCU grace period to ensure all users observe the current private hash. This can be skipped if a grace period elapsed since the private hash was assigned. - futex_private_hash::state is set to FR_ATOMIC, forcing all users to use mm_struct::futex_atomic for reference counting. - After a RCU grace period, all users are guaranteed to be using the atomic counter. The per-CPU counters can now be summed up and added to the atomic_t counter. If the resulting count is zero, the hash can be safely replaced. Otherwise, active users still hold a valid reference. - Once the atomic reference count drops to zero, the next futex operation will switch to the new private hash. call_rcu_hurry() is used to speed up transition which otherwise might be delay with RCU_LAZY. There is nothing wrong with using call_rcu(). The side effects would be that on auto scaling the new hash is used later and the SET_SLOTS prctl() will block longer. [bigeasy: commit description + mm get/ put_async] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250710110011.384614-3-bigeasy@linutronix.de
2025-07-11irqdomain: Export irq_domain_free_irqs_top()Nam Cao
Export irq_domain_free_irqs_top(), making it usable for drivers compiled as modules. Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-07-11Merge back earlier changes related to system suspend and hibernationRafael J. Wysocki
2025-07-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc6). No conflicts. Adjacent changes: Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml 0a12c435a1d6 ("dt-bindings: net: sun8i-emac: Add A100 EMAC compatible") b3603c0466a8 ("dt-bindings: net: sun8i-emac: Rename A523 EMAC0 to GMAC0") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()Samuel Zhang
When hibernate with data center dGPUs, huge number of VRAM data will be moved to shmem during dev_pm_ops.prepare(). These shmem pages take a lot of system memory so that there's no enough free memory for creating the hibernation image. This will cause hibernation fail and abort. After dev_pm_ops.prepare(), call shrink_all_memory() to force move shmem pages to swap disk and reclaim the pages, so that there's enough system memory for hibernation image and less pages needed to copy to the image. This patch can only flush and free about half shmem pages. It will be better to flush and free more pages, even all of shmem pages, so that there're less pages to be copied to the hibernation image and the overall hibernation time can be reduced. Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20250710062313.3226149-4-guoqing.zhang@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-07-10PM: sleep: add kernel parameter to disable asynchronous suspend/resumeTudor Ambarus
On some platforms, device dependencies are not properly represented by device links, which can cause issues when asynchronous power management is enabled. While it is possible to disable this via sysfs, doing so at runtime can race with the first system suspend event. This patch introduces a kernel command-line parameter, "pm_async", which can be set to "off" to globally disable asynchronous suspend and resume operations from early boot. It effectively provides a way to set the initial value of the existing pm_async sysfs knob at boot time. This offers a robust method to fall back to synchronous (sequential) operation, which can stabilize platforms with problematic dependencies and also serve as a useful debugging tool. The default behavior remains unchanged (asynchronous enabled). To disable it, boot the kernel with the "pm_async=off" parameter. Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20250709-pm-async-off-v3-1-cb69a6fc8d04@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-07-09kthread: update comment for __to_kthreadJiazi Li
With commit 343f4c49f243 ("kthread: Don't allocate kthread_struct for init and umh") and commit 753550eb0ce1 ("fork: Explicitly set PF_KTHREAD"), umh task no longer have struct kthread and PF_KTHREAD flag. Update the comment to describe what the current rules are to detect is something is a kthread. Link: https://lkml.kernel.org/r/20250620100801.23185-1-jqqlijiazi@gmail.com Signed-off-by: Jiazi Li <jqqlijiazi@gmail.com> Signed-off-by: mingzhu.wang <mingzhu.wang@transsion.com> Suggested-by Eric W . Biederman <ebiederm@xmission.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09fork: clean up ifdef logic around stack allocationPasha Tatashin
There is an unneeded OR in the ifdef functions that are used to allocate and free kernel stacks based on direct map or vmap. Adding dynamic stack support would complicate this logic even further. Therefore, clean up by changing the order so OR is no longer needed. Link: https://lkml.kernel.org/r/20250618-fork-fixes-v4-1-2e05a2e1f5fc@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-3-pasha.tatashin@soleen.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09uprobes: revert ref_ctr_offset in uprobe_unregister error pathJiri Olsa
There's error path that could lead to inactive uprobe: 1) uprobe_register succeeds - updates instruction to int3 and changes ref_ctr from 0 to 1 2) uprobe_unregister fails - int3 stays in place, but ref_ctr is changed to 0 (it's not restored to 1 in the fail path) uprobe is leaked 3) another uprobe_register comes and re-uses the leaked uprobe and succeds - but int3 is already in place, so ref_ctr update is skipped and it stays 0 - uprobe CAN NOT be triggered now 4) uprobe_unregister fails because ref_ctr value is unexpected Fix this by reverting the updated ref_ctr value back to 1 in step 2), which is the case when uprobe_unregister fails (int3 stays in place), but we have already updated refctr. The new scenario will go as follows: 1) uprobe_register succeeds - updates instruction to int3 and changes ref_ctr from 0 to 1 2) uprobe_unregister fails - int3 stays in place and ref_ctr is reverted to 1.. uprobe is leaked 3) another uprobe_register comes and re-uses the leaked uprobe and succeds - but int3 is already in place, so ref_ctr update is skipped and it stays 1 - uprobe CAN be triggered now 4) uprobe_unregister succeeds Link: https://lkml.kernel.org/r/20250514101809.2010193-1-jolsa@kernel.org Fixes: 1cc33161a83d ("uprobes: Support SDT markers having reference count (semaphore)") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09exit: fix misleading comment in forget_original_parent()Fushuai Wang
The commit 482a3767e508 ("exit: reparent: call forget_original_parent() under tasklist_lock") moved the comment from exit_notify() to forget_original_parent(). However, the forget_original_parent() only handles (A), while (B) is handled in kill_orphaned_pgrp(). So remove the unrelated part. Link: https://lkml.kernel.org/r/20250615030930.58051-1-wangfushuai@baidu.com Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: wangfushuai <wangfushuai@baidu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09kcov: fix typo in comment of kcov_fault_in_areaWei Nanxin
change '__santizer_cov_trace_pc()' to '__sanitizer_cov_trace_pc()' Link: https://lkml.kernel.org/r/20250615123237.110144-1-n9winx@163.com Signed-off-by: Wei Nanxin <n9winx@163.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Macro Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09relayfs: support a counter tracking if data is too big to writeJason Xing
It really doesn't matter if the user/admin knows what the last too big value is. Record how many times this case is triggered would be helpful. Solve the existing issue where relay_reset() doesn't restore the value. Store the counter in the per-cpu buffer structure instead of the global buffer structure. It also solves the racy condition which is likely to happen when a few of per-cpu buffers encounter the too big data case and then access the global field last_toobig without lock protection. Remove the printk in relay_close() since kernel module can directly call relay_stats() as they want. Link: https://lkml.kernel.org/r/20250612061201.34272-6-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09blktrace: use rbuf->stats.full as a drop indicator in relayfsJason Xing
Replace internal subbuf_start in blktrace with the default policy in relayfs. Remove dropped field from struct blktrace. Correspondingly, call the common helper in relay. By incrementing full_count to keep track of how many times we encountered a full buffer issue, user space will know how many events were lost. Link: https://lkml.kernel.org/r/20250612061201.34272-5-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09relayfs: introduce getting relayfs statistics functionJason Xing
In this version, only support getting the counter for buffer full and implement the framework of how it works. Users can pass certain flag to fetch what field/statistics they expect to know. Each time it only returns one result. So do not pass multiple flags. Link: https://lkml.kernel.org/r/20250612061201.34272-4-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09relayfs: support a counter tracking if per-cpu buffers is fullJason Xing
When using relay mechanism, we often encounter the case where new data are lost or old unconsumed data are overwritten because of slow reader. Add 'full' field in per-cpu buffer structure to detect if the above case is happening. Relay has two modes: 1) non-overwrite mode, 2) overwrite mode. So buffer being full here respectively means: 1) relayfs doesn't intend to accept new data and then simply drop them, or 2) relayfs is going to start over again and overwrite old unread data with new data. Note: this counter doesn't need any explicit lock to protect from being modified by different threads for the better performance consideration. Writers calling __relay_write/relay_write should consider how to use the lock and ensure it performs under the lock protection, thus it's not necessary to add a new small lock here. Link: https://lkml.kernel.org/r/20250612061201.34272-3-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09relayfs: abolish prev_paddingJason Xing
Patch series "relayfs: misc changes", v5. The series mostly focuses on the error counters which helps every user debug their own kernel module. This patch (of 5): prev_padding represents the unused space of certain subbuffer. If the content of a call of relay_write() exceeds the limit of the remainder of this subbuffer, it will skip storing in the rest space and record the start point as buf->prev_padding in relay_switch_subbuf(). Since the buf is a per-cpu big buffer, the point of prev_padding as a global value for the whole buffer instead of a single subbuffer (whose padding info is stored in buf->padding[]) seems meaningless from the real use cases, so we don't bother to record it any more. Link: https://lkml.kernel.org/r/20250612061201.34272-1-kerneljasonxing@gmail.com Link: https://lkml.kernel.org/r/20250612061201.34272-2-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09kernel: relay: use __GFP_ZERO in relay_alloc_bufElijah Wright
Passing the __GFP_ZERO flag to alloc_page should result in less overhead th= an using memset() Link: https://lkml.kernel.org/r/20250610225639.314970-3-git@elijahs.space Signed-off-by: Elijah Wright <git@elijahs.space> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09fork: define a local GFP_VMAP_STACKLinus Walleij
The current allocation of VMAP stack memory is using (THREADINFO_GFP & ~__GFP_ACCOUNT) which is a complicated way of saying (GFP_KERNEL | __GFP_ZERO): <linux/thread_info.h>: define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO) <linux/gfp_types.h>: define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT) This is an unfortunate side-effect of independent changes blurring the picture: commit 19809c2da28aee5860ad9a2eff760730a0710df0 changed (THREADINFO_GFP | __GFP_HIGHMEM) to just THREADINFO_GFP since highmem became implicit. commit 9b6f7e163cd0f468d1b9696b785659d3c27c8667 then added stack caching and rewrote the allocation to (THREADINFO_GFP & ~__GFP_ACCOUNT) as cached stacks need to be accounted separately. However that code, when it eventually accounts the memory does this: ret = memcg_kmem_charge(vm->pages[i], GFP_KERNEL, 0) so the memory is charged as a GFP_KERNEL allocation. Define a unique GFP_VMAP_STACK to use GFP_KERNEL | __GFP_ZERO and move the comment there. Link: https://lkml.kernel.org/r/20250509-gfp-stack-v1-1-82f6f7efc210@linaro.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reported-by: Mateusz Guzik <mjguzik@gmail.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks codePasha Tatashin
There are two data types: "struct vm_struct" and "struct vm_stack" that have the same local variable names: vm_stack, or vm, or s, which makes the code confusing to read. Change the code so the naming is consistent: struct vm_struct is always called vm_area struct vm_stack is always called vm_stack One change altering vfree(vm_stack) to vfree(vm_area->addr) may look like a semantic change but it is not: vm_area->addr points to the vm_stack. This was done to improve readability. [linus.walleij@linaro.org: rebased and added new users of the variable names, address review comments] Link: https://lore.kernel.org/20240311164638.2015063-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-2-e6c69dd356f2@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: filter zone device pages returned from folio_walk_start()Alistair Popple
Previously dax pages were skipped by the pagewalk code as pud_special() or vm_normal_page{_pmd}() would be false for DAX pages. Now that dax pages are refcounted normally that is no longer the case, so the pagewalk code will start returning them. Most callers already explicitly filter for DAX or zone device pages so don't need updating. However some don't, so add checks to those callers. Link: https://lkml.kernel.org/r/4ecb7b357fc5b435588024770b88bbb695c30090.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: use folio_expected_ref_count() helper for reference countingShivank Garg
Replace open-coded folio reference count calculations with the folio_expected_ref_count(). No functional changes intended. Link: https://lkml.kernel.org/r/20250611052706.515408-2-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09Revert "sched/numa: add statistics of numa balance task"Chen Yu
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382. This commit introduces per-memcg/task NUMA balance statistics, but unfortunately it introduced a NULL pointer exception due to the following race condition: After a swap task candidate was chosen, its mm_struct pointer was set to NULL due to task exit. Later, when performing the actual task swapping, the p->mm caused the problem. CPU0 CPU1 : ... task_numa_migrate task_numa_find_cpu task_numa_compare # a normal task p is chosen env->best_task = p # p exit: exit_signals(p); p->flags |= PF_EXITING exit_mm p->mm = NULL; migrate_swap_stop __migrate_swap_task((arg->src_task, arg->dst_cpu) count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL task_lock() should be held and the PF_EXITING flag needs to be checked to prevent this from happening. After discussion, the conclusion was that adding a lock is not worthwhile for some statistics calculations. Revert the change and rely on the tracepoint for this purpose. Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Reported-by: Jirka Hladky <jhladky@redhat.com> Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/ Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com> Reported-by: Suneeth D <Suneeth.D@amd.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Libo Chen <libo.chen@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09fgraph: Make pid_str size match the commentArtem Sadovnikov
The comment above buffer mentions sign, 10 bytes width for number and null terminator, but buffer itself isn't large enough to hold that much data. This is a cosmetic change, since PID cannot be negative, other than -1. Found by Linux Verification Center (linuxtesting.org) with SVACE. Link: https://lore.kernel.org/20250617152110.2530-1-a.sadovnikov@ispras.ru Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09tracing: ring_buffer: Rewind persistent ring buffer on rebootMasami Hiramatsu (Google)
Rewind persistent ring buffer pages which have been read in the previous boot. Those pages are highly possible to be lost before writing it to the disk if the previous kernel crashed. In this case, the trace data is kept on the persistent ring buffer, but it can not be read because its commit size has been reset after read. This skips clearing the commit size of each sub-buffer and recover it after reboot. Note: If you read the previous boot data via trace_pipe, that is not accessible in that time. But reboot without clearing (or reusing) the read data, the read data is recovered again in the next boot. Thus, when you read the previous boot data, clear it by `echo > trace`. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/174899582116.955054.773265393511190051.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Allow to configure the number of per-task monitorNam Cao
Now that there are 2 monitors for real-time applications, users may want to enable both of them simultaneously. Make the number of per-task monitor configurable. Default it to 2 for now. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/93e83313fc4ba7f6e66f4abe80ca5f5494d658d0.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Add rtapp_sleep monitorNam Cao
Add a monitor for checking that real-time tasks do not go to sleep in a manner that may cause undesirable latency. Also change RV depends on TRACING to RV select TRACING to avoid the following recursive dependency: error: recursive dependency detected! symbol TRACING is selected by PREEMPTIRQ_TRACEPOINTS symbol PREEMPTIRQ_TRACEPOINTS depends on TRACE_IRQFLAGS symbol TRACE_IRQFLAGS is selected by RV_MON_SLEEP symbol RV_MON_SLEEP depends on RV symbol RV depends on TRACING Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/75bc5bcc741d153aa279c95faf778dff35c5c8ad.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Add rtapp_pagefault monitorNam Cao
Userspace real-time applications may have design flaws that they raise page faults in real-time threads, and thus have unexpected latencies. Add an linear temporal logic monitor to detect this scenario. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/78fea8a2de6d058241d3c6502c1a92910772b0ed.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>