2025-08-02  khugepaged: optimize collapse_pte_mapped_thp() by PTE batching  (Dev Jain)
Use PTE batching to batch process PTEs mapping the same large folio. An improvement is expected due to batching mapcount manipulation on the folios, and for arm64 which supports contig mappings, the number of TLB flushes is also reduced. Note that we do not need to change the check "if (folio_page(folio, i) != page)": if the i'th page of the folio is equal to the first page of our batch, then pages i + 1, ..., i + nr_batch_ptes - 1 of the folio will be equal to the corresponding pages of our batch, since the batch maps consecutive pages. Link: https://lkml.kernel.org/r/20250724052301.23844-4-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  khugepaged: optimize __collapse_huge_page_copy_succeeded() by PTE batching  (Dev Jain)
Use PTE batching to batch process PTEs mapping the same large folio. An improvement is expected due to batching refcount-mapcount manipulation on the folios, and for arm64 which supports contig mappings, the number of TLB flushes is also reduced. Link: https://lkml.kernel.org/r/20250724052301.23844-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
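For illustration, the batching pattern both khugepaged patches build on looks roughly like this (a sketch only: folio_pte_batch()'s exact parameter list varies across kernel versions, and the surrounding variable names are assumed):

    /* Determine how many consecutive ptes map consecutive pages of
     * the same large folio, then do the refcount/mapcount bookkeeping
     * once for the whole batch instead of once per page. */
    int max_nr = (end - addr) >> PAGE_SHIFT;
    int nr = folio_pte_batch(folio, pte, pteval, max_nr);  /* assumed signature */

    folio_remove_rmap_ptes(folio, page, nr, vma);  /* one rmap update */
    folio_ref_sub(folio, nr);                      /* one batched ref drop */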
2025-08-02  mm: add get_and_clear_ptes() and clear_ptes()  (David Hildenbrand)
Patch series "Optimizations for khugepaged", v4. If the underlying folio mapped by the ptes is large, we can process those ptes in a batch using folio_pte_batch(). For arm64 specifically, this results in a 16x reduction in the number of ptep_get() calls, since on a contig block, ptep_get() on arm64 will iterate through all 16 entries to collect a/d bits. Next, ptep_clear() will cause a TLBI for every contig block in the range via contpte_try_unfold(). Instead, use clear_ptes() to only do the TLBI at the first and last contig block of the range. For split folios, there will be no pte batching; the batch size returned by folio_pte_batch() will be 1. For pagetable split folios, the ptes will still point to the same large folio; for arm64, this results in the optimization described above, and for other arches, a minor improvement is expected due to a reduction in the number of function calls and batching atomic operations. This patch (of 3): Let's add variants to be used where "full" does not apply -- which will be the majority of cases in the future. "full" really only applies if we are about to tear down a full MM. Use get_and_clear_ptes() in existing code, clear_ptes() users will be added next. Link: https://lkml.kernel.org/r/20250724052301.23844-2-dev.jain@arm.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  mm/mincore: hold PTL in mincore_hugetlb  (Jinjiang Tu)
Hold the PTL in mincore_hugetlb() to avoid operating on a stale page, as mincore_pte_range() does. Link: https://lkml.kernel.org/r/20250724090958.455887-4-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brahmajit Das <brahmajit.xyz@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joern Engel <joern@logfs.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  mm/memory-failure: hold PTL in hwpoison_hugetlb_range  (Jinjiang Tu)
Hold the PTL in hwpoison_hugetlb_range() to avoid operating on a stale page, as hwpoison_pte_range() does. This change is not known to address any issues which users have experienced. Link: https://lkml.kernel.org/r/20250725033112.2690158-1-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brahmajit Das <brahmajit.xyz@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joern Engel <joern@logfs.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
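Both fixes follow the same shape; a sketch of what taking the PTL in a hugetlb walker callback looks like (details elided, helper names from the existing hugetlb API; not the literal patch):

    static int mincore_hugetlb(pte_t *pte, unsigned long hmask,
            unsigned long addr, unsigned long end, struct mm_walk *walk)
    {
        spinlock_t *ptl;
        pte_t entry;

        ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
        entry = huge_ptep_get(walk->mm, addr, pte);
        /* ... inspect the entry; it cannot go stale while the PTL is held ... */
        spin_unlock(ptl);
        return 0;
    }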
2025-08-02  mm/mseal: rework mseal apply logic  (Lorenzo Stoakes)
The logic can be simplified - firstly by renaming the inconsistently named apply_mm_seal() to mseal_apply(). We then fold mseal_fixup() into the main loop, as the logic is simple enough not to require it; equally, it isn't a hugely pleasant pattern in mprotect() etc., so it's not something we want to perpetuate. We eliminate the need for invoking vma_iter_end() on each loop by directly determining if the VMA was merged - the only thing we need concern ourselves with is whether the start/end of the (gapless) range are offset into VMAs. This refactoring also avoids the rather horrid 'pass pointer to prev around' pattern used in mprotect() et al. No functional change intended. Link: https://lkml.kernel.org/r/ddfa4376ce29f19a589d7dc8c92cb7d4f7605a4c.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Xu <jeffxu@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  mm/mseal: simplify and rename VMA gap check  (Lorenzo Stoakes)
The check_mm_seal() function is doing something general - checking whether a range contains only VMAs (or rather that it does NOT contain any unmapped regions). So rename this function to range_contains_unmapped(). Additionally simplify the logic: we simply check whether each VMA starts where the previous one ended and whether the final vma->vm_end reaches the end parameter. This check is rather dubious, so it is sensible to keep it local to mm/mseal.c, as at a later stage it may be removed, and we don't want any other mm code to perform such a check. No functional change intended. [lorenzo.stoakes@oracle.com: add comment explaining why we disallow gaps on mseal()] Link: https://lkml.kernel.org/r/d85b3d55-09dc-43ba-8204-b48267a96751@lucifer.local Link: https://lkml.kernel.org/r/dd50984eff1e242b5f7f0f070a3360ef760e06b8.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
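A sketch of what such a gap check boils down to (illustrative only, not the literal mm/mseal.c code):

    /* Does [start, end) contain any range not covered by a VMA? */
    static bool range_contains_unmapped(struct mm_struct *mm,
            unsigned long start, unsigned long end)
    {
        struct vm_area_struct *vma;
        unsigned long prev_end = start;
        VMA_ITERATOR(vmi, mm, start);

        for_each_vma_range(vmi, vma, end) {
            if (vma->vm_start > prev_end)
                return true;            /* gap before this VMA */
            prev_end = vma->vm_end;
        }
        return prev_end < end;              /* gap at the tail */
    }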
2025-08-02  mm/mseal: small cleanups  (Lorenzo Stoakes)
Drop the wholly unnecessary set_vma_sealed() helper, which is used only once, and place VMA_ITERATOR() declarations in the correct place. Retain vma_is_sealed(), and use it instead of the confusingly named can_modify_vma(), so it's abundantly clear what's being tested, rather than a nebulous sense of 'can the VMA be modified'. No functional change intended. Link: https://lkml.kernel.org/r/98cf28d04583d632a6eb698e9ad23733bb6af26b.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Xu <jeffxu@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  mm/mseal: update madvise() logic  (Lorenzo Stoakes)
The madvise() logic is inexplicably performed in mm/mseal.c - this ought to be located in mm/madvise.c. Additionally can_modify_vma_madv() is inconsistently named and, in combination with is_ro_anon(), is very confusing logic. Put a static function in mm/madvise.c instead - can_madvise_modify() - that spells out exactly what's happening. Also explicitly check for an anon VMA. Also add commentary to explain what's going on. Essentially, we disallow discarding of data in mseal()'d mappings in instances where the user couldn't otherwise write to that data. We retain the existing behaviour here regarding MAP_PRIVATE mappings of file-backed mappings, which entails some complexity: while this, strictly speaking, appears to violate mseal() semantics, it may interact badly with users which expect to be able to madvise(MADV_DONTNEED) .text mappings, for instance. We may revisit this at a later date. No functional change intended. Link: https://lkml.kernel.org/r/492a98d9189646e92c8f23f4cce41ed323fe01df.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  mm/mseal: always define VM_SEALED  (Lorenzo Stoakes)
Patch series "mseal cleanups", v4. Perform a number of cleanups to the mseal logic. Firstly, VM_SEALED is treated differently from every other VMA flag, it really doesn't make sense to do this, so we start by making this consistent with everything else. Next we place the madvise logic where it belongs - in mm/madvise.c. It really makes no sense to abstract this elsewhere. In doing so, we go to great lengths to explain very clearly the previously very confusing logic as to what sealed mappings are impacted here. In doing so, we retain existing logic regarding treatment of madvise() discard operations for a sealed, read-only MAP_PRIVATE file-backed mapping. This is something we likely need to revisit. We then abstract out and explain the 'are there are any gaps in this range in the mm?' check being performed as a prerequisite to mseal being performed. Finally, we simplify the actual mseal logic which is really quite straightforward. No functional change is intended. This patch (of 4): There is no reason to treat VM_SEALED in a special way, in each other case in which a VMA flag is unavailable due to configuration, we simply assign that flag to VM_NONE, so make VM_SEALED consistent with all other VMA flags in this respect. Additionally, use the next available bit for VM_SEALED, 42, rather than arbitrarily putting it at 63 and update the declaration to match all other VMA flags. No functional change intended. Link: https://lkml.kernel.org/r/cover.1753431105.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/aeb398a77029b6e7377cd944328bc9bbc3c90537.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  mm/damon/vaddr: skip isolating folios already in destination nid  (Bijan Tabatabai)
damos_va_migrate_dests_add() determines the node a folio should be in based on the struct damos_migrate_dests associated with the migration scheme and adds the folio to the linked list corresponding to that node so it can be migrated later. Currently, folios are isolated and added to the list even if they are already in the node they should be in. In using damon weighted interleave more, I've found that the overhead of needlessly adding these folios to the migration lists can be quite high. The overhead comes from isolating folios and placing them in the migration lists inside of damos_va_migrate_dests_add(), as well as the cost of handling those folios in damon_migrate_pages(). This patch eliminates that overhead by simply avoiding the addition of folios that are already in their intended location to the migration list.

To show the benefit of this patch, we start the test workload and start a DAMON instance attached to that workload with a migrate_hot scheme that has one dest field sending data to the local node. This way, we are only measuring the overheads of the scheme, and not the cost of migrating pages, since data will be allocated to the local node by default. I tested with two workloads: the embedding reduction workload used in [1] and a microbenchmark that allocates 20GB of data then sleeps, which is similar to the memory usage of the embedding reduction workload.

The time taken in damos_va_migrate_dests_add() and damon_migrate_pages() each aggregation interval is shown below.

Before this patch:
                          damos_va_migrate_dests_add    damon_migrate_pages
    microbenchmark        ~2ms                          ~3ms
    embedding reduction   ~1s                           ~3s

After this patch:
                          damos_va_migrate_dests_add    damon_migrate_pages
    microbenchmark        0us                           ~40us
    embedding reduction   0us                           ~100us

I did not do an in depth analysis for why things are much slower in the embedding reduction workload than the microbenchmark. However, I assume it's because the embedding reduction workload oversaturates the bandwidth of the local memory node, increasing the memory access latency, and in turn making the pointer chasing involved in iterating through a linked list much slower. Regardless of that, this patch results in a significant speedup.

[1] https://lore.kernel.org/damon/20250709005952.17776-1-bijan311@gmail.com/

Link: https://lkml.kernel.org/r/20250725163300.4602-1-bijan311@gmail.com Fixes: 19c1dc15c859 ("mm/damon/vaddr: use damos->migrate_dests in migrate_{hot,cold}") Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
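The heart of the fix is a cheap early test before any isolation work; schematically (variable names and return convention assumed):

    /* A folio already resident on the destination node needs no
     * migration: isolating it and walking it through the migration
     * list is pure overhead, so leave it in place. */
    if (folio_nid(folio) == dest_nid)
        return 0;

    /* ... otherwise isolate the folio and queue it for migration ... */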
2025-08-02  selftests: cachestat: add tests for mmap, refactor and enhance mmap test for cachestat validation  (Suresh K C)
Add a cohesive test case that verifies cachestat behavior with memory-mapped files using mmap(). Also refactor the test logic to reduce redundancy, improve error reporting, and clarify failure messages for both shmem and mmap file types. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20250709174657.6916-1-suresh.k.chandrappa@gmail.com Signed-off-by: Suresh K C <suresh.k.chandrappa@gmail.com> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Tested-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  mm: add process info to bad rss-counter warning  (Xuanye Liu)
Enhance the debugging information in check_mm() by including the process name and PID when reporting bad rss-counter states. This helps identify which process is associated with the memory accounting issue. Link: https://lkml.kernel.org/r/20250723100901.1909683-1-liuqiye2025@163.com Signed-off-by: Xuanye Liu <liuqiye2025@163.com> Acked-by: SeongJae Park <sj@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
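Schematically, the check_mm() warning simply gains the task identity (a sketch; assuming current is the task tearing down the mm):

    pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld, comm:%s, pid:%d\n",
             mm, resident_page_types[i], x,
             current->comm, task_pid_nr(current));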
2025-08-02  kasan: skip quarantine if object is still accessible under RCU  (Jann Horn)
Currently, enabling KASAN masks bugs where a lockless lookup path gets a pointer to a SLAB_TYPESAFE_BY_RCU object that might concurrently be recycled and is insufficiently careful about handling recycled objects: KASAN puts freed objects in SLAB_TYPESAFE_BY_RCU slabs onto its quarantine queues, even when it can't actually detect UAF in these objects, and the quarantine prevents fast recycling. When I introduced CONFIG_SLUB_RCU_DEBUG, my intention was that enabling CONFIG_SLUB_RCU_DEBUG should cause KASAN to mark such objects as freed after an RCU grace period and put them on the quarantine, while disabling CONFIG_SLUB_RCU_DEBUG should allow such objects to be reused immediately; but that hasn't actually been working. I discovered such a UAF bug involving SLAB_TYPESAFE_BY_RCU yesterday; I could only trigger this bug in a KASAN build by disabling CONFIG_SLUB_RCU_DEBUG and applying this patch. Link: https://lkml.kernel.org/r/20250723-kasan-tsbrcu-noquarantine-v1-1-846c8645976c@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
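Conceptually, the quarantine decision now looks something like this hypothetical predicate (illustrative, not the literal patch):

    /* Quarantining a SLAB_TYPESAFE_BY_RCU object only buys anything
     * when CONFIG_SLUB_RCU_DEBUG defers its poisoning past an RCU
     * grace period; otherwise let it recycle immediately so lockless
     * lookup bugs stay reproducible instead of being masked. */
    static bool want_quarantine(struct kmem_cache *cache)
    {
        if ((cache->flags & SLAB_TYPESAFE_BY_RCU) &&
            !IS_ENABLED(CONFIG_SLUB_RCU_DEBUG))
            return false;
        return true;
    }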
2025-08-02  mm/page-flags: remove folio_start_writeback_keepwrite()  (Joanne Koong)
Commit cd57b77197a4 ("ext4: Convert ext4_bio_write_page() to use a folio") removed set_page_writeback_keepwrite() which was the last/only caller of folio_start_writeback_keepwrite(). Link: https://lkml.kernel.org/r/20250722182230.2114587-1-joannelkoong@gmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  selftests/mm: add process_madvise() tests  (wang lian)
Add tests for process_madvise(), focusing on verifying behavior under various conditions including valid usage and error cases. [lianux.mm@gmail.com: v7] Link: https://lkml.kernel.org/r/20250729113109.12272-1-lianux.mm@gmail.com Link: https://lkml.kernel.org/r/20250721114614.40996-1-lianux.mm@gmail.com Signed-off-by: wang lian <lianux.mm@gmail.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Zi Yan <ziy@nvidia.com> Suggested-by: Mark Brown <broonie@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Tested-by: Zi Yan <ziy@nvidia.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  mm: shmem: fix the shmem large folio allocation for the i915 driver  (Baolin Wang)
After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"), we extend the 'huge=' option to allow any sized large folios for tmpfs, which means tmpfs will allow getting a highest order hint based on the size of write() and fallocate() paths, and then will try each allowable large order. However, when the i915 driver allocates shmem memory, it doesn't provide hint information about the size of the large folio to be allocated, resulting in the inability to allocate PMD-sized shmem, which in turn affects GPU performance. Patryk added: : In my tests, the performance drop ranges from a few percent up to 13% : in Unigine Superposition under heavy memory usage on the CPU Core Ultra : 155H with the Xe 128 EU GPU. Other users have reported performance : impact up to 30% on certain workloads. Please find more in the : regressions reports: : https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14645 : https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13845 : : I believe the change should be backported to all active kernel branches : after version 6.12. To fix this issue, we can use the inode's size as a write size hint in shmem_read_folio_gfp() to help allocate PMD-sized large folios. Link: https://lkml.kernel.org/r/f7e64e99a3a87a8144cc6b2f1dddf7a89c12ce44.1753926601.git.baolin.wang@linux.alibaba.com Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Patryk Kowalczyk <patryk@kowalczyk.ws> Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Patryk Kowalczyk <patryk@kowalczyk.ws> Suggested-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
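The idea of the fix, sketched (assuming the internal helper grew a write-end hint parameter in the tmpfs large folio work; the exact signature may differ):

    /* In shmem_read_folio_gfp(): pass the inode size as the write-end
     * hint so the allocation path may pick a PMD-sized order, just as
     * the write() and fallocate() paths do. */
    error = shmem_get_folio_gfp(inode, index, i_size_read(inode),
                                &folio, SGP_CACHE, gfp, NULL, NULL);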
2025-08-02  tools/getdelays: add backward compatibility for taskstats version  (Fan Yu)
Add version checks to print_delayacct() to handle differences in struct taskstats across kernel versions. Field availability depends on taskstats version (t->version), corresponding to TASKSTATS_VERSION in kernel headers (see include/uapi/linux/taskstats.h).

Version feature mapping:
- version >= 11 - supports COMPACT statistics
- version >= 13 - supports WPCOPY statistics
- version >= 14 - supports IRQ statistics
- version >= 16 - supports *_max and *_min delay statistics

This ensures the tool works correctly with both older and newer kernel versions by conditionally printing fields based on the reported version.

eg.1
bash# grep -r "#define TASKSTATS_VERSION" /usr/include/linux/taskstats.h
"#define TASKSTATS_VERSION 10"
bash# ./getdelays -d -p 1
CPU         count     real total     virtual total    delay total    delay average
             7481     3786181709     3807098291       36393725       0.005ms
IO          count     delay total    delay average
              369     1116046035     3.025ms
SWAP        count     delay total    delay average
                0     0              0.000ms
RECLAIM     count     delay total    delay average
                0     0              0.000ms
THRASHING   count     delay total    delay average
                0     0              0.000ms

eg.2
bash# grep -r "#define TASKSTATS_VERSION" /usr/include/linux/taskstats.h
"#define TASKSTATS_VERSION 14"
bash# ./getdelays -d -p 1
CPU         count     real total     virtual total    delay total    delay average
            68862     163474790046   174584722267     19962496806    0.290ms
IO          count     delay total    delay average
                0     0              0.000ms
SWAP        count     delay total    delay average
                0     0              0.000ms
RECLAIM     count     delay total    delay average
                0     0              0.000ms
THRASHING   count     delay total    delay average
                0     0              0.000ms
COMPACT     count     delay total    delay average
                0     0              0.000ms
WPCOPY      count     delay total    delay average
                0     0              0.000ms
IRQ         count     delay total    delay average
                0     0              0.000ms

Link: https://lkml.kernel.org/r/20250731225326549CttJ7g9NfjTlaqBwl015T@zte.com.cn Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
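The gating itself is a plain version comparison against each field's introduction point; a sketch (taskstats field names from include/uapi/linux/taskstats.h):

    static void print_versioned_delays(const struct taskstats *t)
    {
        if (t->version >= 11)   /* COMPACT exists since v11 */
            printf("COMPACT count=%llu delay=%llu\n",
                   (unsigned long long)t->compact_count,
                   (unsigned long long)t->compact_delay_total);
        if (t->version >= 13)   /* WPCOPY exists since v13 */
            printf("WPCOPY  count=%llu delay=%llu\n",
                   (unsigned long long)t->wpcopy_count,
                   (unsigned long long)t->wpcopy_delay_total);
        if (t->version >= 14)   /* IRQ exists since v14 */
            printf("IRQ     count=%llu delay=%llu\n",
                   (unsigned long long)t->irq_count,
                   (unsigned long long)t->irq_delay_total);
    }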
2025-08-02  kho: add test for kexec handover  (Mike Rapoport (Microsoft))
Testing kexec handover requires a kernel driver that will generate some data and preserve it with KHO on the first boot, and then restore that data and verify it was preserved properly after kexec. To facilitate such a test, along with the kernel driver responsible for data generation, preservation and restoration, add a script that runs a kernel in a VM with a minimal /init. The /init enables KHO, loads a kernel image for kexec and runs kexec reboot. After the boot of the kexeced kernel, the driver verifies that the data was properly preserved. [rppt@kernel.org: fix section mismatch] Link: https://lkml.kernel.org/r/aIiRC8fXiOXKbPM_@kernel.org Link: https://lkml.kernel.org/r/20250727083733.2590139-1-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  delaytop: enhance error logging and add PSI feature description  (fan.yu9@zte.com.cn)
This patch improves error diagnostics and documentation for delaytop:

1) Enhanced error logging:
   - Added explicit error messages in critical failure paths
   - Implemented BOOL_FPRINT macro for robust output handling

2) PSI feature documentation:
   - Updated header comment to reflect PSI monitoring capability
   - Improved output formatting for PSI information

System Pressure Information: (avg10/avg60/avg300/total)
CPU some:      0.0%/ 0.0%/ 0.0%/ 345(ms)
CPU full:      0.0%/ 0.0%/ 0.0%/ 0(ms)
Memory full:   0.0%/ 0.0%/ 0.0%/ 0(ms)
Memory some:   0.0%/ 0.0%/ 0.0%/ 0(ms)
IO full:       0.0%/ 0.0%/ 0.0%/ 65(ms)
IO some:       0.0%/ 0.0%/ 0.0%/ 79(ms)
IRQ full:      0.0%/ 0.0%/ 0.0%/ 0(ms)

Link: https://lkml.kernel.org/r/202507281628341752gMXCMN7S-Vz_LHYHum9r@zte.com.cn Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Acked-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
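One plausible shape for a BOOL_FPRINT-style macro (the actual delaytop macro may differ; shown to make the 'robust output handling' point concrete):

    /* Propagate fprintf() failure to the calling function rather than
     * silently dropping output errors. Only usable in functions that
     * return bool. */
    #define BOOL_FPRINT(fp, fmt, ...)                       \
        do {                                                \
            if (fprintf((fp), fmt, ##__VA_ARGS__) < 0)      \
                return false;                               \
        } while (0)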
2025-08-02  samples: Kconfig: fix spelling mistake "instancess" -> "instances"  (Colin Ian King)
There is a spelling mistake in the SAMPLE_TRACE_ARRAY config. Fix it. Link: https://lkml.kernel.org/r/20250724111715.141826-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  fat: fix too many log in fat_chain_add()  (OGAWA Hirofumi)
This log was excessive for a serial console. So use the ratelimited version instead. Link: https://lkml.kernel.org/r/87qzy611d9.fsf@mail.parknet.co.jp Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: syzbot+fa7ef54f66c189c04b73@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=fa7ef54f66c189c04b73 Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  scripts/spelling.txt: add notifer||notifier to spelling.txt  (WangYuli)
This typo was not listed in scripts/spelling.txt, thus it was more difficult to detect. Add it for convenience. Link: https://lkml.kernel.org/r/02153C05ED7B49B7+20250722073431.21983-8-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  xen/xenbus: fix typo "notifer"  (WangYuli)
There is a spelling mistake of 'notifer' in the comment which should be 'notifier'. Link: https://lkml.kernel.org/r/C6633C66376C709A+20250722073431.21983-7-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  net: mvneta: fix typo "notifer"  (WangYuli)
There is a spelling mistake of 'notifer' in the comment which should be 'notifier'. Link: https://lkml.kernel.org/r/0CB4300CB6F49007+20250722073431.21983-4-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  drm/xe: fix typo "notifer"  (WangYuli)
There is a spelling mistake of 'notifer' in the comment which should be 'notifier'. Link: https://lkml.kernel.org/r/94190C5F54A19F3E+20250722073431.21983-3-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  cxl: mce: fix typo "notifer"  (WangYuli)
According to the context, "mce_notifer" should be "mce_notifier". Link: https://lkml.kernel.org/r/E1EB1BA9FDF07D53+20250722073431.21983-2-wangyuli@uniontech.com Fixes: 516e5bd0b6bf ("cxl: Add mce notifier to emit aliased address for extended linear cache") Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  KVM: x86: fix typo "notifer"  (WangYuli)
Patch series "treewide: Fix typo "notifer"", v3. There are some spelling mistakes of 'notifer' in comments which should be 'notifier'. Fix them and add it to scripts/spelling.txt. This patch (of 8): There are some spelling mistakes of 'notifer' which should be 'notifier'. Link: https://lkml.kernel.org/r/576F0D85F6853074+20250722072734.19367-1-wangyuli@uniontech.com Link: https://lkml.kernel.org/r/7F05778C3A1A9F8B+20250722073431.21983-1-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  MAINTAINERS: add maintainers for delaytop  (Wang Yaxin)
The delaytop tool supports showing system delays and task-level delays, effectively identifying the top-n tasks with high latency in the system, which is highly beneficial for improving system performance. Wang Yaxin and her colleague Fan Yu focus on locating system delay issues. To promote the thriving development of delaytop, we hope to serve as maintainers to continuously improve it, aiming to provide a more effective solution for system latency issues in the future. Link: https://lkml.kernel.org/r/20250721094049958ImB8XG_imntcPqpQn1KfG@zte.com.cn Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()  (Uros Bizjak)
Use atomic_long_try_cmpxchg() instead of atomic_long_cmpxchg(*ptr, old, new) == old in atomic_long_inc_below(). The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after cmpxchg (and the related move instruction in front of cmpxchg). Also, atomic_long_try_cmpxchg() implicitly assigns the old *ptr value to "old" when cmpxchg fails, enabling further code simplifications. No functional change intended. Link: https://lkml.kernel.org/r/20250721174610.28361-2-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: MengEn Sun <mengensun@tencent.com> Cc: "Thomas Weißschuh" <linux@weissschuh.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
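The transformation is the standard cmpxchg-loop conversion; schematically, for an increment-if-below-u primitive:

    /* Before: compare the returned old value explicitly */
    long c = atomic_long_read(v);
    for (;;) {
        long old;

        if (unlikely(c >= u))
            return false;
        old = atomic_long_cmpxchg(v, c, c + 1);
        if (likely(old == c))
            return true;
        c = old;        /* retry with the value cmpxchg observed */
    }

    /* After: try_cmpxchg reloads 'c' on failure and returns success */
    long c = atomic_long_read(v);
    do {
        if (unlikely(c >= u))
            return false;
    } while (!atomic_long_try_cmpxchg(v, &c, c + 1));
    return true;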
2025-08-02  ucount: fix atomic_long_inc_below() argument type  (Uros Bizjak)
The type of the u argument of atomic_long_inc_below() should be long to avoid unwanted truncation to int. The patch fixes the wrong argument type of an internal function to prevent unwanted argument truncation. It fixes an internal locking primitive; it should not have any direct effect on userspace.

Mark said:

: AFAICT there's no problem in practice because atomic_long_inc_below()
: is only used by inc_ucount(), and it looks like the value is
: constrained between 0 and INT_MAX.
:
: In inc_ucount() the limit value is taken from
: user_namespace::ucount_max[], and AFAICT that's only written by
: sysctls, to the table setup by setup_userns_sysctls(), where
: UCOUNT_ENTRY() limits the value between 0 and INT_MAX.
:
: This is certainly a cleanup, but there might be no functional issue in
: practice as above.

Link: https://lkml.kernel.org/r/20250721174610.28361-1-ubizjak@gmail.com Fixes: f9c82a4ea89c ("Increase size of ucounts to atomic_long_t") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: MengEn Sun <mengensun@tencent.com> Cc: "Thomas Weißschuh" <linux@weissschuh.net> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
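The fix itself is a one-word signature change, roughly (qualifiers assumed):

    /* before: a long limit passed by callers is truncated to int */
    static bool atomic_long_inc_below(atomic_long_t *v, int u);

    /* after: compare against the full long value */
    static bool atomic_long_inc_below(atomic_long_t *v, long u);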
2025-08-02  kexec: enable CMA based contiguous allocation  (Alexander Graf)
When booting a new kernel with kexec_file, the kernel picks a target location that the kernel should live at, then allocates random pages, checks whether any of those pages magically happens to coincide with a target address range and if so, uses them for that range. For every page allocated this way, it then creates a page list that the relocation code - code that executes while all CPUs are off and we are just about to jump into the new kernel - copies to their final memory location. We cannot put them there before, because chances are pretty good that at least some page in the target range is already in use by the currently running Linux environment. Copying is happening from a single CPU at RAM rate, which takes around 4-50 ms per 100 MiB.

All of this is inefficient and error prone. To successfully kexec, we need to quiesce all devices of the outgoing kernel so they don't scribble over the new kernel's memory. We have seen cases where that does not happen properly (*cough* GIC *cough*) and hence the new kernel was corrupted. This started a month long journey to root cause failing kexecs to eventually see memory corruption, because the new kernel was corrupted severely enough that it could not emit output to tell us about the fact that it was corrupted. By allocating memory for the next kernel from a memory range that is guaranteed scribbling free, we can boot the next kernel up to a point where it is at least able to detect corruption and maybe even stop it before it becomes severe. This increases the chance for successful kexecs.

Since kexec got introduced, Linux has gained the CMA framework which can perform physically contiguous memory mappings, while keeping that memory available for movable memory when it is not needed for contiguous allocations. The default CMA allocator is for DMA allocations. This patch adds logic to the kexec file loader to attempt to place the target payload at a location allocated from CMA. If successful, it uses that memory range directly instead of creating copy instructions during the hot phase. To ensure that there is a safety net in case anything goes wrong with the CMA allocation, it also adds a flag for user space to force disable CMA allocations.

Using CMA allocations has two advantages:
1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the hot phase.
2) More robust. Even if by accident some page is still in use for DMA, the new kernel image will be safe from that access because it resides in a memory region that is considered allocated in the old kernel and has a chance to reinitialize that component.

Link: https://lkml.kernel.org/r/20250610085327.51817-1-graf@amazon.com Signed-off-by: Alexander Graf <graf@amazon.com> Acked-by: Baoquan He <bhe@redhat.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Zhongkun He <hezhongkun.hzk@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
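A sketch of the allocation-side idea (the CMA area used, the error handling and the naming are assumptions; the real loader also honours the new user-space opt-out flag):

    /* Try to satisfy a kexec segment from CMA. cma_alloc() returns
     * memory that is guaranteed free of live data, so the payload can
     * be placed at its final location up front and the hot-phase copy
     * loop is skipped for this segment. */
    struct page *page;

    page = cma_alloc(cma_area, nr_pages, 0 /* align order */, true /* no_warn */);
    if (!page)
        return -ENOMEM; /* caller falls back to scattered pages + copy */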
2025-08-02  stackdepot: make max number of pools boot-time configurable  (Matt Fleming)
We're hitting the WARN in depot_init_pool() about reaching the stack depot limit because we have long stacks that don't dedup very well. Introduce a new start-up parameter to allow users to set the maximum number of stack depot pools. Link: https://lkml.kernel.org/r/20250718153928.94229-1-matt@readmodwrite.com Signed-off-by: Matt Fleming <mfleming@cloudflare.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
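Plumbing-wise this is a standard boot parameter; a sketch under the assumption the parameter is spelled stack_depot_max_pools (the real name in the patch may differ):

    static unsigned int stack_max_pools = DEPOT_MAX_POOLS; /* old fixed limit as default */

    static int __init parse_max_pools(char *str)
    {
        return kstrtouint(str, 0, &stack_max_pools);
    }
    early_param("stack_depot_max_pools", parse_max_pools);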
2025-08-02  lib/xxhash: remove unused functions  (Dr. David Alan Gilbert)
xxh32_digest() and xxh32_update() were added in 2017 in the original xxhash commit, but have remained unused. Remove them. Link: https://lkml.kernel.org/r/20250716133245.243363-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Dave Gilbert <linux@treblig.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  init/Kconfig: restore CONFIG_BROKEN help text  (Andrew Morton)
Linus added it in 2003; it was later removed. Put it back. Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  mm/shmem, swap: improve cached mTHP handling and fix potential hang  (Kairui Song)
The current swap-in code assumes that, when a swap entry in the shmem mapping is order 0, its cached folios (if present) must be order 0 too, which turns out not always to be correct.

The problem is that shmem_split_large_entry is called before verifying the folio will eventually be swapped in. One possible race is:

CPU1                                          CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
folio = swap_cache_get_folio
/* folio = NULL */
order = xa_get_order
/* order > 0 */
folio = shmem_swap_alloc_folio
/* mTHP alloc failure, folio = NULL */
<... Interrupted ...>
                                              shmem_swapin_folio
                                              /* S1 is swapped in */
                                              shmem_writeout
                                              /* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */

Now any following swapin of S1 will hang: `xa_get_order` returns 0, and folio lookup will return a folio with order > 0. The `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check will always return false, causing swap-in to return -EEXIST. And this looks fragile.

So fix this up by allowing a larger folio to be seen in the swap cache, and check that the whole shmem mapping range covered by the swapin has the right swap value upon inserting the folio. And drop the redundant tree walks before the insertion.

This will actually improve performance, as it avoids two redundant Xarray tree walks in the hot path, and the only side effect is that in the failure path, shmem may redundantly reallocate a few folios, causing temporary slight memory pressure.

And worth noting, it may seem that the order and value check before inserting might help reduce the lock contention, which is not true. The swap cache layer ensures a raced swapin will either see a swap cache folio or fail to do a swapin (we have the SWAP_HAS_CACHE bit even if the swap cache is bypassed), so holding the folio lock and checking the folio flag is already good enough for avoiding the lock contention. The chance that a folio passes the swap entry value check but the shmem mapping slot has changed should be very low.

Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250728075306.12704-2-ryncsn@gmail.com Fixes: 809bc86517cc ("mm: shmem: support large folio swap out") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Dev Jain <dev.jain@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02  drm/bridge: fix OF node leak  (Johan Hovold)
Make sure to drop the OF node reference taken when creating the aux bridge device when the device is later released. Fixes: 6914968a0b52 ("drm/bridge: properly refcount DT nodes in aux bridge drivers") Cc: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Link: https://lore.kernel.org/r/20250708085124.15445-2-johan@kernel.org Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
2025-08-02  Merge tag 'fbdev-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev  (Linus Torvalds)
Pull fbdev updates from Helge Deller: "One potential buffer overflow fix in the framebuffer registration function, some fixes for the imxfb, nvidiafb and simplefb drivers, and a bunch of cleanups for fbcon, kyrofb and svgalib.

Framework fixes:
- fix potential buffer overflow in do_register_framebuffer() [Yongzhen Zhang]

Driver fixes:
- imxfb: prevent null-ptr-deref [Chenyuan Yang]
- nvidiafb: fix build on 32-bit ARCH=um [Johannes Berg]
- nvidiafb: add depends on HAS_IOPORT [Randy Dunlap]
- simplefb: Use of_reserved_mem_region_to_resource() for "memory-region" [Rob Herring]

Cleanups:
- fbcon: various code cleanups wrt blinking [Ville Syrjälä]
- kyrofb: Convert to devm_*() functions [Giovanni Di Santi]
- svgalib: Coding style cleanups [Darshan R.]
- Fix typo in Kconfig text for FB_DEVICE [Daniel Palmer]"

* tag 'fbdev-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  fbcon: Use 'bool' where appopriate
  fbcon: Introduce get_{fg,bg}_color()
  fbcon: fbcon_is_inactive() -> fbcon_is_active()
  fbcon: fbcon_cursor_noblink -> fbcon_cursor_blink
  fbdev: Fix typo in Kconfig text for FB_DEVICE
  fbdev: imxfb: Check fb_add_videomode to prevent null-ptr-deref
  fbdev: svgalib: Clean up coding style
  fbdev: kyro: Use devm_ioremap_wc() for screen mem
  fbdev: kyro: Use devm_ioremap() for mmio registers
  fbdev: kyro: Add missing PCI memory region request
  fbdev: simplefb: Use of_reserved_mem_region_to_resource() for "memory-region"
  fbdev: fix potential buffer overflow in do_register_framebuffer()
  fbdev: nvidiafb: add depends on HAS_IOPORT
  fbdev: nvidiafb: fix build on 32-bit ARCH=um
2025-08-02  Merge tag 'firewire-updates-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394  (Linus Torvalds)
Pull firewire updates from Takashi Sakamoto: "This update replaces the remaining tasklet usage in the FireWire subsystem with workqueue for asynchronous packet transmission. With this change, tasklets are now fully eliminated from the subsystem.

Asynchronous packet transmission is used for serial bus topology management as well as for the operation of the SBP-2 protocol driver (firewire-sbp2). To ensure reliability during low-memory conditions, the associated workqueue is created with the WQ_MEM_RECLAIM flag, allowing it to participate in memory reclaim paths. Other attributes are aligned with those used for isochronous packet handling, which was migrated to workqueues in v6.12. The workqueues are sleepable and support preemptible work items, making them more suitable for real-time workloads that benefit from timely task preemption at the system level.

There remains an issue where 'schedule()' may be called within an RCU read-side critical section, due to a direct replacement of 'tasklet_disable_in_atomic()' with 'disable_work_sync()'. A proposed fix for this has been posted[1], and is currently under review and testing. It is expected to be sent upstream later"

Link: https://lore.kernel.org/lkml/20250728015125.17825-1-o-takashi@sakamocchi.jp/ [1]

* tag 'firewire-updates-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: ohci: reduce the size of common context structure by extracting members into AT structure
  firewire: core: minor code refactoring to localize table of gap count
  firewire: ohci: use workqueue to handle events of AT request/response contexts
  firewire: ohci: use workqueue to handle events of AR request/response contexts
  firewire: core: allocate workqueue for AR/AT request/response contexts
  firewire: core: use from_work() macro to expand parent structure of work_struct
  firewire: ohci: use from_work() macro to expand parent structure of work_struct
  firewire: ohci: correct code comments about bus_reset tasklet
2025-08-02  bpf: Fix memory leak of bpf_scc_info objects  (Eduard Zingerman)
env->scc_info array contains references to bpf_scc_info objects allocated lazily in verifier.c:scc_visit_alloc(). env->scc_cnt was supposed to track env->scc_info array size in order to free referenced objects in verifier.c:free_states(). Fix initialization of env->scc_cnt that was omitted in verifier.c:compute_scc().

To reproduce the bug:
- build with CONFIG_DEBUG_KMEMLEAK
- boot and load bpf program with loops, e.g.:
    ./veristat -q pyperf180.bpf.o
- initiate memleak scan and check results:
    echo scan > /sys/kernel/debug/kmemleak
    cat /sys/kernel/debug/kmemleak

Fixes: c9e31900b54c ("bpf: propagate read/precision marks over state graph backedges") Reported-by: Jens Axboe <axboe@kernel.dk> Closes: https://lore.kernel.org/bpf/CAADnVQKXUWg9uRCPD5ebRXwN4dmBCRUFFM7kN=GxymYz3zU25A@mail.gmail.com/T/ Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250801232330.1800436-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-02  ALSA: usb-audio: Don't use printk_ratelimit for debug prints  (Sean Anderson)
printk_ratelimit is deprecated, since it shares state with all other printk sites. Additionally, the suppression message is printed at warning level even though the actual messages are printed at debug and are (usually) invisible! This can result in thousands of messages like "retire_capture_urb: 4992 callbacks suppressed" in the console, and can inhibit debugging since it is unclear what the source of the suppressed callbacks is. Switch to dev_dbg_ratelimited which doesn't print anything unless debug is enabled. Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Link: https://patch.msgid.link/20250801154710.739464-1-sean.anderson@linux.dev Signed-off-by: Takashi Iwai <tiwai@suse.de>
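The conversion pattern, schematically (device and message are illustrative):

    /* Before: global shared state; the "callbacks suppressed" notice
     * prints at warning level even when the dev_dbg itself is off. */
    if (printk_ratelimit())
        dev_dbg(&dev->dev, "frame active: %d\n", state);

    /* After: per-callsite ratelimit; fully silent unless dynamic
     * debug enables this exact site. */
    dev_dbg_ratelimited(&dev->dev, "frame active: %d\n", state);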
2025-08-02  futex: Move futex cleanup to __mmdrop()  (Thomas Gleixner)
Futex hash allocations are done in mm_init() and the cleanup happens in __mmput(). That works most of the time, but there are mm instances which are instantiated via mm_alloc() and freed via mmdrop(), which causes the futex hash to be leaked. Move the cleanup to __mmdrop(). Fixes: 56180dd20c19 ("futex: Use RCU-based per-CPU reference counting instead of rcuref_t") Reported-by: André Draszik <andre.draszik@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: André Draszik <andre.draszik@linaro.org> Link: https://lore.kernel.org/all/87ldo5ihu0.ffs@tglx Closes: https://lore.kernel.org/all/0c8cc83bb73abf080faf584f319008b67d0931db.camel@linaro.org
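Schematically (assuming the futex hash teardown helper is futex_hash_free(), per the series this fixes):

    /* The hash is allocated in mm_init(), which also runs for
     * mm_alloc() users that are freed via mmdrop() and never pass
     * through __mmput() -- so the cleanup must live in __mmdrop(),
     * the common final-free path. */
    void __mmdrop(struct mm_struct *mm)
    {
        /* ... existing teardown ... */
        futex_hash_free(mm);    /* moved here from __mmput() */
        free_mm(mm);
    }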
2025-08-02  smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment  (Roman Kisel)
"boolean" is spelt as "blooean". Fix that. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250722161818.6139-1-romank@linux.microsoft.com
2025-08-02  Merge tag 'bpf-next-6.17' into loongarch-next  (Huacai Chen)
LoongArch architecture changes for 6.17 have many bpf features such as trampoline, so merge 'bpf-next-6.17' to create a base to make bpf work well.
2025-08-01  Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Linus Torvalds)
Pull bpf fixes from Alexei Starovoitov:

- Fix kCFI failures in JITed BPF code on arm64 (Sami Tolvanen, Puranjay Mohan, Mark Rutland, Maxwell Bland)
- Disallow tail calls between BPF programs that use different cgroup local storage maps to prevent out-of-bounds access (Daniel Borkmann)
- Fix unaligned access in flow_dissector and netfilter BPF programs (Paul Chaignon)
- Avoid possible use of uninitialized mod_len in libbpf (Achill Gilgenast)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test for unaligned flow_dissector ctx access
  bpf: Improve ctx access verifier error message
  bpf: Check netfilter ctx accesses are aligned
  bpf: Check flow_dissector ctx accesses are aligned
  arm64/cfi,bpf: Support kCFI + BPF on arm64
  cfi: Move BPF CFI types and helpers to generic code
  cfi: add C CFI type macro
  libbpf: Avoid possible use of uninitialized mod_len
  bpf: Fix oob access in cgroup local storage
  bpf: Move cgroup iterator helpers to bpf.h
  bpf: Move bpf map owner out of common struct
  bpf: Add cookie object to bpf maps
2025-08-01Merge tag 'perf-tools-for-v6.17-2025-08-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools updates from Namhyung Kim:
"Build-ID processing goodies:

Build-IDs are content-based hashes that link regions of memory to ELF
files in post processing (a short sketch of the on-disk note format
appears after this entry's shortlog). They have been available in
distros for quite a while:

  $ file /bin/bash
  /bin/bash: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=707a1c670cd72f8e55ffedfbe94ea98901b7ce3a, for GNU/Linux 3.2.0, stripped

It is possible to ask the kernel to read the build-ID from the mmap'ed
executable's backing storage at the time the mapping is put in place,
and to send it as metadata at that moment so it ends up in perf.data.
Prefer that across the board to speed up 'record' time: perf
post-processes the samples to find the binaries touched by any sample
and saves them indexed by build-ID, and it can skip reading build-IDs
in userspace when they already came from the kernel.

perf record:

* Make --buildid-mmap the default. The kernel can generate MMAP2 events
  with a build-ID taken from the ELF header. Use that by default
  instead of using inode and device ID to identify binaries. It can be
  disabled with --no-buildid-mmap.

* Use BPF for the -u/--uid option to sample processes belonging to a
  user. BPF can track user processes more accurately, and the existing
  logic often fails to get the list of processes due to races with
  reading the /proc filesystem.

* Generate PERF_RECORD_BPF_METADATA when profiling BPF programs that
  have variables starting with "bpf_metadata_". This helps identify the
  BPF objects used in the profile. bpftool has supported this for some
  time, allowing metadata such as commit hashes, versions, etc. to be
  recorded; now it gets recorded in perf.data as well.

* Collect the list of DSOs touched in the sample callchains as well as
  in the sample itself. This increases processing time at the end of
  record but can improve data quality.

perf stat:

* Add support for a new 'drm' pseudo-PMU, like 'hwmon'. It can collect
  DRM usage stats using fdinfo in /proc. On my Intel laptop, it shows
  like below:

    $ perf list drm
    ...
    drm:
      drm-active-stolen-system0
          [Total memory active in one or more engines. Unit: drm_i915]
      drm-active-system0
          [Total memory active in one or more engines. Unit: drm_i915]
      drm-engine-capacity-video
          [Engine capacity. Unit: drm_i915]
      drm-engine-copy
          [Utilization in ns. Unit: drm_i915]
      drm-engine-render
          [Utilization in ns. Unit: drm_i915]
      drm-engine-video
          [Utilization in ns. Unit: drm_i915]
    ...

    $ sudo perf stat -a -e drm-engine-render,drm-engine-video,drm-engine-capacity-video sleep 1

     Performance counter stats for 'system wide':

     48,137,316,988,873 ns        drm-engine-render
         34,452,696,746 ns        drm-engine-video
                     20 capacity  drm-engine-capacity-video

            1.002086194 seconds time elapsed

perf list:

* Add descriptions for software events. The descriptions are in JSON
  format, and the event parser can now handle software events like
  others (for example, they are case-insensitive and subject to
  wildcard matching).

    $ perf list software

    List of pre-defined events (to be used in -e or -M):

    software:
      alignment-faults
          [Number of kernel handled memory alignment faults. Unit: software]
      bpf-output
          [An event used by BPF programs to write to the perf ring buffer. Unit: software]
      cgroup-switches
          [Number of context switches to a task in a different cgroup. Unit: software]
      context-switches
          [Number of context switches [This event is an alias of cs]. Unit: software]
      cpu-clock
          [Per-CPU high-resolution timer based event. Unit: software]
      cpu-migrations
          [Number of times a process has migrated to a new CPU [This event is an alias of migrations]. Unit: software]
      cs
          [Number of context switches [This event is an alias of context-switches]. Unit: software]
      dummy
          [A placeholder event that doesn't count anything. Unit: software]
      emulation-faults
          [Number of kernel handled unimplemented instruction faults handled through emulation. Unit: software]
      faults
          [Number of page faults [This event is an alias of page-faults]. Unit: software]
      major-faults
          [Number of major page faults. Major faults require I/O to handle. Unit: software]
      migrations
          [Number of times a process has migrated to a new CPU [This event is an alias of cpu-migrations]. Unit: software]
      minor-faults
          [Number of minor page faults. Minor faults don't require I/O to handle. Unit: software]
      page-faults
          [Number of page faults [This event is an alias of faults]. Unit: software]
      task-clock
          [Per-task high-resolution timer based event. Unit: software]

perf ftrace:

* Add a -e/--events option to 'perf ftrace latency' to measure the
  latency between two events instead of across a single function.

    $ sudo perf ftrace latency -ab -e i915_request_wait_begin,i915_request_wait_end --hide-empty -- sleep 1
    #   DURATION     |      COUNT | GRAPH                |
        256 - 512 us |          4 | ######               |
          2 -   4 ms |          2 | ###                  |
          4 -   8 ms |         12 | ###################  |
          8 -  16 ms |         10 | ################     |

    # statistics  (in usec)
      total time:   194915
        avg time:     6961
        max time:    12855
        min time:      373
           count:       28

* Add new function graph tracer options (--graph-opts) to display more
  info like arguments and return value. They are passed to the kernel
  ftrace directly.

    $ sudo perf ftrace -G vfs_write --graph-opts retval,retaddr
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
    ...
     5)               |  mutex_unlock() { /* <-rb_simple_write+0xda/0x150 */
     5)   0.188 us    |    local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cf90e */
     5)               |    rt_mutex_slowunlock() { /* <-rb_simple_write+0xda/0x150 */
     5)               |      _raw_spin_lock_irqsave() { /* <-rt_mutex_slowunlock+0x4f/0x200 */
     5)   0.123 us    |        preempt_count_add(); /* <-_raw_spin_lock_irqsave+0x23/0x90 ret=0x0 */
     5)   0.128 us    |        local_clock(); /* <-__lock_acquire.isra.0+0x17a/0x740 ret=0x3bf2a3cfc8b */
     5)   0.086 us    |        do_raw_spin_trylock(); /* <-_raw_spin_lock_irqsave+0x4a/0x90 ret=0x1 */
     5)   0.845 us    |      } /* _raw_spin_lock_irqsave ret=0x292 */
    ...

Misc:

* Add a 'perf archive --exclude-buildids <FILE>' option to skip some
  binaries. The format of FILE should be the same as the output of
  'perf buildid-list'.

* Get rid of the dependency on libcrypto. It was only used to compute a
  SHA-1 hash, so implement that directly, like in the kernel. A side
  effect is that it needs the -fno-strict-aliasing compiler option
  (again, like in the kernel).

* Convert all shell script tests to use bash"

* tag 'perf-tools-for-v6.17-2025-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (179 commits)
  perf record: Cache build-ID of hit DSOs only
  perf test: Ensure lock contention using pipe mode
  perf python: Stop using deprecated PyUnicode_AsString()
  perf list: Skip ABI PMUs when printing pmu values
  perf list: Remove tracepoint printing code
  perf tp_pmu: Add event APIs
  perf tp_pmu: Factor existing tracepoint logic to new file
  perf parse-events: Remove non-json software events
  perf jevents: Add common software event json
  perf tools: Remove libtraceevent in .gitignore
  perf test: Fix comment ordering
  perf sort: Use perf_env to set arch sort keys and header
  perf test: Move PERF_SAMPLE_WEIGHT_STRUCT parsing to common test
  perf sample: Remove arch notion of sample parsing
  perf env: Remove global perf_env
  perf trace: Avoid global perf_env with evsel__env
  perf auxtrace: Pass perf_env from session through to mmap read
  perf machine: Explicitly pass in host perf_env
  perf bench synthesize: Avoid use of global perf_env
  perf top: Make perf_env locally scoped
  ...
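As referenced above, here is a minimal userspace sketch of what reading
a build-ID out of an ELF image involves: walking the bytes of a PT_NOTE
segment for an NT_GNU_BUILD_ID note owned by "GNU". This is only an
illustration of the note format, not perf's actual code; the note
segment is assumed to be already read into a buffer.

  #include <elf.h>
  #include <stddef.h>
  #include <string.h>

  /* Find the NT_GNU_BUILD_ID note in a PT_NOTE segment's bytes.
   * ELF64 note entries are 4-byte aligned; name and descriptor
   * lengths come from the Elf64_Nhdr that precedes them. */
  static int find_build_id(const unsigned char *note, size_t len,
                           const unsigned char **id, size_t *id_len)
  {
          size_t off = 0;

          while (off + sizeof(Elf64_Nhdr) <= len) {
                  const Elf64_Nhdr *nh = (const Elf64_Nhdr *)(note + off);
                  size_t name_off = off + sizeof(*nh);
                  size_t desc_off = name_off + ((nh->n_namesz + 3) & ~3UL);

                  if (desc_off + nh->n_descsz > len)
                          break;
                  if (nh->n_type == NT_GNU_BUILD_ID &&
                      nh->n_namesz == 4 &&
                      memcmp(note + name_off, "GNU", 4) == 0) {
                          *id = note + desc_off;
                          *id_len = nh->n_descsz;   /* 20 bytes for sha1 */
                          return 0;
                  }
                  off = desc_off + ((nh->n_descsz + 3) & ~3UL);
          }
          return -1;
  }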
2025-08-01Merge tag 'parisc-for-6.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc updates from Helge Deller:

 - The parisc kernel wrongly allows reading from read-protected
   userspace memory without faulting, e.g. when userspace uses
   mprotect() to read-protect a memory area and then passes a pointer
   to this memory to a write(2, addr, 1) syscall. To fix this issue,
   Dave Anglin developed a set of patches which use the proberi
   assembler instruction to additionally check read access permissions
   at runtime (a conceptual sketch follows the shortlog below).

 - Randy Dunlap contributed two patches to fix a minor typo and to
   explain why a 32-bit compiler is needed even though a 64-bit kernel
   is built.

* tag 'parisc-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Revise __get_user() to probe user read access
  parisc: Revise gateway LWS calls to probe user read access
  parisc: Drop WARN_ON_ONCE() from flush_cache_vmap
  parisc: Try to fixup kernel exception in bad_area_nosemaphore path of do_page_fault()
  parisc: Define and use set_pte_at()
  parisc: Rename pte_needs_flush() to pte_needs_cache_flush() in cache.c
  parisc: Check region is readable by user in raw_copy_from_user()
  parisc: Update comments in make_insert_tlb
  parisc: Makefile: explain that 64BIT requires both 32-bit and 64-bit compilers
  parisc: Makefile: fix a typo in palo.conf
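As referenced above, a conceptual sketch of the probe-before-copy idea.
This is not the actual parisc code: prober_user() and
__raw_copy_from_user() are assumed names for illustration, and the real
fix also covers assembler paths such as the gateway LWS calls.

  /* Probe read permission at runtime before copying from userspace,
   * so a read-protected user mapping produces a clean failure instead
   * of being silently readable. */
  unsigned long raw_copy_from_user(void *dst, const void __user *src,
                                   unsigned long n)
  {
          if (!prober_user(src, n))       /* assumed proberi-based probe */
                  return n;               /* n bytes uncopied => -EFAULT upstream */

          return __raw_copy_from_user(dst, src, n);   /* assumed inner copy */
  }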
2025-08-01tracing: Have unsigned int function args displayed as hexadecimalSteven Rostedt
Most function arguments that are passed in as unsigned int or unsigned
long are better displayed as hexadecimal than as a normal integer. For
example, the functions:

  static void __create_object(unsigned long ptr, size_t size,
                              int min_count, gfp_t gfp,
                              unsigned int objflags);
  static bool stack_access_ok(struct unwind_state *state,
                              unsigned long _addr, size_t len);
  void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);

show up in the trace as:

  __create_object(ptr=-131387050520576, size=4096, min_count=1, gfp=3264, objflags=0) <-kmem_cache_alloc_noprof
  stack_access_ok(state=0xffffc9000233fc98, _addr=-60473102566256, len=8) <-unwind_next_frame
  __local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs

Instead, by displaying unsigned values as hexadecimal, they look more
like this:

  __create_object(ptr=0xffff8881028d2080, size=0x280, min_count=1, gfp=0x82820, objflags=0x0) <-kmem_cache_alloc_node_noprof
  stack_access_ok(state=0xffffc90000003938, _addr=0xffffc90000003930, len=0x8) <-unwind_next_frame
  __local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs

This is much easier to understand, as most unsigned longs are really
just pointers. Even the "unsigned int cnt" in __local_bh_disable_ip()
reads better as hexadecimal, since a lot of flags are passed as
unsigned. (A sketch of the BTF check behind this follows the entry
below.)

Changes since v2: https://lore.kernel.org/20250801111453.01502861@gandalf.local.home
- Use btf_int_encoding() instead of open coding it (Martin KaFai Lau)

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Link: https://lore.kernel.org/20250801165601.7770d65c@gandalf.local.home
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
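As referenced above, a minimal sketch of the BTF-based choice.
btf_type_is_int(), btf_int_encoding(), and BTF_INT_SIGNED are real
kernel helpers/flags; the arg_format() wrapper is illustrative, not the
actual trace code.

  #include <linux/btf.h>

  /* Pick a printf-style format for an integer function argument:
   * consult the BTF encoding and use hexadecimal when the type is
   * unsigned, decimal when it is signed. */
  static const char *arg_format(const struct btf_type *t)
  {
          if (btf_type_is_int(t) &&
              !(btf_int_encoding(t) & BTF_INT_SIGNED))
                  return "0x%lx";       /* unsigned -> hexadecimal */

          return "%ld";                 /* signed -> decimal */
  }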
2025-08-01Merge tag 'cxl-for-6.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull CXL updates from Dave Jiang:
"The most significant change in this pull request is the series that
introduces the ACQUIRE() and ACQUIRE_ERR() macros to replace
conditional locking and ease the pain points of scoped_cond_guard()
(see the usage sketch after the shortlog below). The series also
includes follow-on changes that refactor the CXL subsystem to utilize
the new macros.

Detailed summary:

 - Add documentation template for CXL conventions to document CXL
   platform quirks
 - Replace mutex_lock_io() with mutex_lock() for mailbox
 - Add location limit for fake CFMWS range for cxl_test, ARM platform
   enabling
 - CXL documentation typo and clarity fixes
 - Use correct format specifier for function cxl_set_ecs_threshold()
 - Make cxl_bus_type constant
 - Introduce new helper cxl_resource_contains_addr() to check address
   availability
 - Fix wrong DPA checking for PPR operation
 - Remove core/acpi.c and CXL core dependency on ACPI
 - Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks
 - Add CXL updates utilizing the ACQUIRE() macro to remove gotos and
   improve readability
 - Add return for the dummy version of cxl_decoder_detach() without
   CONFIG_CXL_REGION
 - CXL events updates for spec r3.2
 - Fix return of __cxl_decoder_detach() error path
 - CXL debugfs documentation fix"

* tag 'cxl-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (28 commits)
  Documentation/ABI/testing/debugfs-cxl: Add 'cxl' to clear_poison path
  cxl/region: Fix an ERR_PTR() vs NULL bug
  cxl/events: Trace Memory Sparing Event Record
  cxl/events: Add extra validity checks for CVME count in DRAM Event Record
  cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record
  cxl/events: Update Common Event Record to CXL spec rev 3.2
  cxl: Fix -Werror=return-type in cxl_decoder_detach()
  cleanup: Fix documentation build error for ACQUIRE updates
  cxl: Convert to ACQUIRE() for conditional rwsem locking
  cxl/region: Consolidate cxl_decoder_kill_region() and cxl_region_detach()
  cxl/region: Move ready-to-probe state check to a helper
  cxl/region: Split commit_store() into __commit() and queue_reset() helpers
  cxl/decoder: Drop pointless locking
  cxl/decoder: Move decoder register programming to a helper
  cxl/mbox: Convert poison list mutex to ACQUIRE()
  cleanup: Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks
  cxl: Remove core/acpi.c and cxl core dependency on ACPI
  cxl/core: Using cxl_resource_contains_addr() to check address availability
  cxl/edac: Fix wrong dpa checking for PPR operation
  cxl/core: Introduce a new helper cxl_resource_contains_addr()
  ...
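As referenced above, a rough usage sketch of the new pattern. The
struct and lock-class names here are illustrative (not taken from the
CXL code), and the macro spelling follows the series description rather
than a verified reading of the final cleanup.h.

  /* ACQUIRE() declares a scoped guard that attempts a conditional
   * lock; ACQUIRE_ERR() reports whether the acquisition succeeded, so
   * failure is an early return instead of goto-based unwinding. */
  static int foo_do_locked(struct foo *foo)
  {
          int rc;

          ACQUIRE(mutex_intr, lock)(&foo->lock);
          rc = ACQUIRE_ERR(mutex_intr, &lock);
          if (rc)
                  return rc;            /* e.g. -EINTR: lock not taken */

          /* ... foo->lock is held here and is released automatically
           * when the function returns ... */
          return 0;
  }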
2025-08-01net: Add locking to protect skb->dev access in ip_outputSharath Chandra Vurukala
In ip_output(), skb->dev is updated from skb_dst(skb)->dev, which can
become invalid when the interface is unregistered and freed. Introduce
a new skb_dst_dev_rcu() function to be used instead of skb_dst_dev()
inside RCU read-side sections in ip_output(). This ensures that all
skbs associated with a device being deregistered are transmitted out
first, before the device is freed. (A sketch of the helper's likely
shape follows this entry.)

Given that ip_output() is called within an rcu_read_lock() critical
section or from a bottom-half context, it is safe to introduce an RCU
read-side critical section within it.

Multiple panic call stacks were observed when UL (uplink) traffic was
run concurrently with device deregistration from different functions;
one sample is pasted below for reference.

[496733.627565][T13385] Call trace:
[496733.627570][T13385]  bpf_prog_ce7c9180c3b128ea_cgroupskb_egres+0x24c/0x7f0
[496733.627581][T13385]  __cgroup_bpf_run_filter_skb+0x128/0x498
[496733.627595][T13385]  ip_finish_output+0xa4/0xf4
[496733.627605][T13385]  ip_output+0x100/0x1a0
[496733.627613][T13385]  ip_send_skb+0x68/0x100
[496733.627618][T13385]  udp_send_skb+0x1c4/0x384
[496733.627625][T13385]  udp_sendmsg+0x7b0/0x898
[496733.627631][T13385]  inet_sendmsg+0x5c/0x7c
[496733.627639][T13385]  __sys_sendto+0x174/0x1e4
[496733.627647][T13385]  __arm64_sys_sendto+0x28/0x3c
[496733.627653][T13385]  invoke_syscall+0x58/0x11c
[496733.627662][T13385]  el0_svc_common+0x88/0xf4
[496733.627669][T13385]  do_el0_svc+0x2c/0xb0
[496733.627676][T13385]  el0_svc+0x2c/0xa4
[496733.627683][T13385]  el0t_64_sync_handler+0x68/0xb4
[496733.627689][T13385]  el0t_64_sync+0x1a4/0x1a8

Changes in v3:
- Replaced WARN_ON() with WARN_ON_ONCE(), as suggested by Willem de Bruijn.
- Dropped legacy lines mistakenly pulled in from an outdated branch.

Changes in v2:
- Addressed review comments from Eric Dumazet
- Used READ_ONCE() to prevent potential load/store tearing
- Added skb_dst_dev_rcu() and used it along with rcu_read_lock() in ip_output

Signed-off-by: Sharath Chandra Vurukala <quic_sharathv@quicinc.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250730105118.GA26100@hu-sharathv-hyd.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
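As referenced above, a sketch of the likely shape of skb_dst_dev_rcu(),
inferred from the description in this commit message rather than the
committed helper itself.

  /* Fetch the dst's device pointer exactly once with READ_ONCE() so
   * it cannot be re-read or torn mid-use, and insist on an RCU
   * read-side section so the device cannot be freed underneath us. */
  static inline struct net_device *skb_dst_dev_rcu(const struct sk_buff *skb)
  {
          WARN_ON_ONCE(!rcu_read_lock_held());
          return READ_ONCE(skb_dst(skb)->dev);
  }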