linux-arm.git - Russell King's ARM Linux kernel tree

Age	Commit message (Collapse)	Author
2025-07-13	mm/hugetlb: remove prepare_hugepage_range()	Peter Xu
	Only mips and loongarch implemented this API, however what it does was checking against stack overflow for either len or addr. That's already done in arch's arch_get_unmapped_area*() functions, even though it may not be 100% identical checks. For example, for both of the architectures, there will be a trivial difference on how stack top was defined. The old code uses STACK_TOP which may be slightly smaller than TASK_SIZE on either of them, but the hope is that shouldn't be a problem. It means the whole API is pretty much obsolete at least now, remove it completely. Link: https://lkml.kernel.org/r/20250627160707.2124580-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	samples/damon/mtier: add parameters for node0 memory usage	Yunjeong Mun
	Change the hard-coded quota goal metric values into sysfs knobs: `node0_mem_used_bp` and `node0_mem_free_bp`. These knobs represent the used and free memory ratio of node0 in basis points (bp, where 1 bp = 0.01%). As mentioned in [1], this patch is developed under the assumption that node0 is always the fast-tier in a two-tiers memory setup. [1] https://lore.kernel.org/linux-mm/20250420194030.75838-8-sj@kernel.org/ Link: https://lkml.kernel.org/r/20250627163329.50997-1-sj@kernel.org Link: https://patch.msgid.link/20250619050313.1535-1-yunjeong.mun@sk.com Signed-off-by: Yunjeong Mun <yunjeong.mun@sk.com> Signed-off-by: SeongJae Park <sj@kernel.org> Suggested-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/page_isolation: remove migratetype parameter from more functions	Zi Yan
	migratetype is no longer overwritten during pageblock isolation, start_isolate_page_range(), has_unmovable_pages(), and set_migratetype_isolate() no longer need which migratetype to restore during isolation failure. For has_unmoable_pages(), it needs to know if the isolation is for CMA allocation, so adding PB_ISOLATE_MODE_CMA_ALLOC provide the information. At the same time change isolation flags to enum pb_isolate_mode (PB_ISOLATE_MODE_MEM_OFFLINE, PB_ISOLATE_MODE_CMA_ALLOC, PB_ISOLATE_MODE_OTHER). Remove REPORT_FAILURE and check PB_ISOLATE_MODE_MEM_OFFLINE, since only PB_ISOLATE_MODE_MEM_OFFLINE reports isolation failures. alloc_contig_range() no longer needs migratetype. Replace it with a newly defined acr_flags_t to tell if an allocation is for CMA. So does __alloc_contig_migrate_range(). Add ACR_FLAGS_NONE (set to 0) to indicate ordinary allocations. Link: https://lkml.kernel.org/r/20250617021115.2331563-7-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/page_isolation: remove migratetype from undo_isolate_page_range()	Zi Yan
	Since migratetype is no longer overwritten during pageblock isolation, undoing pageblock isolation no longer needs which migratetype to restore. Link: https://lkml.kernel.org/r/20250617021115.2331563-6-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/page_isolation: remove migratetype from move_freepages_block_isolate()	Zi Yan
	Since migratetype is no longer overwritten during pageblock isolation, moving a pageblock out of MIGRATE_ISOLATE no longer needs a new migratetype. Add pageblock_isolate_and_move_free_pages() and pageblock_unisolate_and_move_free_pages() to be explicit about the page isolation operations. Both share the common code in __move_freepages_block_isolate(), which is renamed from move_freepages_block_isolate(). Add toggle_pageblock_isolate() to flip pageblock isolation bit in __move_freepages_block_isolate(). Make set_pageblock_migratetype() only accept non MIGRATE_ISOLATE types, so that one should use set_pageblock_isolate() to isolate pageblocks. As a result, move pageblock migratetype code out of __move_freepages_block(). Link: https://lkml.kernel.org/r/20250617021115.2331563-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/page_alloc: add support for initializing pageblock as isolated	Zi Yan
	MIGRATE_ISOLATE is a standalone bit, so a pageblock cannot be initialized to just MIGRATE_ISOLATE. Add init_pageblock_migratetype() to enable initialize a pageblock with a migratetype and isolated. Link: https://lkml.kernel.org/r/20250617021115.2331563-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/page_isolation: make page isolation a standalone bit	Zi Yan
	During page isolation, the original migratetype is overwritten, since MIGRATE_* are enums and stored in pageblock bitmaps. Change MIGRATE_ISOLATE to be stored a standalone bit, PB_migrate_isolate, like PB_compact_skip, so that migratetype is not lost during pageblock isolation. Link: https://lkml.kernel.org/r/20250617021115.2331563-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/page_alloc: pageblock flags functions clean up	Zi Yan
	Patch series "Make MIGRATE_ISOLATE a standalone bit", v10. This patchset moves MIGRATE_ISOLATE to a standalone bit to avoid being overwritten during pageblock isolation process. Currently, MIGRATE_ISOLATE is part of enum migratetype (in include/linux/mmzone.h), thus, setting a pageblock to MIGRATE_ISOLATE overwrites its original migratetype. This causes pageblock migratetype loss during alloc_contig_range() and memory offline, especially when the process fails due to a failed pageblock isolation and the code tries to undo the finished pageblock isolations. In terms of performance for changing pageblock types, no performance change is observed: 1. I used perf to collect stats of offlining and onlining all memory of a 40GB VM 10 times and see that get_pfnblock_flags_mask() and set_pfnblock_flags_mask() take about 0.12% and 0.02% of the whole process respectively with and without this patchset across 3 runs. 2. I used perf to collect stats of dd from /dev/random to a 40GB tmpfs file and find get_pfnblock_flags_mask() takes about 0.05% of the process with and without this patchset across 3 runs. This patch (of 6): No functional change is intended. 1. Add __NR_PAGEBLOCK_BITS for the number of pageblock flag bits and use roundup_pow_of_two(__NR_PAGEBLOCK_BITS) as NR_PAGEBLOCK_BITS to take right amount of bits for pageblock flags. 2. Rename PB_migrate_skip to PB_compact_skip. 3. Add {get,set,clear}_pfnblock_bit() to operate one a standalone bit, like PB_compact_skip. 3. Make {get,set}_pfnblock_flags_mask() internal functions and use {get,set}_pfnblock_migratetype() for pageblock migratetype operations. 4. Move pageblock flags common code to get_pfnblock_bitmap_bitidx(). 3. Use MIGRATETYPE_MASK to get the migratetype of a pageblock from its flags. 4. Use PB_migrate_end in the definition of MIGRATETYPE_MASK instead of PB_migrate_bits. 5. Add a comment on is_migrate_cma_folio() to prevent one from changing it to use get_pageblock_migratetype() and causing issues. Link: https://lkml.kernel.org/r/20250617021115.2331563-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250617021115.2331563-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm,memory_hotplug: drop status_change_nid parameter from memory_notify	Oscar Salvador
	There no users left of status_change_nid, so drop it from memory_notify struct. Link: https://lkml.kernel.org/r/20250616135158.450136-12-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm,page_ext: derive the node from the pfn	Oscar Salvador
	page_ext is the only user of 'status_change_nid', which is set in online/offline operations, to know to which node we are adding/removing memory. Prior to call any notifiers, the memmap is initialized via, which among other things, sets the node the pages belong to, to all corresponging pages. This means that there is no need to keep using 'status_change_nid' since we can derive the node from the pfn. This will allow us to finally drop 'status_change_nid' from the memory_notify struct. Link: https://lkml.kernel.org/r/20250616135158.450136-11-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm,mempolicy: use node-notifier instead of memory-notifier	Oscar Salvador
	mempolicy is only concerned when a numa node changes its memory state, because it needs to take this node into account for the auto-weighted memory policy system. So stop using the memory notifier and use the new numa node notifer instead. Link: https://lkml.kernel.org/r/20250616135158.450136-10-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Rakie Kim <rakie.kim@sk.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Gregory Price <gourry@gourry.net> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	kernel,cpuset: use node-notifier instead of memory-notifier	Oscar Salvador
	cpuset is only concerned when a numa node changes its memory state, as it needs to know the current numa nodes with memory to keep an updated mems_allowed mask. So stop using the memory notifier and use the new numa node notifer instead. Link: https://lkml.kernel.org/r/20250616135158.450136-9-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	drivers,hmat: use node-notifier instead of memory-notifier	Oscar Salvador
	hmat driver is only concerned when a numa node changes its memory state, specifically when a numa node with memory comes into play for the first time, because it will register the memory_targets belonging to that numa node. So stop using the memory notifier and use the new numa node notifer instead. Link: https://lkml.kernel.org/r/20250616135158.450136-8-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	drivers,cxl: use node-notifier instead of memory-notifier	Oscar Salvador
	memory-tier is only concerned when a numa node changes its memory state, specifically when a numa node with memory comes into play for the first time, because it needs to get its performance attributes to build a proper demotion chain. So stop using the memory notifier and use the new numa node notifer instead. Link: https://lkml.kernel.org/r/20250616135158.450136-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm,memory-tiers: use node-notifier instead of memory-notifier	Oscar Salvador
	memory-tier is only concerned when a numa node changes its memory state, because it then needs to re-create the demotion list. So stop using the memory notifier and use the new numa node notifer instead. Link: https://lkml.kernel.org/r/20250616135158.450136-6-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm,slub: use node-notifier instead of memory-notifier	Oscar Salvador
	slub is only concerned when a numa node changes its memory state, so stop using the memory notifier and use the new numa node notifer instead. [akpm@linux-foundation.org: slub.c needs node.h for struct node_notify] Link: https://lore.kernel.org/oe-kbuild-all/202506202144.dGkFxasv-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250616135158.450136-5-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm,memory_hotplug: implement numa node notifier	Oscar Salvador
	There are at least six consumers of hotplug_memory_notifier that what they really are interested in is whether any numa node changed its state, e.g: going from having memory to not having memory and vice versa. Implement a specific notifier for numa nodes when their state gets changed, which will later be used by those consumers that are only interested in numa node state changes. Add documentation as well. [dan.carpenter@linaro.org: set failure reason in offline_pages()] Link: https://lkml.kernel.org/r/be4fd31b-7d09-46b0-8329-6d0464ffa7a5@sabinyo.mountain Link: https://lkml.kernel.org/r/20250616135158.450136-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm,memory_hotplug: remove status_change_nid_normal and update documentation	Oscar Salvador
	Now that the last user of status_change_nid_normal is gone, we can remove it. Update documentation accordingly. Link: https://lkml.kernel.org/r/20250616135158.450136-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm,slub: do not special case N_NORMAL nodes for slab_nodes	Oscar Salvador
	Patch series "Implement numa node notifier", v7. Memory notifier is a tool that allow consumers to get notified whenever memory gets onlined or offlined in the system. Currently, there are 10 consumers of that, but 5 out of those 10 consumers are only interested in getting notifications when a numa node changes its memory state. That means going from memoryless to memory-aware of vice versa. Which means that for every {online,offline}_pages operation they get notified even though the numa node might not have changed its state. This is suboptimal, and we want to decouple numa node state changes from memory state changes. While we are doing this, remove status_change_nid_normal, as the only current user (slub) does not really need it. This allows us to further simplify and clean up the code. The first patch gets rid of status_change_nid_normal in slub. The second patch implements a numa node notifier that does just that, and have those consumers register in there, so they get notified only when they are interested. The third patch replaces 'status_change_nid{_normal}' fields within memory_notify with a 'nid', as that is only what we need for memory notifer and update the only user of it (page_ext). Consumers that are only interested in numa node states change are: - memory-tier - slub - cpuset - hmat - cxl - autoweight-mempolicy This patch (of 11): Currently, slab_mem_going_online_callback() checks whether the node has N_NORMAL memory in order to be set in slab_nodes. While it is true that getting rid of that enforcing would mean ending up with movables nodes in slab_nodes, the memory waste that comes with that is negligible. So stop checking for status_change_nid_normal and just use status_change_nid instead which works for both types of memory. Also, once we allocate the kmem_cache_node cache for the node in slab_mem_online_callback(), we never deallocate it in slab_mem_offline_callback() when the node goes memoryless, so we can just get rid of it. The side effects are that we will stop clearing the node from slab_nodes, and also that newly created kmem caches after node hotremove will now allocate their kmem_cache_node for the node(s) that was hotremoved, but these should be negligible. Link: https://lkml.kernel.org/r/20250616135158.450136-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20250616135158.450136-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm, madvise: use standard madvise locking in madvise_set_anon_name()	Vlastimil Babka
	Use madvise_lock()/madvise_unlock() in madvise_set_anon_name() in the same way as in do_madvise(). This narrows the lock scope a bit and reuses existing functionality. get_lock_mode() already picks the correct MADVISE_MMAP_WRITE_LOCK mode for __MADV_SET_ANON_VMA_NAME so we can just remove the explicit assignment. There is a user visible change in that the prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME...) might now return -EINTR. Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-4-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm, madvise: move madvise_set_anon_name() down the file	Vlastimil Babka
	Preparatory change so that we can use madvise_lock()/unlock() in the function without forward declarations or more thorough shuffling. No functional change. Move as a separate commit helps git heuristics to detect it properly. Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-3-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm, madvise: extract mm code from prctl_set_vma() to mm/madvise.c	Vlastimil Babka
	Setting anon_name is done via madvise_set_anon_name() and behaves a lot of like other madvise operations. However, apparently because madvise() has lacked the 4th argument and prctl() not, the userspace entry point has been implemented via prctl(PR_SET_VMA, ...) and handled first by prctl_set_vma(). Currently prctl_set_vma() lives in kernel/sys.c but setting the vma->anon_name is mm-specific code so extract it to a new set_anon_vma_name() function under mm. mm/madvise.c seems to be the most straightforward place as that's where madvise_set_anon_name() lives. Stop declaring the latter in mm.h and instead declare set_anon_vma_name(). Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-2-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm, madvise: simplify anon_name handling	Vlastimil Babka
	Patch series "madvise anon_name cleanups", v2. While reviewing Lorenzo's madvise cleanups I've noticed that we can handle anon_name in madvise code much better, so sending that as patch 1. Initially I wanted to do first move the existing logic from madvise_vma_behavior() to madvise_update_vma() as a separate patch before the actual simplification but that would require adding anon_vma_name_put() in error handling paths only to be removed again, so it's a single patch to avoid churn. It's also an opportunity to move some mm code from prctl under mm, hence patch 2. After code moving preparation in patch 3, also unify madvise lock handling for madvise_set_anon_name() in patch 4. This patch (of 4): Since the introduction in 9a10064f5625 ("mm: add a field to store names for private anonymous memory") the code to set anon_name on a vma has been using madvise_update_vma() to call replace_anon_vma_name(). Since the former is called also by a number of other madvise behaviours that do not set a new anon_name, they have been passing the existing anon_name of the vma to make replace_anon_vma_name() a no-op. This is rather wasteful as it needs anon_vma_name_eq() to determine the no-op situations, and checks for when replace_anon_vma_name() is allowed (the vma is anon/shmem) duplicate the checks already done earlier in madvise_vma_behavior(). It has also lead to commit 942341dcc574 ("mm: fix use-after-free when anon vma name is used after vma is freed") adding anon_name refcount get/put operations exactly to the cases that actually do not change anon_name - just so the replace_anon_vma_name() can keep safely determining it has nothing to do. The recent madvise cleanups made this suboptimal handling very obvious, but happily also allow for an easy fix. madvise_update_vma() now has the complete information whether it's been called to set a new anon_name, so stop passing it the existing vma's name and doing the refcount get/put in its only caller madvise_vma_behavior(). In madvise_update_vma() itself, limit calling of replace_anon_vma_name() only to cases where we are setting a new name, otherwise we know it's a no-op. We can rely solely on the __MADV_SET_ANON_VMA_NAME behaviour and can remove the duplicate checks for vma being anon/shmem that were done already in madvise_vma_behavior(). Additionally, by using vma_modify_flags() when not modifying the anon_name, avoid explicitly passing the existing vma's anon_name and storing a pointer to it in struct madv_behavior or a local variable. This prevents the danger of accessing a freed anon_name after vma merging, previously fixed by commit 942341dcc574. Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-0-600075462a11@suse.cz Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-1-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/madvise: eliminate very confusing manipulation of prev VMA	Lorenzo Stoakes
	The madvise code has for the longest time had very confusing code around the 'prev' VMA pointer passed around various functions which, in all cases except madvise_update_vma(), is unused and instead simply updated as soon as the function is invoked. To compound the confusion, the prev pointer is also used to indicate to the caller that the mmap lock has been dropped and that we can therefore not safely access the end of the current VMA (which might have been updated by madvise_update_vma()). Clear up this confusion by not setting prev = vma anywhere except in madvise_walk_vmas(), update all references to prev which will always be equal to vma after madvise_vma_behavior() is invoked, and adding a flag to indicate that the lock has been dropped to make this explicit. Additionally, drop a redundant BUG_ON() from madvise_collapse(), which is simply reiterating the BUG_ON(mmap_locked) above it (note that BUG_ON() is not appropriate here, but we leave existing code as-is). We finally adjust the madvise_walk_vmas() logic to be a little clearer - delaying the assignment of the end of the range to the start of the new range until the last moment and handling the lock being dropped scenario immediately. Additionally add some explanatory comments. [lorenzo.stoakes@oracle.com: fix very subtle bug] Link: https://lkml.kernel.org/r/dca94cde-8afb-4eab-8e57-3f508624d670@lucifer.local Link: https://lkml.kernel.org/r/63d281c5df2e64225ab5b4bda398b45e22818701.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/madvise: thread all madvise state through madv_behavior	Lorenzo Stoakes
	Doing so means we can get rid of all the weird struct vm_area_struct **prev stuff, everything becomes consistent and in future if we want to make change to behaviour there's a single place where all relevant state is stored. This also allows us to update try_vma_read_lock() to be a little more succinct and set up state for us, as well as cleaning up madvise_update_vma(). We also update the debug assertion prior to madvise_update_vma() to assert that this is a write operation as correctly pointed out by Barry in the relevant thread. We can't reasonably update the madvise functions that live outside of mm/madvise.c so we leave those as-is. Link: https://lkml.kernel.org/r/7b345ab82ef51e551f8bc0c4f7be25712871629d.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/madvise: thread VMA range state through madvise_behavior	Lorenzo Stoakes
	Rather than updating start and a confusing local parameter 'tmp' in madvise_walk_vmas(), instead store the current range being operated upon in the struct madvise_behavior helper object in a range pair and use this consistently in all operations. This makes it clearer what is going on and opens the door to further cleanup now we store state regarding what is currently being operated upon here. Link: https://lkml.kernel.org/r/518480ceb48553d3c280bc2b0b5e77bbad840147.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/madvise: thread mm_struct through madvise_behavior	Lorenzo Stoakes
	There's no need to thread a pointer to the mm_struct nor have different functions signatures for each behaviour, instead store state in the struct madvise_behavior object consistently and use it for all madvise() actions. Link: https://lkml.kernel.org/r/a47d850b0111735e026d438c3300c0e3b7f439f4.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13	mm/madvise: remove the visitor pattern and thread anon_vma state	Lorenzo Stoakes
	Patch series "madvise cleanup", v2. This is a series of patches that helps address a number of historic problems in the madvise() implementation: * Eliminate the visitor pattern and having the code which is implemented for both the anon_vma_name implementation and ordinary madvise() operations use the same madvise_vma_behavior() implementation. * Thread state through the madvise_behavior state object - this object, very usefully introduced by SJ, is already used to transmit state through operations. This series extends this by having all madvise() operations use this, including anon_vma_name. * Thread range, VMA state through madvise_behavior - This helps avoid a lot of the confusing code around range and VMA state and again keeps things consistent and with a single 'source of truth'. * Addressing the very strange behaviour around the passed around struct vm_area_struct *prev pointer - all read-only users do absolutely nothing with the prev pointer. The only function that uses it is madvise_update_vma(), and in all cases prev is always reset to VMA. Fix this by no longer having aything but madvise_update_vma() reference prev, and having madvise_walk_vmas() update prev in each instance. Additionally make it clear that the meaningful change in vma state is when madvise_update_vma() potentially merges a VMA, so explicitly retrieve the VMA in this case. Update and clarify the madvise_walk_vmas() function - this is a source of a great deal of confusion, so simplify, stop using prev = NULL to signify that the mmap lock has been dropped (!) and make that explicit, and add some comments to explain what's going on. This patch (of 5): Now we have the madvise_behavior helper struct we no longer need to mess around with void* pointers in order to propagate anon_vma_name, and this means we can get rid of the confusing and inconsistent visitor pattern implementation in madvise_vma_anon_name(). This means we now have a single state object that threads through most of madvise()'s logic and a single code path which executes the majority of madvise() behaviour (we maintain separate logic for failure injection and memory population for the time being). We are able to remove the visitor pattern by handling the anon_vma_name setting logic via an internal madvise flag - __MADV_SET_ANON_VMA_NAME. This uses a negative value so it isn't reasonable that we will ever add this as a UAPI flag. Additionally, the madvise_behavior_valid() check ensures that user-specified behaviours are strictly only those we permit which, of course, this flag will be excluded from. We are able to propagate the anon_vma_name object through use of the madvise_behavior helper struct. Doing this results in a can_modify_vma_madv() check for anonymous VMA name changes, however this will cause no issues as this operation is not prohibited. We can also then reuse more code and drop the redundant madvise_vma_anon_name() function altogether. Additionally separate out behaviours that update VMAs from those that do not. Link: https://lkml.kernel.org/r/cover.1750433500.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/c5094bfccb41ecd19d4e9bcaa1c4a11e00158bba.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-12	Merge branch 'mm-hotfixes-stable' into mm-stable to pick up changes which	Andrew Morton
	are required for a merge of the series "mm: folio_pte_batch() improvements".
2025-07-09	ksm_tests: skip hugepage test when Transparent Hugepages are disabled	Li Wang
	Some systems (e.g. minimal or real-time kernels) may not enable Transparent Hugepages (THP), causing MADV_HUGEPAGE to return EINVAL. This patch introduces a runtime check using the existing THP sysfs interface and skips the hugepage merging test (`-H`) when THP is not available. To avoid those failures: # ----------------------------- # running ./ksm_tests -H -s 100 # ----------------------------- # ksm_tests: MADV_HUGEPAGE: Invalid argument # [FAIL] not ok 1 ksm_tests -H -s 100 # exit=2 # -------------------- # running ./khugepaged # -------------------- # Reading PMD pagesize failed# [FAIL] not ok 1 khugepaged # exit=1 # -------------------- # running ./soft-dirty # -------------------- # TAP version 13 # 1..15 # ok 1 Test test_simple # ok 2 Test test_vma_reuse dirty bit of allocated page # ok 3 Test test_vma_reuse dirty bit of reused address page # Bail out! Reading PMD pagesize failed# Planned tests != run tests (15 != 3) # # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0 # [FAIL] not ok 1 soft-dirty # exit=1 # SUMMARY: PASS=0 SKIP=0 FAIL=1 # ------------------- # running ./migration # ------------------- # TAP version 13 # 1..3 # # Starting 3 tests from 1 test cases. # # RUN migration.private_anon ... # # OK migration.private_anon # ok 1 migration.private_anon # # RUN migration.shared_anon ... # # OK migration.shared_anon # ok 2 migration.shared_anon # # RUN migration.private_anon_thp ... # # migration.c:196:private_anon_thp:Expected madvise(ptr, TWOMEG, MADV_HUGEPAGE) (-1) == 0 (0) # # private_anon_thp: Test terminated by assertion # # FAIL migration.private_anon_thp # not ok 3 migration.private_anon_thp # # FAILED: 2 / 3 tests passed. # # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0 # [FAIL] not ok 1 migration # exit=1 It's true that CONFIG_TRANSPARENT_HUGEPAGE=y is explicitly enabled in tools/testing/selftests/mm/config, so ideally the runtime environment should also support THP. However, in practice, we've found that on some systems: - THP is disabled at boot time (transparent_hugepage=never) - Or manually disabled via sysfs - Or unavailable in RT kernels, containers, or minimal CI environments In these cases, the test will fail with EINVAL on madvise(MADV_HUGEPAGE), even though the kernel config is correct. To make the test suite more robust and avoid false negatives, this patch adds a runtime check for /sys/kernel/mm/transparent_hugepage/enabled. If THP is not available, the hugepage test (-H) is skipped with a clear message. Link: https://lkml.kernel.org/r/20250624032748.393836-1-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Keith Lucas <keith.lucas@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	selftests/mm: fix UFFDIO_API usage with proper two-step feature negotiation	Li Wang
	The current implementation of test_unmerge_uffd_wp() explicitly sets `uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP` before calling UFFDIO_API. This can cause the ioctl() call to fail with EINVAL on kernels that do not support UFFD-WP, leading the test to fail unnecessarily: # ------------------------------ # running ./ksm_functional_tests # ------------------------------ # TAP version 13 # 1..9 # # [RUN] test_unmerge # ok 1 Pages were unmerged # # [RUN] test_unmerge_zero_pages # ok 2 KSM zero pages were unmerged # # [RUN] test_unmerge_discarded # ok 3 Pages were unmerged # # [RUN] test_unmerge_uffd_wp # not ok 4 UFFDIO_API failed <----- # # [RUN] test_prot_none # ok 5 Pages were unmerged # # [RUN] test_prctl # ok 6 Setting/clearing PR_SET_MEMORY_MERGE works # # [RUN] test_prctl_fork # # No pages got merged # # [RUN] test_prctl_fork_exec # ok 7 PR_SET_MEMORY_MERGE value is inherited # # [RUN] test_prctl_unmerge # ok 8 Pages were unmerged # Bail out! 1 out of 8 tests failed # # Planned tests != run tests (9 != 8) # # Totals: pass:7 fail:1 xfail:0 xpass:0 skip:0 error:0 # [FAIL] This patch improves compatibility and robustness of the UFFD-WP test (test_unmerge_uffd_wp) by correctly implementing the UFFDIO_API two-step handshake as recommended by the userfaultfd(2) man page. Key changes: 1. Use features=0 in the initial UFFDIO_API call to query supported feature bits, rather than immediately requesting WP support. 2. Skip the test gracefully if: - UFFDIO_API fails with EINVAL (e.g. unsupported API version), or - UFFD_FEATURE_PAGEFAULT_FLAG_WP is not advertised by the kernel. 3. Close the initial userfaultfd and create a new one before enabling the required feature, since UFFDIO_API can only be called once per fd. 4. Improve diagnostics by distinguishing between expected and unexpected failures, using strerror() to report errors. This ensures the test behaves correctly across a wider range of kernel versions and configurations, while preserving the intended behavior on kernels that support UFFD-WP. [liwang@redhat.com: fail the test if sys_userfaultfd() fails, per David] Link: https://lkml.kernel.org/r/20250625004645.400520-1-liwang@redhat.com Link: https://lkml.kernel.org/r/20250624042411.395285-1-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Keith Lucas <keith.lucas@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Li Wang <liwang@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	maple_tree: add testing for restoring maple state to active	Liam R. Howlett
	Restoring maple status to ma_active on overflow/underflow when mas->node was NULL could have happened in the past, but was masked by a bug in mas_walk(). Add test cases that triggered the bug when the node was mas->node prior to fixing the maple state setup. Add a few extra tests around restoring the active maple status. Link: https://lore.kernel.org/all/202506191556.6bfc7b93-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250624154823.52221-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	maple_tree: fix status setup on restore to active	Liam R. Howlett
	During the initial call with a maple state, an error status may be set before a valid node is populated into the maple state node. Subsequent calls with the maple state may restore the state into an active state with no node set. This was masked by the mas_walk() always resetting the status to ma_start and result in an extra walk in this rare scenario. Don't restore the state to active unless there is a value in the structs node. This also allows mas_walk() to be fixed to use the active state without exposing an issue. User visible results are marginal performance improvements when an active state can be restored and used instead of rewalking the tree. Stable is not Cc'ed because the existing code is stable and the performance gains are not worth the risk. Link: https://lore.kernel.org/all/20250611011253.19515-1-richard.weiyang@gmail.com/ Link: https://lore.kernel.org/all/20250407231354.11771-1-richard.weiyang@gmail.com/ Link: https://lore.kernel.org/all/202506191556.6bfc7b93-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250624154823.52221-1-Liam.Howlett@oracle.com Fixes: a8091f039c1e ("maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Wei Yang <richard.weiyang@gmail.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202506191556.6bfc7b93-lkp@intel.com Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	selftests/mm: remove duplicate .gitignore entries	Moon Hee Lee
	Remove redundant entries in .gitignore confirmed by: $ sort tools/testing/selftests/mm/.gitignore \| uniq -d hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 These entries were originally added by [1], and later duplicated by [2]. [1] https://lore.kernel.org/all/20240924185911.117937-1-lorenzo.stoakes@oracle.com/ [2] https://lore.kernel.org/all/20241125064036.413536-1-lizhijian@fujitsu.com/ Link: https://lkml.kernel.org/r/20250626020758.163243-1-moonhee.lee.ca@gmail.com Signed-off-by: Moon Hee Lee <moonhee.lee.ca@gmail.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm: unexport globally copy_to_kernel_nofault	Sabyrzhan Tasbolatov
	copy_to_kernel_nofault() is an internal helper which should not be visible to loadable modules – exporting it would give exploit code a cheap oracle to probe kernel addresses. Instead, keep the helper un-exported and compile the kunit case that exercises it only when mm/kasan/kasan_test.o is linked into vmlinux. [snovitoll@gmail.com: add a brief comment to `#ifndef MODULE`] Link: https://lkml.kernel.org/r/20250622141142.79332-1-snovitoll@gmail.com Link: https://lkml.kernel.org/r/20250622051906.67374-1-snovitoll@gmail.com Fixes: ca79a00bb9a8 ("kasan: migrate copy_user_test to kunit") Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Marco Elver <elver@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	lib/test_vmalloc.c: restrict default test mask to avoid test warnings	Uladzislau Rezki (Sony)
	When the vmalloc test is built into the kernel, it runs automatically during the boot. The current-default "run_test_mask" includes all test cases, including those which are designed to fail and which trigger kernel warnings. These kernel splats can be misinterpreted as actual kernel bugs, leading to false alarms and unnecessary reports. To address this, limit the default test mask to only the first few tests which are expected to pass cleanly. These tests are safe and should not generate any warnings unless there is a real bug. Users who wish to explicitly run specific test cases have to pass the run_test_mask as a boot parameter or at module load time. Link: https://lkml.kernel.org/r/20250623184035.581229-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: David Wang <00107082@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	lib/test_vmalloc.c: use late_initcall() if built-in for init ordering	Uladzislau Rezki (Sony)
	When the vmalloc test code is compiled as a built-in, use late_initcall() instead of module_init() to defer a vmalloc test execution until most subsystems are up and running. It avoids interfering with components that may not yet be initialized at module_init() time. For example, there was a recent report of memory profiling infrastructure not being ready early enough leading to kernel crash. By using late_initcall() in the built-in case, we ensure the tests are run at a safer point during a boot sequence. Link: https://lkml.kernel.org/r/20250623184035.581229-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: David Wang <00107082@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm/damon/sysfs: decouple from damon_ops_id	SeongJae Park
	Decouple DAMON sysfs interface from damon_ops_id. For this, define and use new mm/damon/sysfs.c internal data structure that maps the user-space keywords and damon_ops_id, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm/damon/sysfs-schemes: decouple from damos_filter_type	SeongJae Park
	Decouple DAMOS sysfs interface from damos_filter_type. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_filter_type, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm/damon/sysfs-schemes: decouple from damos_wmark_metric	SeongJae Park
	Decouple DAMOS sysfs interface from damos_wmark_metric. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_wmark_metric, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm/damon/sysfs-schemes: decouple from damos_action	SeongJae Park
	Decouple DAMOS sysfs interface from damos_action. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_action, instead of having the implicit and unflexible array index rule. [akpm@linux-foundation.org: make damos_sysfs_action_names static] Closes: https://lore.kernel.org/oe-kbuild-all/202506271655.b8yfEZIT-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250622213759.50930-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm/damon/sysfs-schemes: decouple from damos_quota_goal_metric	SeongJae Park
	Patch series "mm/damon: decouple sysfs from core". DAMON sysfs interface is coupled with core layer. It maintains some of its keywords arrays be synchronized with matching DAMON core API enums. It is unnecessary coupling that makes separated changes for different layers difficult. Decouple the layers by introducing new data structure for the mappings on DAMON sysfs interface. This patch (of 5): Decouple DAMOS sysfs interface from damos_quota_goal_metric. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_quota_goal_metric, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250622213759.50930-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd()	Anshuman Khandual
	Memory hot remove unmaps and tears down various kernel page table regions as required. The ptdump code can race with concurrent modifications of the kernel page tables. When leaf entries are modified concurrently, the dump code may log stale or inconsistent information for a VA range, but this is otherwise not harmful. But when intermediate levels of kernel page table are freed, the dump code will continue to use memory that has been freed and potentially reallocated for another purpose. In such cases, the ptdump code may dereference bogus addresses, leading to a number of potential problems. To avoid the above mentioned race condition, platforms such as arm64, riscv and s390 take memory hotplug lock, while dumping kernel page table via the sysfs interface /sys/kernel/debug/kernel_page_tables. Similar race condition exists while checking for pages that might have been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages which in turn calls ptdump_check_wx(). Instead of solving this race condition again, let's just move the memory hotplug lock inside generic ptdump_check_wx() which will benefit both the scenarios. Drop get_online_mems() and put_online_mems() combination from all existing platform ptdump code paths. Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove") Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	selftests/mm: reduce uffd-unit-test poison test to minimum	Peter Xu
	The test will still generate quite some unwanted MCE error messages to syslog. There was old proposal ratelimiting the MCE messages from kernel, but that has risk of hiding real useful information on production systems. We can at least reduce the test to minimum to not over-pollute dmesg, however trying to not lose its coverage too much. [peterx@redhat.com: reduce uffd-unit-test poison test to minimum] Link: https://lkml.kernel.org/r/aF2RSsjuEOtzXcUa@x1.local Link: https://lkml.kernel.org/r/20250620150058.1729489-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	maple tree: use goto label to simplify code	Dev Jain
	Use the underflow goto label to set the status to ma_underflow and return NULL, as is being done elsewhere. [akpm@linux-foundation.org: add newline, per Liam (and remove one, per akpm)] Link: https://lkml.kernel.org/r/20250624080748.4855-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	selftets/damon: add a test for memcg_path leak	SeongJae Park
	There was a memory leak bug in DAMOS sysfs memcg_path file. Add a selftest to ensure the bug never comes back. Link: https://lkml.kernel.org/r/20250619183608.6647-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm/memremap: remove unused devmap_managed_key	Alistair Popple
	It's no longer used so remove it. Link: https://lkml.kernel.org/r/11516e39f33f809292ffccab1d46062f9bc248b3.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm: remove callers of pfn_t functionality	Alistair Popple
	All PFN_* pfn_t flags have been removed. Therefore there is no longer a need for the pfn_t type and all uses can be replaced with normal pfns. Link: https://lkml.kernel.org/r/bbedfa576c9822f8032494efbe43544628698b1f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm: remove PFN_DEV, PFN_MAP, PFN_SPECIAL, PFN_SG_CHAIN and PFN_SG_LAST	Alistair Popple
	The PFN_MAP flag is no longer used for anything, so remove it. The PFN_SG_CHAIN and PFN_SG_LAST flags never appear to have been used so also remove them. The last user of PFN_SPECIAL was removed by 653d7825c149 ("dcssblk: mark DAX broken, remove FS_DAX_LIMITED support"). Users of PFN_DEV were removed earlier in this series by "mm: Remove remaining uses of PFN_DEV". Link: https://lkml.kernel.org/r/670b3950d70b4d97b905bb597dadfd3633de4314.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09	mm: remove devmap related functions and page table bits	Alistair Popple
	Now that DAX and all other reference counts to ZONE_DEVICE pages are managed normally there is no need for the special devmap PTE/PMD/PUD page table bits. So drop all references to these, freeing up a software defined page table bit on architectures supporting it. Link: https://lkml.kernel.org/r/6389398c32cc9daa3dfcaa9f79c7972525d310ce.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Will Deacon <will@kernel.org> # arm64 Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>