summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2025-03-17mm/huge_memory: add two new (not yet used) functions for folio_split()Zi Yan
This is a preparation patch, both added functions are not used yet. The added __split_unmapped_folio() is able to split a folio with its mapping removed in two manners: 1) uniform split (the existing way), and 2) buddy allocator like (or non-uniform) split. The added __split_folio_to_order() can split a folio into any lower order. For uniform split, __split_unmapped_folio() calls it once to split the given folio to the new order. For buddy allocator like (non-uniform) split, __split_unmapped_folio() calls it (folio_order - new_order) times and each time splits the folio containing the given page to one lower order. [ziy@nvidia.com: unfreeze head folio after page cache entries are updated] Link: https://lkml.kernel.org/r/0F15DA7F-1977-412F-9A3E-F06B515D4BD2@nvidia.com [ziy@nvidia.com: use NULL instead of 0 for folio->private assignment] Link: https://lkml.kernel.org/r/1E11B9DD-3A87-4C9C-8FB4-E1324FB6A21A@nvidia.com Link: https://lkml.kernel.org/r/20250307174001.242794-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: swap_cgroup: remove double initialization of localsJohannes Weiner
Fixes: 6769183166b3 ("mm/swap_cgroup: decouple swap cgroup recording and clearing") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/vmalloc: refactor __vmalloc_node_range_noprof()Liu Ye
According to the code logic, the first parameter of the sub-function __get_vm_area_node() should be size instead of real_size. Then in __get_vm_area_node(), the size will be aligned, so the redundant alignment operation is deleted. The use of the real_size variable causes code redundancy, so it is removed to simplify the code. The real prefix is generally used to indicate the adjusted value of a parameter, but according to the code logic, it should indicate the original value, so it is recommended to rename it to original_align. Link: https://lkml.kernel.org/r/20250306072131.800499-1-liuye@kylinos.cn Signed-off-by: Liu Ye <liuye@kylinos.cn> Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Christop Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: page_owner: use new iteration APILuiz Capitulino
The page_ext_next() function assumes that page extension objects for a page order allocation always reside in the same memory section, which may not be true and could lead to crashes. Use the new page_ext iteration API instead. Link: https://lkml.kernel.org/r/93c80b040960fa2ebab4a9729073f77a30649862.1741301089.git.luizcap@redhat.com Fixes: cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios") Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: page_table_check: use new iteration APILuiz Capitulino
The page_ext_next() function assumes that page extension objects for a page order allocation always reside in the same memory section, which may not be true and could lead to crashes. Use the new page_ext iteration API instead. Link: https://lkml.kernel.org/r/ca2d53a020fe1cd65c442627ff6c0c40d591cbd8.1741301089.git.luizcap@redhat.com Fixes: cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios") Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: page_ext: add an iteration API for page extensionsLuiz Capitulino
Patch series "mm: page_ext: Introduce new iteration API", v3. Introduction ============ [ Thanks to David Hildenbrand for identifying the root cause of this issue and proving guidance on how to fix it. The new API idea, bugs and misconceptions are all mine though ] Currently, trying to reserve 1G pages with page_owner=on and sparsemem causes a crash. The reproducer is very simple: 1. Build the kernel with CONFIG_SPARSEMEM=y and the table extensions 2. Pass 'default_hugepagesz=1 page_owner=on' in the kernel command-line 3. Reserve one 1G page at run-time, this should crash (see patch 1 for backtrace) [ A crash with page_table_check is also possible, but harder to trigger ] Apparently, starting with commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios") we now pass the full allocation order to page extension clients and the page extension implementation assumes that all PFNs of an allocation range will be stored in the same memory section (which is not true for 1G pages). To fix this, this series introduces a new iteration API for page extension objects. The API checks if the next page extension object can be retrieved from the current section or if it needs to look up for it in another section. Please, find all details in patch 1. I tested this series on arm64 and x86 by reserving 1G pages at run-time and doing kernel builds (always with page_owner=on and page_table_check=on). This patch (of 3): The page extension implementation assumes that all page extensions of a given page order are stored in the same memory section. The function page_ext_next() relies on this assumption by adding an offset to the current object to return the next adjacent page extension. This behavior works as expected for flatmem but fails for sparsemem when using 1G pages. The commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios") exposes this issue, making it possible for a crash when using page_owner or page_table_check page extensions. The problem is that for 1G pages, the page extensions may span memory section boundaries and be stored in different memory sections. This issue was not visible before commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios") because alloc_contig_pages() never passed more than MAX_PAGE_ORDER to post_alloc_hook(). However, the series introducing mentioned commit changed this behavior allowing the full 1G page order to be passed. Reproducer: 1. Build the kernel with CONFIG_SPARSEMEM=y and table extensions support 2. Pass 'default_hugepagesz=1 page_owner=on' in the kernel command-line 3. Reserve one 1G page at run-time, this should crash (backtrace below) To address this issue, this commit introduces a new API for iterating through page extensions. The main iteration macro is for_each_page_ext() and it must be called with the RCU read lock taken. Here's an usage example: """ struct page_ext_iter iter; struct page_ext *page_ext; ... rcu_read_lock(); for_each_page_ext(page, 1 << order, page_ext, iter) { struct my_page_ext *obj = get_my_page_ext_obj(page_ext); ... } rcu_read_unlock(); """ The loop construct uses page_ext_iter_next() which checks to see if we have crossed sections in the iteration. In this case, page_ext_iter_next() retrieves the next page_ext object from another section. Thanks to David Hildenbrand for helping identify the root cause and providing suggestions on how to fix and optmize the solution (final implementation and bugs are all mine through). Lastly, here's the backtrace, without kasan you can get random crashes: [ 76.052526] BUG: KASAN: slab-out-of-bounds in __update_page_owner_handle+0x238/0x298 [ 76.060283] Write of size 4 at addr ffff07ff96240038 by task tee/3598 [ 76.066714] [ 76.068203] CPU: 88 UID: 0 PID: 3598 Comm: tee Kdump: loaded Not tainted 6.13.0-rep1 #3 [ 76.076202] Hardware name: WIWYNN Mt.Jade Server System B81.030Z1.0007/Mt.Jade Motherboard, BIOS 2.10.20220810 (SCP: 2.10.20220810) 2022/08/10 [ 76.088972] Call trace: [ 76.091411] show_stack+0x20/0x38 (C) [ 76.095073] dump_stack_lvl+0x80/0xf8 [ 76.098733] print_address_description.constprop.0+0x88/0x398 [ 76.104476] print_report+0xa8/0x278 [ 76.108041] kasan_report+0xa8/0xf8 [ 76.111520] __asan_report_store4_noabort+0x20/0x30 [ 76.116391] __update_page_owner_handle+0x238/0x298 [ 76.121259] __set_page_owner+0xdc/0x140 [ 76.125173] post_alloc_hook+0x190/0x1d8 [ 76.129090] alloc_contig_range_noprof+0x54c/0x890 [ 76.133874] alloc_contig_pages_noprof+0x35c/0x4a8 [ 76.138656] alloc_gigantic_folio.isra.0+0x2c0/0x368 [ 76.143616] only_alloc_fresh_hugetlb_folio.isra.0+0x24/0x150 [ 76.149353] alloc_pool_huge_folio+0x11c/0x1f8 [ 76.153787] set_max_huge_pages+0x364/0xca8 [ 76.157961] __nr_hugepages_store_common+0xb0/0x1a0 [ 76.162829] nr_hugepages_store+0x108/0x118 [ 76.167003] kobj_attr_store+0x3c/0x70 [ 76.170745] sysfs_kf_write+0xfc/0x188 [ 76.174492] kernfs_fop_write_iter+0x274/0x3e0 [ 76.178927] vfs_write+0x64c/0x8e0 [ 76.182323] ksys_write+0xf8/0x1f0 [ 76.185716] __arm64_sys_write+0x74/0xb0 [ 76.189630] invoke_syscall.constprop.0+0xd8/0x1e0 [ 76.194412] do_el0_svc+0x164/0x1e0 [ 76.197891] el0_svc+0x40/0xe0 [ 76.200939] el0t_64_sync_handler+0x144/0x168 [ 76.205287] el0t_64_sync+0x1ac/0x1b0 Link: https://lkml.kernel.org/r/cover.1741301089.git.luizcap@redhat.com Link: https://lkml.kernel.org/r/a45893880b7e1601082d39d2c5c8b50bcc096305.1741301089.git.luizcap@redhat.com Fixes: cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios") Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Luiz Capitulino <luizcap@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: remove redundant return in set_huge_zero_folio()Dev Jain
It is the responsibility of the caller to check pmd_none(); in any case, we are not achieving anything by returning since there is no return value to tell the caller that we succeeded or not. So remove this check. Link: https://lkml.kernel.org/r/20250306144315.21907-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon: remove damon_operations->reset_aggregatedSeongJae Park
The operations layer hook was introduced to let operations set do any aggregation data reset if needed. But it is not really be used now. Remove it. Link: https://lkml.kernel.org/r/20250306175908.66300-14-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon: remove damon_callback->before_damos_applySeongJae Park
The hook was introduced to let DAMON kernel API users access DAMOS schemes-eligible regions in a safe way. Now it is no more used by anyone, and the functionality is provided in a better way by damos_walk(). Remove it. Link: https://lkml.kernel.org/r/20250306175908.66300-13-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon: remove damon_callback->after_samplingSeongJae Park
The callback was used by DAMON sysfs interface for reading DAMON internal data. But it is no more being used, and damon_call() can do similar works in a better way. Remove it. Link: https://lkml.kernel.org/r/20250306175908.66300-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon: remove ->before_start of damon_callbackSeongJae Park
The function pointer field was added to be used as a place to do some initialization works just before DAMON starts working. However, nobody is using it now. Remove it. Link: https://lkml.kernel.org/r/20250306175908.66300-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs-schemes: remove obsolete comment for ↵SeongJae Park
damon_sysfs_schemes_clear_regions() The comment on damon_sysfs_schemes_clear_regions() function is obsolete, since it has updated to directly called from DAMON sysfs interface code. Remove the outdated comment. Link: https://lkml.kernel.org/r/20250306175908.66300-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs: remove damon_sysfs_cmd_request and its readersSeongJae Park
damon_sysfs_cmd_request is DAMON sysfs interface's own synchronization mechanism for accessing DAMON internal data via damon_callback hooks. All the users are now migrated to damon_call() and damos_walk(), so nobody really uses it. No one writes to the data structure but reading code is still remained. Remove the reading code and the entire data structure. Link: https://lkml.kernel.org/r/20250306175908.66300-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs: remove damon_sysfs_cmd_request_callback() and its callersSeongJae Park
damon_sysfs_cmd_request_callback() is the damon_callback hook functions that were used to handle user requests that need to read and/or write DAMON internal data. All the usages are now updated to use damon_call() or damos_walk(), though. Remove it and its callers. Link: https://lkml.kernel.org/r/20250306175908.66300-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs: remove damon_sysfs_cmd_request code from ↵SeongJae Park
damon_sysfs_handle_cmd() damon_sysfs_handle_cmd() handles user requests that it can directly handle on its own. For requests that need to be handled from damon_callback hooks, it uses DAMON sysfs interface's own synchronous damon_callback hooks management mechanism, namely damon_sysfs_cmd_request. Now all user requests are handled without damon_callback hooks, so damon_sysfs_cmd_request client code in damon_sysfs_andle_cmd() does nothing in real. Remove the unnecessary code. Link: https://lkml.kernel.org/r/20250306175908.66300-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs: handle commit command using damon_call()SeongJae Park
DAMON sysfs interface is using damon_callback->after_aggregation hook with its self-implemented synchronization mechanism for the hook. It is inefficient, complicated, and take up to one aggregation interval to complete, which can be long on some configs. Use damon_call() instead. It provides a synchronization mechanism that built inside DAMON's core layer, so more efficient than DAMON sysfs interface's own one. Also it isolates the implementation inside the core layer, and hence it makes the code easier to maintain. Finally, it takes up to one sampling interval, which is much shorter than the aggregation interval in common setups. Link: https://lkml.kernel.org/r/20250306175908.66300-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: make damon_set_attrs() be safe to be called from damon_call()SeongJae Park
Currently all DAMON kernel API callers do online DAMON parameters commit from damon_callback->after_aggregation because only those are safe place to call the DAMON monitoring attributes update function, namely damon_set_attrs(). Because damon_callback hooks provide no synchronization, the callers work in asynchronous ways or implement their own inefficient and complicated synchronization mechanisms. It also means online DAMON parameters commit can take up to one aggregation interval. On large systems having long aggregation intervals, that can be too slow. The synchronization can be done in more efficient and simple way while removing the latency constraint if it can be done using damon_call(). The fact that damon_call() can be executed in the middle of the aggregation makes damon_set_attrs() unsafe to be called from it, though. Two real problems can occur in the case. First, converting the not yet completely aggregated nr_accesses for new user-set intervals can arguably degrade the accuracy or at least make the logic complicated. Second, kdamond_reset_aggregated() will not be called after the monitoring results update, so next aggregation starts from unclean state. This can result in inconsistent and unexpected nr_accesses_bp. Make it safe as follows. Catch the middle-of-the-aggregation case from damon_set_attrs() by checking the passed_sample_intervals and next_aggregationsis of the context. And pass the information to nr_accesses conversion logic. The logic works as before if it is not the case (called after the current aggregation is completed). If it is the case (committing parameters in the middle of the aggregation), it drops the nr_accesses information that so far aggregated, and make the status same to the beginning of this aggregation, but as if the last aggregation was started with the updated sampling/aggregation intervals. The middle-of-aggregastion check introduce yet another edge case, though. This happens because kdamond_tune_intervals() can also call damon_set_attrs() with the middle-of-aggregation check. Consider damon_call() for parameters commit and kdamond_tune_intervals() are called in same iteration of kdamond main loop. Because kdamond_tune_interval() is called for aggregation intervals, it should be the end of the aggregation. The first damon_set_attrs() call from kdamond_call() understands it is the end of the aggregation and correctly handle it. But, because the damon_set_attrs() updated next_aggregation_sis of the context. Hence, the second damon_set_attrs() invocation from kdamond_tune_interval() believes it is called in the middle of the aggregation. It therefore resets aggregated information so far. After that, kdamond_reset_interval() is called and double-reset the aggregated information. Avoid this case, too, by setting the next_aggregation_sis before kdamond_tune_intervals() is invoked. Link: https://lkml.kernel.org/r/20250306175908.66300-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: invoke kdamond_call() after merging is done if possibleSeongJae Park
kdamond_call() callers may iterate the regions, so better to call it when the number of regions is as small as possible. It is when kdamond_merge_regions() is finished. Invoke it on the point. This change is also aimed to make future changes for carrying online parameters commit with damon_call() easier. The commit operation should be able to make sequence between other aggregation interval based operations including regioins merging and aggregation reset. Placing damon_call() invocation after the regions merging makes the sequence handling simpler. Link: https://lkml.kernel.org/r/20250306175908.66300-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs: validate user inputs from damon_sysfs_commit_input()SeongJae Park
Patch series "mm/damon/sysfs: commit parameters online via damon_call()". Due to the lack of ways to synchronously access DAMON internal data, DAMON sysfs interface is using damon_callback hooks with its own synchronization mechanism. The mechanism is built on top of damon_callback hooks in an ineifficient and complicated way. Patch series "mm/damon: replace most damon_callback usages in sysfs with new core functions", which starts with commit e035320fd38e ("mm/damon/sysfs-schemes: remove unnecessary schemes existence check in damon_sysfs_schemes_clear_regions()") introduced two new DAMON kernel API functions that providing the synchronous access, replaced most damon_callback hooks usage in DAMON sysfs interface, and cleaned up unnecessary code. Continue the replacement and cleanup works. Update the last DAMON sysfs' usage of its own synchronization mechanism, namely online DAMON parameters commit, to use damon_call() instead of the damon_callback hooks and the hard-to-maintain core-external synchronization mechanism. Then remove the no more be used code due to the change, and more unused code that just not yet cleaned up. The first four patches (patches 1-4) of this series makes DAMON sysfs interface's online parameters commit to use damon_call(). Then, following three patches (patches 5-7) remove the DAMON sysfs interface's own synchronization mechanism and its usages, which is no more be used by anyone due to the first four patches. Finally, six patches (8-13) do more cleanup of outdated comment and unused code. This patch (of 13): Online DAMON parameters commit via DAMON sysfs interface can make kdamond stop. This behavior was made because it can make the implementation simpler. The implementation tries committing the parameter without validation. If it finds something wrong in the middle of the parameters update, it returns error without reverting the partially committed parameters back. It is safe though, since it immediately breaks kdamond main loop in the case of the error return. Users can make the wrong parameters by mistake, though. Stopping kdamond in the case is not very useful behavior. Also this makes it difficult to utilize damon_call() instead of damon_callback hook for online parameters update, since damon_call() cannot immediately break kdamond main loop in the middle. Validate the input parameters and return error when it fails before starting parameters updates. In case of mistakenly wrong parameters, kdamond can continue running with the old and valid parameters. Link: https://lkml.kernel.org/r/20250306175908.66300-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250306175908.66300-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17arch, mm: make releasing of memory to page allocator more explicitMike Rapoport (Microsoft)
The point where the memory is released from memblock to the buddy allocator is hidden inside arch-specific mem_init()s and the call to memblock_free_all() is needlessly duplicated in every artiste cure and after introduction of arch_mm_preinit() hook, mem_init() implementation on many architecture only contains the call to memblock_free_all(). Pull memblock_free_all() call into mm_core_init() and drop mem_init() on relevant architectures to make it more explicit where the free memory is released from memblock to the buddy allocator and to reduce code duplication in architecture specific code. Link: https://lkml.kernel.org/r/20250313135003.836600-14-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17arch, mm: introduce arch_mm_preinitMike Rapoport (Microsoft)
Currently, implementation of mem_init() in every architecture consists of one or more of the following: * initializations that must run before page allocator is active, for instance swiotlb_init() * a call to memblock_free_all() to release all the memory to the buddy allocator * initializations that must run after page allocator is ready and there is no arch-specific hook other than mem_init() for that, like for example register_page_bootmem_info() in x86 and sparc64 or simple setting of mem_init_done = 1 in several architectures * a bunch of semi-related stuff that apparently had no better place to live, for example a ton of BUILD_BUG_ON()s in parisc. Introduce arch_mm_preinit() that will be the first thing called from mm_core_init(). On architectures that have initializations that must happen before the page allocator is ready, move those into arch_mm_preinit() along with the code that does not depend on ordering with page allocator setup. On several architectures this results in reduction of mem_init() to a single call to memblock_free_all() that allows its consolidation next. Link: https://lkml.kernel.org/r/20250313135003.836600-13-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17arch, mm: streamline HIGHMEM freeingMike Rapoport (Microsoft)
All architectures that support HIGHMEM have their code that frees high memory pages to the buddy allocator while __free_memory_core() is limited to freeing only low memory. There is no actual reason for that. The memory map is completely ready by the time memblock_free_all() is called and high pages can be released to the buddy allocator along with low memory. Remove low memory limit from __free_memory_core() and drop per-architecture code that frees high memory pages. Link: https://lkml.kernel.org/r/20250313135003.836600-12-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17arch, mm: set high_memory in free_area_init()Mike Rapoport (Microsoft)
high_memory defines upper bound on the directly mapped memory. This bound is defined by the beginning of ZONE_HIGHMEM when a system has high memory and by the end of memory otherwise. All this is known to generic memory management initialization code that can set high_memory while initializing core mm structures. Add a generic calculation of high_memory to free_area_init() and remove per-architecture calculation except for the architectures that set and use high_memory earlier than that. Link: https://lkml.kernel.org/r/20250313135003.836600-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17arch, mm: set max_mapnr when allocating memory map for FLATMEMMike Rapoport (Microsoft)
max_mapnr is essentially the size of the memory map for systems that use FLATMEM. There is no reason to calculate it in each and every architecture when it's anyway calculated in alloc_node_mem_map(). Drop setting of max_mapnr from architecture code and set it once in alloc_node_mem_map(). While on it, move definition of mem_map and max_mapnr to mm/mm_init.c so there won't be two copies for MMU and !MMU variants. Link: https://lkml.kernel.org/r/20250313135003.836600-10-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17page_io: zswap: do not crash the kernel on decompression failureNhat Pham
Currently, we crash the kernel when a decompression failure occurs in zswap (either because of memory corruption, or a bug in the compression algorithm). This is overkill. We should only SIGBUS the unfortunate process asking for the zswap entry on zswap load, and skip the corrupted entry in zswap writeback. See [1] for a recent upstream discussion about this. The zswap writeback case is relatively straightforward to fix. For the zswap_load() case, we change the return behavior: * Return 0 on success. * Return -ENOENT (with the folio locked) if zswap does not own the swapped out content. * Return -EIO if zswap owns the swapped out content, but encounters a decompression failure for some reasons. The folio will be unlocked, but not be marked up-to-date, which will eventually cause the process requesting the page to SIGBUS (see the handling of not-up-to-date folio in do_swap_page() in mm/memory.c), without crashing the kernel. * Return -EINVAL if we encounter a large folio, as large folio should not be swapped in while zswap is being used. Similar to the -EIO case, we also unlock the folio but do not mark it as up-to-date to SIGBUS the faulting process. As a side effect, we require one extra zswap tree traversal in the load and writeback paths. Quick benchmarking on a kernel build test shows no performance difference: With the new scheme: real: mean: 125.1s, stdev: 0.12s user: mean: 3265.23s, stdev: 9.62s sys: mean: 2156.41s, stdev: 13.98s The old scheme: real: mean: 125.78s, stdev: 0.45s user: mean: 3287.18s, stdev: 5.95s sys: mean: 2177.08s, stdev: 26.52s [nphamcs@gmail.com: fix documentation of zswap_load()] Link: https://lkml.kernel.org/r/20250306222453.1269456-1-nphamcs@gmail.com Link: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/ [1] Link: https://lkml.kernel.org/r/20250306205011.784787-1-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/hugetlb: update nr_huge_pages and surplus_huge_pages togetherLiu Shixin
In alloc_surplus_hugetlb_folio(), we increase nr_huge_pages and surplus_huge_pages separately. In the middle window, if we set nr_hugepages to smaller and satisfy count < persistent_huge_pages(h), the surplus_huge_pages will be increased by adjust_pool_surplus(). After adding delay in the middle window, we can reproduce the problem easily by following step: 1. echo 3 > /proc/sys/vm/nr_overcommit_hugepages 2. mmap two hugepages. When nr_huge_pages=2 and surplus_huge_pages=1, goto step 3. 3. echo 0 > /proc/sys/vm/nr_huge_pages Finally, nr_huge_pages is less than surplus_huge_pages. To fix the problem, call only_alloc_fresh_hugetlb_folio() instead and move down __prep_account_new_huge_page() into the hugetlb_lock. Link: https://lkml.kernel.org/r/20250305035409.2391344-1-liushixin2@huawei.com Fixes: 0c397daea1d4 ("mm, hugetlb: further simplify hugetlb allocation API") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs-schemes: return error when for attempts to install filters on ↵SeongJae Park
wrong sysfs directory Return error if the user tries to install a DAMOS filter on DAMOS filters sysfs directory that assumed to be used for filters that handled by a DAMON layer that not same to that for the installing filter. Link: https://lkml.kernel.org/r/20250305222733.59089-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs-schemes: record filters of which layer should be added to the ↵SeongJae Park
given filters directory Unlike their name and assumed purposes, {core,ops}_filters DAMOS sysfs directories are allowing installing any type of filters. As a first step for preventing such wrong installments, add information about filters that handled by what layer should the installed to the given filters directory in the DAMOS sysfs internal data structures. Link: https://lkml.kernel.org/r/20250305222733.59089-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: expose damos_filter_for_ops() to DAMON kernel API callersSeongJae Park
damos_filter_for_ops() can be useful to avoid putting wrong type of filters in wrong place. Make it be exposed to DAMON kernel API callers. Link: https://lkml.kernel.org/r/20250305222733.59089-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs-schemes: commit filters in {core,ops}_filters directoriesSeongJae Park
Connect user inputs for files under core_filters and ops_filters with DAMON, so that the files can really function. Becasuse {core,ops}_filters are easier to be managed in terms of expecting filters evaluation order, add filters in {core,ops}_filters before 'filters' directory. Link: https://lkml.kernel.org/r/20250305222733.59089-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs-schemes: implement core_filters and ops_filters directoriesSeongJae Park
Implement two DAMOS sysfs directories for managing core and operations layer handled filters separately. Those are named as 'core_filters' and 'ops_filters', and have files hierarchy same to 'filters'. This commit is only populating and cleaning up the directories, not really connecting the files with DAMON. Following changes will make the connections. Link: https://lkml.kernel.org/r/20250305222733.59089-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs-schemes: let damon_sysfs_scheme_set_filters() be used for ↵SeongJae Park
different named directories Patch series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers". DAMOS filters are categorized into two groups based on their handling layers, namely core and operations layers. The categorization affects when each filter is evaluated. Core layer handled filters are evaluated first. The order meant nothing before, but introduction of allow filters changed that. DAMOS sysfs interface provides single directory for filters, namely 'filters'. Users can install any filters in any order there. DAMON will internally categorize those into core and operations layer handled ones, and apply the evaluation order rule. The ordering rule is clearly documented. But the interface could still confuse users since it is allowed to install filters on the directory in mixed ways. Add two sysfs directories for managing filters by handling layers, namely 'core_filters' and 'ops_filters' for filters that handled by core and operations layer, respectively. Those are avoided to be used for installing filters that not handled by the assumed layers. For backward compatibility, keep 'filters' directory with its curernt behavior. Filters installed in the directory will be added to DAMON after those of 'core_filters' and 'ops_filters' directories, with the automatic categorizations. Also recommend users to use the new directories while noticing 'filters' directory could be deprecated in future on the usage documents. Note that new directories provide all features that were provided with 'filters', but just in a more clear way. Deprecating 'filters' in future will hence not make an irreversal feature loss. This patch (of 8): damon_sysfs_scheme_set_filters() is using a hard-coded directory name, "filters". Refactor for general named directories of same files hierarchy, to use from upcoming changes for adding sibling directories having files same to those of "filters", and named as "core_filters" and "ops_filters". [arnd@arndb.deL avoid Wformat-security warning] Link: https://lkml.kernel.org/r/20250310135142.4176976-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20250305222733.59089-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250305222733.59089-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: stop maintaining the per-page mapcount of large folios ↵David Hildenbrand
(CONFIG_NO_PAGE_MAPCOUNT) Everything is in place to stop using the per-page mapcounts in large folios: the mapcount of tail pages will always be logically 0 (-1 value), just like it currently is for hugetlb folios already, and the page mapcount of the head page is either 0 (-1 value) or contains a page type (e.g., hugetlb). Maintaining _nr_pages_mapped without per-page mapcounts is impossible, so that one also has to go with CONFIG_NO_PAGE_MAPCOUNT. There are two remaining implications: (1) Per-node, per-cgroup and per-lruvec stats of "NR_ANON_MAPPED" ("mapped anonymous memory") and "NR_FILE_MAPPED" ("mapped file memory"): As soon as any page of the folio is mapped -- folio_mapped() -- we now account the complete folio as mapped. Once the last page is unmapped -- !folio_mapped() -- we account the complete folio as unmapped. This implies that ... * "AnonPages" and "Mapped" in /proc/meminfo and /sys/devices/system/node/*/meminfo * cgroup v2: "anon" and "file_mapped" in "memory.stat" and "memory.numa_stat" * cgroup v1: "rss" and "mapped_file" in "memory.stat" and "memory.numa_stat ... can now appear higher than before. But note that these folios do consume that memory, simply not all pages are actually currently mapped. It's worth nothing that other accounting in the kernel (esp. cgroup charging on allocation) is not affected by this change. [why oh why is "anon" called "rss" in cgroup v1] (2) Detecting partial mappings Detecting whether anon THPs are partially mapped gets a bit more unreliable. As long as a single MM maps such a large folio ("exclusively mapped"), we can reliably detect it. Especially before fork() / after a short-lived child process quit, we will detect partial mappings reliably, which is the common case. In essence, if the average per-page mapcount in an anon THP is < 1, we know for sure that we have a partial mapping. However, as soon as multiple MMs are involved, we might miss detecting partial mappings: this might be relevant with long-lived child processes. If we have a fully-mapped anon folio before fork(), once our child processes and our parent all unmap (zap/COW) the same pages (but not the complete folio), we might not detect the partial mapping. However, once the child processes quit we would detect the partial mapping. How relevant this case is in practice remains to be seen. Swapout/migration will likely mitigate this. In the future, RMAP walkers could check for that for that case (e.g., when collecting access bits during reclaim) and simply flag them for deferred-splitting. Link: https://lkml.kernel.org/r/20250303163014.1128035-21-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: CONFIG_NO_PAGE_MAPCOUNT to prepare for not maintain per-page mapcounts ↵David Hildenbrand
in large folios We're close to the finishing line: let's introduce a new CONFIG_NO_PAGE_MAPCOUNT config option where we will incrementally remove any dependencies on per-page mapcounts in large folios. Once that's done, we'll stop maintaining the per-page mapcounts with this config option enabled. CONFIG_NO_PAGE_MAPCOUNT will be EXPERIMENTAL for now, as we'll have to learn about some of the real world impact of some of the implications. As writing "!CONFIG_NO_PAGE_MAPCOUNT" is really nasty, let's introduce a helper config option "CONFIG_PAGE_MAPCOUNT" that expresses the negation. Link: https://lkml.kernel.org/r/20250303163014.1128035-16-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: convert folio_likely_mapped_shared() to folio_maybe_mapped_shared()David Hildenbrand
Let's reuse our new MM ownership tracking infrastructure for large folios to make folio_likely_mapped_shared() never return false negatives -- never indicating "not mapped shared" although the folio *is* mapped shared. With that, we can rename it to folio_maybe_mapped_shared() and get rid of the dependency on the mapcount of the first folio page. The semantics are now arguably clearer: no mixture of "false negatives" and "false positives", only the remaining possibility for "false positives". Thoroughly document the new semantics. We might now detect that a large folio is "maybe mapped shared" although it *no longer* is -- but once was. Now, if more than two MMs mapped a folio at the same time, and the MM mapping the folio exclusively at the end is not one tracked in the two folio MM slots, we will detect the folio as "maybe mapped shared". For anonymous folios, usually (except weird corner cases) all PTEs that target a "maybe mapped shared" folio are R/O. As soon as a child process would write to them (iow, actively use them), we would CoW and effectively replace these PTEs. Most cases (below) are not expected to really matter with large anonymous folios for this reason. Most importantly, there will be no change at all for: * small folios * hugetlb folios * PMD-mapped PMD-sized THPs (single mapping) This change has the potential to affect existing callers of folio_likely_mapped_shared() -> folio_maybe_mapped_shared(): (1) fs/proc/task_mmu.c: no change (hugetlb) (2) khugepaged counts PTEs that target shared folios towards max_ptes_shared (default: HPAGE_PMD_NR / 2), meaning we could skip a collapse where we would have previously collapsed. This only applies to anonymous folios and is not expected to matter in practice. Worth noting that this change sorts out case (A) documented in commit 1bafe96e89f0 ("mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()") by removing the possibility for "false negatives". (3) MADV_COLD / MADV_PAGEOUT / MADV_FREE will not try splitting PTE-mapped THPs that are considered shared but not fully covered by the requested range, consequently not processing them. PMD-mapped PMD-sized THP are not affected, or when all PTEs are covered. These functions are usually only called on anon/file folios that are exclusively mapped most of the time (no other file mappings or no fork()), so the "false negatives" are not expected to matter in practice. (4) mbind() / migrate_pages() / move_pages() will refuse to migrate shared folios unless MPOL_MF_MOVE_ALL is effective (requires CAP_SYS_NICE). We will now reject some folios that could be migrated. Similar to (3), especially with MPOL_MF_MOVE_ALL, so this is not expected to matter in practice. Note that cpuset_migrate_mm_workfn() calls do_migrate_pages() with MPOL_MF_MOVE_ALL. (5) NUMA hinting mm/migrate.c:migrate_misplaced_folio_prepare() will skip file folios that are probably shared libraries (-> "mapped shared" and executable). This check would have detected it as a shared library at some point (at least 3 MMs mapping it), so detecting it afterwards does not sound wrong (still a shared library). Not expected to matter. mm/memory.c:numa_migrate_check() will indicate TNF_SHARED in MAP_SHARED file mappings when encountering a shared folio. Similar reasoning, not expected to matter. mm/mprotect.c:change_pte_range() will skip folios detected as shared in CoW mappings. Similarly, this is not expected to matter in practice, but if it would ever be a problem we could relax that check a bit (e.g., basing it on the average page-mapcount in a folio), because it was only an optimization when many (e.g., 288) processes were mapping the same folios -- see commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages.") (6) mm/rmap.c:folio_referenced_one() will skip exclusive swapbacked folios in dying processes. Applies to anonymous folios only. Without "false negatives", we'll now skip all actually shared ones. Skipping ones that are actually exclusive won't really matter, it's a pure optimization, and is not expected to matter in practice. In theory, one can detect the problematic scenario: folio_mapcount() > 0 and no folio MM slot is occupied ("state unknown"). One could reset the MM slots while doing an rmap walk, which migration / folio split already do when setting everything up. Further, when batching PTEs we might naturally learn about a owner (e.g., folio_mapcount() == nr_ptes) and could update the owner. However, we'll defer that until the scenarios where it would really matter are clear. Link: https://lkml.kernel.org/r/20250303163014.1128035-15-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: Copy-on-Write (COW) reuse support for PTE-mapped THPDavid Hildenbrand
Currently, we never end up reusing PTE-mapped THPs after fork. This wasn't really a problem with PMD-sized THPs, because they would have to be PTE-mapped first, but it's getting a problem with smaller THP sizes that are effectively always PTE-mapped. With our new "mapped exclusively" vs "maybe mapped shared" logic for large folios, implementing CoW reuse for PTE-mapped THPs is straight forward: if exclusively mapped, make sure that all references are from these (our) mappings. Add some helpful comments to explain the details. CONFIG_TRANSPARENT_HUGEPAGE selects CONFIG_MM_ID. If we spot an anon large folio without CONFIG_TRANSPARENT_HUGEPAGE in that code, something is seriously messed up. There are plenty of things we can optimize in the future: For example, we could remember that the folio is fully exclusive so we could speedup the next fault further. Also, we could try "faulting around", turning surrounding PTEs that map the same folio writable. But especially the latter might increase COW latency, so it would need further investigation. Link: https://lkml.kernel.org/r/20250303163014.1128035-14-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/rmap: basic MM owner tracking for large folios (!hugetlb)David Hildenbrand
For small folios, we traditionally use the mapcount to decide whether it was "certainly mapped exclusively" by a single MM (mapcount == 1) or whether it "maybe mapped shared" by multiple MMs (mapcount > 1). For PMD-sized folios that were PMD-mapped, we were able to use a similar mechanism (single PMD mapping), but for PTE-mapped folios and in the future folios that span multiple PMDs, this does not work. So we need a different mechanism to handle large folios. Let's add a new mechanism to detect whether a large folio is "certainly mapped exclusively", or whether it is "maybe mapped shared". We'll use this information next to optimize CoW reuse for PTE-mapped anonymous THP, and to convert folio_likely_mapped_shared() to folio_maybe_mapped_shared(), independent of per-page mapcounts. For each large folio, we'll have two slots, whereby a slot stores: (1) an MM id: unique id assigned to each MM (2) a per-MM mapcount If a slot is unoccupied, it can be taken by the next MM that maps folio page. In addition, we'll remember the current state -- "mapped exclusively" vs. "maybe mapped shared" -- and use a bit spinlock to sync on updates and to reduce the total number of atomic accesses on updates. In the future, it might be possible to squeeze a proper spinlock into "struct folio". For now, keep it simple, as we require the whole thing with THP only, that is incompatible with RT. As we have to squeeze this information into the "struct folio" of even folios of order-1 (2 pages), and we generally want to reduce the required metadata, we'll assign each MM a unique ID that can fit into an int. In total, we can squeeze everything into 4x int (2x long) on 64bit. 32bit support is a bit challenging, because we only have 2x long == 2x int in order-1 folios. But we can make it work for now, because we neither expect many MMs nor very large folios on 32bit. We will reliably detect folios as "mapped exclusively" vs. "mapped shared" as long as only two MMs map pages of a folio at one point in time -- for example with fork() and short-lived child processes, or with apps that hand over state from one instance to another. As soon as three MMs are involved at the same time, we might detect "maybe mapped shared" although the folio is "mapped exclusively". Example 1: (1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0 (2) App2 faults in a folio page -> Tracked as MM1 (4) App1 unmaps all folio pages -> We will detect "mapped exclusively". Example 2: (1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0 (2) App2 faults in a folio page -> Tracked as MM1 (3) App3 faults in a folio page -> No slot available, tracked as "unknown" (4) App1 and App2 unmap all folio pages -> We will detect "maybe mapped shared". Make use of __always_inline to keep possible performance degradation when (un)mapping large folios to a minimum. Note: by squeezing the two flags into the "unsigned long" that stores the MM ids, we can use non-atomic __bit_spin_unlock() and non-atomic setting/clearing of the "maybe mapped shared" bit, effectively not adding any new atomics on the hot path when updating the large mapcount + new metadata, which further helps reduce the runtime overhead in micro-benchmarks. Link: https://lkml.kernel.org/r/20250303163014.1128035-13-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/rmap: use folio_large_nr_pages() in add/remove functionsDavid Hildenbrand
Let's just use the "large" variant in code where we are sure that we have a large folio in our hands: this way we are sure that we don't perform any unnecessary "large" checks. While at it, convert the VM_BUG_ON_VMA to a VM_WARN_ON_ONCE. Maybe in the future there will not be a difference in that regard between large and small folios; in that case, unifying the handling again will be easy. E.g., folio_large_nr_pages() will simply translate to folio_nr_pages() until we replace all instances. Link: https://lkml.kernel.org/r/20250303163014.1128035-12-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/rmap: abstract large mapcount operations for large folios (!hugetlb)David Hildenbrand
Let's abstract the operations so we can extend these operations easily. Link: https://lkml.kernel.org/r/20250303163014.1128035-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/rmap: pass vma to __folio_add_rmap()David Hildenbrand
We'll need access to the destination MM when modifying the mapcount large folios next. So pass in the VMA. Link: https://lkml.kernel.org/r/20250303163014.1128035-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/rmap: pass dst_vma to folio_dup_file_rmap_pte() and friendsDavid Hildenbrand
We'll need access to the destination MM when modifying the large mapcount of a non-hugetlb large folios next. So pass in the destination VMA. Link: https://lkml.kernel.org/r/20250303163014.1128035-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: move _entire_mapcount in folio to page[2] on 32bitDavid Hildenbrand
Let's free up some space on 32bit in page[1] by moving the _pincount to page[2]. Ordinary folios only use the entire mapcount with PMD mappings, so order-1 folios don't apply. Similarly, hugetlb folios are always larger than order-1, turning the entire mapcount essentially unused for all order-1 folios. Moving it to order-1 folios will not change anything. On 32bit, simply check in folio_entire_mapcount() whether we have an order-1 folio, and return 0 in that case. Note that THPs on 32bit are not particularly common (and we don't care too much about performance), but we want to keep it working reliably, because likely we want to use large folios there as well in the future, independent of PMD leaf support. Once we dynamically allocate "struct folio", the 32bit specifics will go away again; even small folios could then have a pincount. Link: https://lkml.kernel.org/r/20250303163014.1128035-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: move _pincount in folio to page[2] on 32bitDavid Hildenbrand
Let's free up some space on 32bit in page[1] by moving the _pincount to page[2]. For order-1 folios (never anon folios!) on 32bit, we will now also use the GUP_PIN_COUNTING_BIAS approach. A fully-mapped order-1 folio requires 2 references. With GUP_PIN_COUNTING_BIAS being 1024, we'd detect such folios as "maybe pinned" with 512 full mappings, instead of 1024 for order-0. As anon folios are out of the picture (which are the most relevant users of checking for pinnings on *mapped* pages) and we are talking about 32bit, this is not expected to cause any trouble. In __dump_page(), copy one additional folio page if we detect a folio with an order > 1, so we can dump the pincount on order > 1 folios reliably. Note that THPs on 32bit are not particularly common (and we don't care too much about performance), but we want to keep it working reliably, because likely we want to use large folios there as well in the future, independent of PMD leaf support. Once we dynamically allocate "struct folio", fortunately the 32bit specifics will likely go away again; even small folios could then have a pincount and folio_has_pincount() would essentially always return "true". Link: https://lkml.kernel.org/r/20250303163014.1128035-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: move hugetlb specific things in folio to page[3]David Hildenbrand
Let's just move the hugetlb specific stuff to a separate page, and stop letting it overlay other fields for now. This frees up some space in page[2], which we will use on 32bit to free up some space in page[1]. While we could move these things to page[3] instead, it's cleaner to just move the hugetlb specific things out of the way and pack the core-folio stuff as tight as possible. ... and we can minimize the work required in dump_folio. We can now avoid re-initializing &folio->_deferred_list in hugetlb code. Hopefully dynamically allocating "strut folio" in the future will further clean this up. Link: https://lkml.kernel.org/r/20250303163014.1128035-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: let _folio_nr_pages overlay memcg_data in first tail pageDavid Hildenbrand
Let's free up some more of the "unconditionally available on 64BIT" space in order-1 folios by letting _folio_nr_pages overlay memcg_data in the first tail page (second folio page). Consequently, we have the optimization now whenever we have CONFIG_MEMCG, independent of 64BIT. We have to make sure that page->memcg on tail pages does not return "surprises". page_memcg_check() already properly refuses PageTail(). Let's do that earlier in print_page_owner_memcg() to avoid printing wrong "Slab cache page" information. No other code should touch that field on tail pages of compound pages. Reset the "_nr_pages" to 0 when splitting folios, or when freeing them back to the buddy (to avoid false page->memcg_data "bad page" reports). Note that in __split_huge_page(), folio_nr_pages() would stop working already as soon as we start messing with the subpages. Most kernel configs should have at least CONFIG_MEMCG enabled, even if disabled at runtime. 64byte "struct memmap" is what we usually have on 64BIT. While at it, rename "_folio_nr_pages" to "_nr_pages". Hopefully memdescs / dynamically allocating "strut folio" in the future will further clean this up, e.g., making _nr_pages available in all configs and maybe even in small folios. Doing that should be fairly easy on top of this change. [david@redhat.com: make "make htmldoc" happy] Link: https://lkml.kernel.org/r/a97f8a91-ec41-4796-81e3-7c9e0e491ba4@redhat.com Link: https://lkml.kernel.org/r/20250303163014.1128035-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/mremap: thread state through move page table operationLorenzo Stoakes
Finish refactoring the page table logic by threading the PMC state throughout the operation, allowing us to control the operation as we go. Additionally, update the old_addr, new_addr fields in move_page_tables() as we progress through the process making use of the fact we have this state object now to track this. With these changes made, not only is the code far more readable, but we can finally transmit state throughout the entire operation, which lays the groundwork for sensibly making changes in future to how the mremap() operation is performed. Additionally take the opportunity to refactor the means of determining the progress of the operation, abstracting this to pmc_progress() and simplifying the logic to make it clearer what's going on. Link: https://lkml.kernel.org/r/230dd7a2b7b01a6eef442678f284d575e800356e.1741639347.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/mremap: refactor move_page_tables(), abstracting stateLorenzo Stoakes
A lot of state is threaded throughout the page table moving logic within the mremap code, including boolean values which control behaviour specifically in regard to whether rmap locks need be held over the operation and whether the VMA belongs to a temporary stack being moved by move_arg_pages() (and consequently, relocate_vma_down()). As we already transmit state throughout this operation, it is neater and more readable to maintain a small state object. We do so in the form of pagetable_move_control. In addition, this allows us to update parameters within the state as we manipulate things, for instance with regard to the page table realignment logic. In future I want to add additional functionality to the page table logic, so this is an additional motivation for making it easier to do so. This patch changes move_page_tables() to accept a pointer to a pagetable_move_control struct, and performs changes at this level only. Further page table logic will be updated in a subsequent patch. We additionally also take the opportunity to add significant comments describing the address realignment logic to make it abundantly clear what is going on in this code. Link: https://lkml.kernel.org/r/e20180add9c8746184aa3f23a61fff69a06cdaa9.1741639347.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/mremap: complete refactor of move_vma()Lorenzo Stoakes
We invoke ksm_madvise() with an intentionally dummy flags field, so no need to pass around. Additionally, the code tries to be 'clever' with account_start, account_end, using these to both check that vma->vm_start != 0 and that we ought to account the newly split portion of VMA post-move, either before or after it. We need to do this because we intentionally removed VM_ACCOUNT on the VMA prior to unmapping, so we don't erroneously unaccount memory (we have already calculated the correct amount to account and accounted it, any subsequent subtraction will be incorrect). This patch significantly expands the comment (from 2002!) about 'concealing' the flag to make it abundantly clear what's going on, as well as adding and expanding a number of other comments also. We can remove account_start, account_end by instead tracking when we account (i.e. vma->vm_flags has the VM_ACCOUNT flag set, and this is not an MREMAP_DONTUNMAP operation), and figuring out when to reinstate the VM_ACCOUNT flag on prior/subsequent VMAs separately. We additionally break the function into logical pieces and attack the very confusing error handling logic (where, for instance, new_addr is set to err). After this change the code is considerably more readable and easy to manipulate. Link: https://lkml.kernel.org/r/e7eaa307e444ba2b04d94fd985c907c8e896f893.1741639347.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/mremap: initial refactor of move_vma()Lorenzo Stoakes
Update move_vma() to use the threaded VRM object, de-duplicate code and separate into smaller functions to aid readability and debug-ability. This in turn allows further simplification of expand_vma() as we can simply thread VRM through the function. We also take the opportunity to abstract the account charging page count into the VRM in order that we can correctly thread this through the operation. We additionally do the same for tracking mm statistics - exec_vm, stack_vm, data_vm, and locked_vm. As part of this change, we slightly modify when locked pages statistics are counted for in mm_struct statistics. However this should cause no issues, as there is no chance of underflow, nor will any rlimit failures occur as a result. This is an intermediate step before a further refactoring of move_vma() in order to aid review. Link: https://lkml.kernel.org/r/ab611d6efae11bddab2db2b8bb3925b1d1954c7d.1741639347.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/mremap: introduce and use vma_remap_struct threaded stateLorenzo Stoakes
A number of mremap() calls both pass around and modify a large number of parameters, making the code less readable and often repeatedly having to determine things such as VMA, size delta, and more. Avoid this by using the common pattern of passing a state object through the operation, updating it as we go. We introduce the vma_remap_struct or 'VRM' for this purpose. This also gives us the ability to accumulate further state through the operation that would otherwise require awkward and error-prone pointer passing. We can also now trivially define helper functions that operate on a VRM object. This pattern has proven itself to be very powerful when implemented for VMA merge, VMA unmapping and memory mapping operations, so it is battle-tested and functional. We both introduce the data structure and use it, introducing helper functions as needed to make things readable, we move some state such as mmap lock and mlock() status to the VRM, we introduce a means of classifying the type of mremap() operation and de-duplicate the get_unmapped_area() lookup. We also neatly thread userfaultfd state throughout the operation. Note that there is further refactoring to be done, chiefly adjust move_vma() to accept a VRM parameter. We defer this as there is pre-requisite work required to be able to do so which we will do in a subsequent patch. Link: https://lkml.kernel.org/r/27951739dc83b2b1523b81fa9c009ba348388d40.1741639347.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>