2025-03-16  mm: set folio swapbacked iff folios are dirty in try_to_unmap_one  (Barry Song)
Patch series "mm: batched unmap lazyfree large folios during reclamation", v4. Commit 735ecdfaf4e8 ("mm/vmscan: avoid split lazyfree THP during shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in madvise.c. However, those folios are still added to the deferred_split list in try_to_unmap_one() because we are unmapping PTEs and removing rmap entries one by one. Firstly, this has rendered the following counter somewhat confusing, /sys/kernel/mm/transparent_hugepage/hugepages-size/stats/split_deferred The split_deferred counter was originally designed to track operations such as partial unmap or madvise of large folios. However, in practice, most split_deferred cases arise from memory reclamation of aligned lazyfree mTHPs as observed by Tangquan. This discrepancy has made the split_deferred counter highly misleading. Secondly, this approach is slow because it requires iterating through each PTE and removing the rmap one by one for a large folio. In fact, all PTEs of a pte-mapped large folio should be unmapped at once, and the entire folio should be removed from the rmap as a whole. Thirdly, it also increases the risk of a race condition where lazyfree folios are incorrectly set back to swapbacked, as a speculative folio_get may occur in the shrinker's callback. deferred_split_scan() might call folio_try_get(folio) since we have added the folio to split_deferred list while removing rmap for the 1st subpage, and while we are scanning the 2nd to nr_pages PTEs of this folio in try_to_unmap_one(), the entire mTHP could be transitioned back to swap-backed because the reference count is incremented, which can make "ref_count == 1 + map_count" within try_to_unmap_one() false. /* * The only page refs must be one from isolation * plus the rmap(s) (dropped by discard:). */ if (ref_count == 1 + map_count && (!folio_test_dirty(folio) || ... (vma->vm_flags & VM_DROPPABLE))) { dec_mm_counter(mm, MM_ANONPAGES); goto discard; } This patchset resolves the issue by marking only genuinely dirty folios as swap-backed, as suggested by David, and transitioning to batched unmapping of entire folios in try_to_unmap_one(). Consequently, the deferred_split count drops to zero, and memory reclamation performance improves significantly — reclaiming 64KiB lazyfree large folios is now 2.5x faster(The specific data is embedded in the changelog of patch 3/4). By the way, while the patchset is primarily aimed at PTE-mapped large folios, Baolin and Lance also found that try_to_unmap_one() handles lazyfree redirtied PMD-mapped large folios inefficiently — it splits the PMD into PTEs and iterates over them. This patchset removes the unnecessary splitting, enabling us to skip redirtied PMD-mapped large folios 3.5X faster during memory reclamation. (The specific data can be found in the changelog of patch 4/4). This patch (of 4): The refcount may be temporarily or long-term increased, but this does not change the fundamental nature of the folio already being lazy- freed. Therefore, we only reset 'swapbacked' when we are certain the folio is dirty and not droppable. 
Link: https://lkml.kernel.org/r/20250214093015.51024-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20250214093015.51024-2-21cnbao@gmail.com Fixes: 6c8e2a256915 ("mm: fix race between MADV_FREE reclaim and blkdev direct IO read") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Mauricio Faria de Oliveira <mfo@canonical.com> Cc: Chris Li <chrisl@kernel.org> (Google) Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Shaoqin Huang <shahuang@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  filemap: remove redundant folio_test_large check in filemap_free_folio  (Guanjun)
The folio_test_large() check in filemap_free_folio() is unnecessary because folio_nr_pages(), which is called internally, already performs this check. Removing the redundant condition simplifies the code and avoids double validation. This change improves code readability and reduces unnecessary operations in the folio freeing path. Link: https://lkml.kernel.org/r/20250213055612.490993-1-guanjun@linux.alibaba.com Signed-off-by: Guanjun <guanjun@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
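A hedged before/after reconstruction of the simplification (the rest of filemap_free_folio() is elided):

        /* Before: */
        unsigned long refs = 1;
        if (folio_test_large(folio))
                refs = folio_nr_pages(folio);
        folio_put_refs(folio, refs);

        /* After: folio_nr_pages() already returns 1 for small folios. */
        folio_put_refs(folio, folio_nr_pages(folio));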
2025-03-16  maple_tree: remove a BUG_ON() in mas_alloc_nodes()  (Petr Tesarik)
Remove a BUG_ON() right before a WARN_ON() with the same condition. Calling WARN_ON() and BUG_ON() here is definitely wrong. Since the goal is generally to remove BUG_ON() invocations from the kernel, keep only the WARN_ON(). Link: https://lkml.kernel.org/r/20250213114453.1078318-1-ptesarik@suse.com Fixes: 067311d33e65 ("maple_tree: separate ma_state node from status") Signed-off-by: Petr Tesarik <ptesarik@suse.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
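The pattern being removed, in a hedged sketch (the condition shown here is hypothetical, for illustration only):

        BUG_ON(count > limit);  /* removed: would panic before the next line runs */
        WARN_ON(count > limit); /* kept: reports the bug without killing the kernel */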
2025-03-16  tools/selftests: add file/shmem-backed mapping guard region tests  (Lorenzo Stoakes)
Extend the guard region self tests to explicitly assert that guard regions work correctly for functionality specific to file-backed and shmem mappings.

In addition to running all of the existing guard region tests, previously applied only to anonymous mappings, against file-backed and shmem mappings (except those tests which are exclusive to anonymous mappings), we now also:

* Test that MADV_SEQUENTIAL does not cause unexpected readahead behaviour.
* Test that MAP_PRIVATE behaves as expected with guard regions installed in both a shared and private mapping of an fd.
* Test that a read-only file can correctly establish guard regions.
* Test that a probable fault-around case does not interfere with guard regions (or vice-versa).
* Test that truncation does not eliminate guard regions.
* Test that hole punching functions as expected in the presence of guard regions.
* Test that a read-only mapping of a memfd write sealed mapping can have guard regions established within it and function correctly without violation of the seal.
* Test that guard regions installed into a mapping of the anonymous zero page function correctly.

Link: https://lkml.kernel.org/r/90c16bec5fcaafcd1700dfa3e9988c3e1aa9ac1d.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  tools/selftests: expand all guard region tests to file-backed  (Lorenzo Stoakes)
Extend the guard region tests to allow for test fixture variants for anon, shmem, and local files. This allows us to assert that each of the expected behaviours of anonymous memory also applies correctly to file-backed memory (both shmem and a file created locally in the current working directory) and thus asserts the same correctness guarantees as all the remaining tests do. The fixture teardown is now performed in the parent process rather than in child forked ones, meaning cleanup is always performed, including unlinking any generated temporary files. Additionally the variant fixture data type now contains an enum value indicating the type of backing store, and the mmap() invocation is abstracted to allow for the mapping of whichever backing store the variant is testing. We adjust tests as necessary to account for the fact they may now reference files rather than anonymous memory. Link: https://lkml.kernel.org/r/ab42228d2bd9b8aa18e9faebcd5c88732a7e5820.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  selftests/mm: rename guard-pages to guard-regions  (Lorenzo Stoakes)
The feature formerly referred to as guard pages is more correctly referred to as 'guard regions', as in fact no pages are ever allocated in the process of installing the regions. To avoid confusion, rename the tests accordingly. [lorenzo.stoakes@oracle.com: fix guard regions invocation] Link: https://lkml.kernel.org/r/13426c71-d069-4407-9340-b227ff8b8736@lucifer.local Link: https://lkml.kernel.org/r/1c3cd04a3f69b5756b94bda701ac88325a9be18b.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm: allow guard regions in file-backed and read-only mappings  (Lorenzo Stoakes)
Patch series "mm: permit guard regions for file-backed/shmem mappings". The guard regions feature was initially implemented to support anonymous mappings only, excluding shmem. This was done so as to introduce the feature carefully and incrementally and to be conservative when considering the various caveats and corner cases that are applicable to file-backed mappings but not to anonymous ones. Now this feature has landed in 6.13, it is time to revisit this and to extend this functionality to file-backed and shmem mappings. In order to make this maximally useful, and since one may map file-backed mappings read-only (for instance ELF images), we also remove the restriction on read-only mappings and permit the establishment of guard regions in any non-hugetlb, non-mlock()'d mapping. It is permissible to permit the establishment of guard regions in read-only mappings because the guard regions only reduce access to the mapping, and when removed simply reinstate the existing attributes of the underlying VMA, meaning no access violations can occur. While the change in kernel code introduced in this series is small, the majority of the effort here is spent in extending the testing to assert that the feature works correctly across numerous file-backed mapping scenarios. Every single guard region self-test performed against anonymous memory (which is relevant and not anon-only) has now been updated to also be performed against shmem and a mapping of a file in the working directory. This confirms that all cases also function correctly for file-backed guard regions. In addition a number of other tests are added for specific file-backed mapping scenarios. There are a number of other concerns that one might have with regard to guard regions, addressed below: Readahead ~~~~~~~~~ Readahead is a process through which the page cache is populated on the assumption that sequential reads will occur, thus amortising I/O and, through a clever use of the PG_readahead folio flag establishing during major fault and checked upon minor fault, provides for asynchronous I/O to occur as dat is processed, reducing I/O stalls as data is faulted in. Guard regions do not alter this mechanism which operates at the folio and fault level, but does of course prevent the faulting of folios that would otherwise be mapped. In the instance of a major fault prior to a guard region, synchronous readahead will occur including populating folios in the page cache which the guard regions will, in the case of the mapping in question, prevent access to. In addition, if PG_readahead is placed in a folio that is now inaccessible, this will prevent asynchronous readahead from occurring as it would otherwise do. However, there are mechanisms for heuristically resetting this within readahead regardless, which will 'recover' correct readahead behaviour. Readahead presumes sequential data access, the presence of a guard region clearly indicates that, at least in the guard region, no such sequential access will occur, as it cannot occur there. So this should have very little impact on any real workload. The far more important point is as to whether readahead causes incorrect or inappropriate mapping of ranges disallowed by the presence of guard regions - this is not the case, as readahead does not 'pre-fault' memory in this fashion. At any rate, any mechanism which would attempt to do so would hit the usual page fault paths, which correctly handle PTE markers as with anonymous mappings. 
Fault-Around
~~~~~~~~~~~~

The fault-around logic, in a similar vein to readahead, attempts to improve efficiency with regard to file-backed memory mappings, however it differs in that it does not try to fetch folios into the page cache that are about to be accessed, but rather pre-maps a range of folios around the faulting address.

Guard regions' use of PTE markers makes this relatively trivial, as this case is already handled - see filemap_map_folio_range() and filemap_map_order0_folio() - in both instances, the solution is to simply keep the established page table mappings and let the fault handler take care of PTE markers, as per the comment:

        /*
         * NOTE: If there're PTE markers, we'll leave them to be
         * handled in the specific fault path, and it'll prohibit
         * the fault-around logic.
         */

This works, as establishing guard regions results in page table mappings with PTE markers, and clearing them removes them.

Truncation
~~~~~~~~~~

File truncation will not eliminate existing guard regions, as the truncation operation will ultimately zap the range via unmap_mapping_range(), which specifically excludes PTE markers.

Zapping
~~~~~~~

Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(), which specifically deals with guard entries, leaving them intact except in instances such as process teardown or munmap() where they need to be removed.

Reclaim
~~~~~~~

When reclaim is performed on file-backed folios, it ultimately invokes try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte() will ultimately abort the operation for the guard region mapping. If large, then check_pte() will determine that this is a non-device private entry/device-exclusive entry 'swap' PTE and thus abort the operation in that instance. Therefore, no odd things happen in the instance of reclaim being attempted upon a file-backed guard region.

Hole Punching
~~~~~~~~~~~~~

This updates the page cache and ultimately invokes unmap_mapping_range(), which explicitly leaves PTE markers in place. Because the establishment of guard regions zapped any existing mappings to file-backed folios, once the guard regions are removed then the hole-punched region will be faulted in as usual and everything will behave as expected.

One thing to note with this series is that it implies file-backed VMAs which install guard regions will now have an anon_vma installed if not already present (i.e. if not post-CoW MAP_PRIVATE). I have audited kernel source for instances of vma->anon_vma checks and found nowhere where this would be problematic for pure file-backed mappings. I also discussed (off-list) with Matthew who confirmed he can't see any issue with this.

In effect, we treat these VMAs as if they are MAP_PRIVATE, only with 0 CoW'd pages. As a result, the rmap never has a reason to reference the anon_vma from folios at any point and thus no unexpected or weird behaviour results. The anon_vma logic tries to avoid unnecessary anon_vma propagation on fork so we ought to at least minimise overhead.

However, this is still overhead, and unwelcome overhead. The whole reason we do this (in madvise_guard_install()) is to ensure that fork _copies page tables_. Otherwise, in vma_needs_copy(), nothing will indicate that we should do so. This was already an unpleasant thing to have to do, but without a new VMA flag we really have no reasonable means of ensuring this happens.

Going forward, I intend to add a new VMA flag, VM_MAYBE_GUARDED or something like this.
This would have specific behaviour - for the purposes of merging, it would be ignored. However on both split and merge, it will be propagated. It is therefore 'sticky'. This is to avoid having to traverse page tables to determine which parts of a VMA contain guard regions, and of course to maintain the desirable qualities of guard regions - the lack of VMA propagation (and thus of slab allocations of VMAs).

Adding this flag and adjusting vma_needs_copy() to reference it would resolve the issue. However :) we have a VMA flag space issue, so it'd render this a 64-bit-only feature. Having discussed with Matthew a plan by which to perhaps extend available flags for 32-bit, we may, going forward, be able to avoid this. But this may be a longer term project.

In the meantime, we'd have to resort to the anon_vma hack for 32-bit, using the flag for 64-bit. The issue with this however is that if we do then intend to allow the flag to enable /proc/$pid/maps visibility (something this could allow), it would also end up being 64-bit only, which would be a pity. Regardless - I wanted to highlight this behaviour as it is perhaps somewhat surprising.

This patch (of 4): There is no reason to disallow guard regions in file-backed mappings - readahead and fault-around both function correctly in the presence of PTE markers, and equally other operations relating to memory-mapped files function correctly. Additionally, guard regions in read-only mappings only restrict the mapping further, which means there is no violation of any access rights by permitting this to be so. Removing this restriction allows guard regions to be used with read-only mapped files (such as executable files), which would otherwise not be permitted.

Link: https://lkml.kernel.org/r/cover.1739469950.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d885cb259174736c2830a5dfe07f81b214ef3faa.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
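As a hedged illustration of what this series permits, a minimal userspace sketch establishing a guard region in a read-only, file-backed mapping (MADV_GUARD_INSTALL is the madvise() operation added in 6.13; the fallback define is for older headers, and error handling is trimmed):

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #ifndef MADV_GUARD_INSTALL
        #define MADV_GUARD_INSTALL 102  /* <asm-generic/mman-common.h> */
        #endif

        int main(void)
        {
                long psz = sysconf(_SC_PAGESIZE);
                int fd = open("/bin/true", O_RDONLY);   /* any read-only file */
                char *p = mmap(NULL, 16 * psz, PROT_READ, MAP_PRIVATE, fd, 0);

                if (p == MAP_FAILED)
                        return 1;
                /* Previously rejected for file-backed/read-only mappings;
                 * now installs a guard region over one page. Touching
                 * p[4 * psz] afterwards raises SIGSEGV until the region
                 * is removed. */
                return madvise(p + 4 * psz, psz, MADV_GUARD_INSTALL);
        }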
2025-03-16  mm/mm_init.c: use round_up() to calculate usermap size  (Wei Yang)
Since pageblock_nr_pages and BITS_PER_LONG are powers of 2, we can use round_up() to calculate the size. Also, blockflags has been renamed to pageblock_flags; adjust the comment accordingly. Link: https://lkml.kernel.org/r/20250212013818.873-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
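An approximate, hedged sketch of the calculation in question (the real usemap_size() in mm/mm_init.c also folds in zone_start_pfn alignment):

        static unsigned long usemap_size(unsigned long zonesize)
        {
                unsigned long usemapsize;

                usemapsize = round_up(zonesize, pageblock_nr_pages) >> pageblock_order;
                usemapsize *= NR_PAGEBLOCK_BITS;
                /* Both alignments are powers of two, so the cheaper
                 * power-of-two round_up() applies directly. */
                usemapsize = round_up(usemapsize, BITS_PER_LONG);

                return usemapsize / BITS_PER_BYTE;
        }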
2025-03-16  selftests/mm: allow tests to run with no huge pages support  (Mark Brown)
Currently the mm selftests refuse to run if huge pages are not available in the current system, but this is an optional feature and not all of the tests actually require them. Change the test during startup to be non-fatal and skip or omit tests which actually rely on having huge pages, allowing the other tests to be run. The gup_test does support using madvise() to configure huge pages, but it ignores the error code, so we just let it run. Link: https://lkml.kernel.org/r/20250212-kselftest-mm-no-hugepages-v1-2-44702f538522@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Mariano Pache <npache@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/mmu_gather: clean up the stale code comment  (Baoquan He)
In commit d7f861b9c43a ("mm/mmu_gather: add __tlb_remove_folio_pages()"), helper function __tlb_remove_folio_pages_size() was added. And based on the helper, wrapper functions __tlb_remove_folio_pages() and __tlb_remove_page_size() are created and used by upper level functions. So let's update the code comment to reflect the current code about tlb_remove_page()/tlb_remove_page_size(), etc. Link: https://lkml.kernel.org/r/20250211034348.39531-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/mmu_gather: remove unused __tlb_remove_page()  (Baoquan He)
Nobody is using __tlb_remove_page() now, clean it up. And also remove the code comment above tlb_remove_page() because it's not meaningful any more. Link: https://lkml.kernel.org/r/Z6si0/A/zzEF/bFJ@MiWiFi-R3L-srv Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  maple_tree: use ma_dead_node() in mte_dead_node()  (I Hsin Cheng)
Utilize ma_dead_node() in mte_dead_node(). It can prevent decoding the maple enode for a second time. Use "node" to find the parent for comparison. Link: https://lkml.kernel.org/r/20250211071850.330632-1-richard120310@gmail.com Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/mm_init.c: only align start of ZONE_MOVABLE on nodes with memory  (Wei Yang)
At the beginning of find_zone_movable_pfns_for_nodes(), node_states[N_MEMORY] has already been properly set by early_calculate_totalpages(). Instead of iterating over all possible nodes, we can just do the alignment on nodes with memory. Link: https://lkml.kernel.org/r/20250211082900.10877-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  Docs/admin-guide/mm/damon/usage: document hugepage_size filter type  (Usama Arif)
This includes both the 'hugepage_size' filter type and the min/max files used to decide the range of sizes to filter on. Link: https://lkml.kernel.org/r/20250211124437.278873-5-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  Docs/ABI/damon: document DAMOS sysfs files to set the min/max folio_size  (Usama Arif)
This will be used to decide the min and max folio size to operate on for pa_stat. Link: https://lkml.kernel.org/r/20250211124437.278873-4-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/damon/sysfs-schemes: add files for setting damos_filter->sz_range  (Usama Arif)
Add min and max files for damon filters to let userspace decide the min/max folio size to operate on. This will be needed to decide what folio sizes to give pa_stat for. Link: https://lkml.kernel.org/r/20250211124437.278873-3-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/damon: introduce DAMOS filter type hugepage_size  (Usama Arif)
Patch series "mm/damon: add support for hugepage_size DAMOS filter", v5. hugepage_size DAMOS filter can be used to gather statistics to check if memory regions of specific access tempratures are backed by hugepages of a size in a specific range. This filter can help to observe and prove the effectivenes of different schemes for shrinking/collapsing hugepages. This patch (of 4): This is to gather statistics to check if memory regions of specific access tempratures are backed by pages of a size in a specific range. This filter can help to observe and prove the effectivenes of different schemes for shrinking/collapsing hugepages. [sj@kernel.org: add kernel-doc comment for damos_filter->sz_range] Link: https://lkml.kernel.org/r/20250218223058.52459-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250211124437.278873-1-usamaarif642@gmail.com Link: https://lkml.kernel.org/r/20250211124437.278873-2-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/mmu_gather: update comment on RCU freeing  (Brendan Jackman)
Some recent discussion on LKML [0] brought up some interesting and useful additional context on RCU-freeing for pagetables. Note down some extra info in here, in particular a) be concrete about the reason why an arch might not have an IPI and b) add the interesting paravirt details. [0] https://lore.kernel.org/linux-kernel/20250206044346.3810242-2-riel@surriel.com/ Link: https://lkml.kernel.org/r/20250211-mmugather-comment-v1-1-1ac1e0c765d2@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/vmstat: revert "fix a W=1 clang compiler warning"  (Bart Van Assche)
Commit 75b5ab134bb5 enabled -Wenum-enum-conversion in W=1 builds. Commit 30c2de0a267c ("mm/vmstat: fix a W=1 clang compiler warning") fixed a -Wenum-enum-conversion warning. Commit 8f6629c004b1 removed the -Wenum-enum-conversion option again from W=1 builds. Since the W=1 compiler warning fix is no longer necessary, revert it. Link: https://lkml.kernel.org/r/20250211205911.1707684-1-bvanassche@acm.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ivan Shapovalov <intelfx@intelfx.name> Cc: David Laight <david.laight@aculab.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  fb_defio: do not use deprecated page->mapping, index fields  (Lorenzo Stoakes)
With the introduction of mapping_wrprotect_range() there is no need to use folio_mkclean() in order to write-protect mappings of frame buffer pages, and therefore no need to inappropriately set kernel-allocated page->index, mapping fields to permit this operation. Instead, store the pointer to the page cache object for the mapped driver in the fb_deferred_io object, and use the already stored page offset from the pageref object to look up mappings in order to write-protect them. This is justified, as for the page objects to store a mapping pointer at the point of assignment of pages, they must all reference the same underlying address_space object. Since the lifetime of the pagerefs is also the lifetime of the fb_deferred_io object, storing the pointer here makes sense. This eliminates the need for all of the logic around setting and maintaining page->index, mapping, which we remove. This eliminates the use of folio_mkclean() entirely but otherwise should have no functional change. [lorenzo.stoakes@oracle.com: fixup unused variable warnings] Link: https://lkml.kernel.org/r/d4018405-2762-4385-a816-e54cc23839ac@lucifer.local Link: https://lkml.kernel.org/r/81171ab16c14e3df28f6de9d14982cee528d8519.1739029358.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Kajtar Zsolt <soci@c64.rulez.org> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Jaya Kumar <jayakumar.lkml@gmail.com> Cc: Maíra Canal <mcanal@igalia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm: provide mapping_wrprotect_range() function  (Lorenzo Stoakes)
In the fb_defio video driver, page dirty state is used to determine when frame buffer pages have been changed, allowing for batched, deferred I/O to be performed for efficiency. This implementation had only one means of doing so effectively - the use of the folio_mkclean() function. However, this use of the function is inappropriate, as the fb_defio implementation allocates kernel memory to back the framebuffer, and is then forced to set page->index, mapping fields in order to permit the folio_mkclean() rmap traversal to proceed correctly. It is not correct to specify these fields on kernel-allocated memory, and moreover since these are not folios, page->index, mapping are deprecated fields, soon to be removed. We therefore need to provide a means by which we can correctly traverse the reverse mapping and write-protect mappings for a page backing an address_space page cache object at a given offset. This patch provides this - mapping_wrprotect_range() - which allows for this operation to be performed for a specified address_space, offset, PFN and size, without requiring a folio nor, of course, an inappropriate use of page->index, mapping. With this provided, we can subsequently adjust the fb_defio implementation to make use of this function and avoid incorrect invocation of folio_mkclean() and, more importantly, incorrect manipulation of page->index and mapping fields. Link: https://lkml.kernel.org/r/e5bf969d64e7f2f2ae944d42341fc8994b736a81.1739029358.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Jaya Kumar <jayakumar.lkml@gmail.com> Cc: Kajtar Zsolt <soci@c64.rulez.org> Cc: Maíra Canal <mcanal@igalia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
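Given that description, a caller-side use would look roughly like this (a hedged sketch: the helper's argument order and the fb_defio field names here are assumptions based on the text above, not the merged API):

        /* Write-protect every user mapping of the page backing the
         * framebuffer at 'offset' within the stored page cache object,
         * so the next write refaults into page_mkwrite(). */
        mapping_wrprotect_range(fbdefio->mapping, offset >> PAGE_SHIFT,
                                page_to_pfn(page), 1);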
2025-03-16  mm: refactor rmap_walk_file() to separate out traversal logic  (Lorenzo Stoakes)
Patch series "expose mapping wrprotect, fix fb_defio use", v3. Right now the only means by which we can write-protect a range using the reverse mapping is via folio_mkclean(). However this is not always the appropriate means of doing so, specifically in the case of the framebuffer deferred I/O logic (fb_defio enabled by CONFIG_FB_DEFERRED_IO). There, kernel pages are mapped read-only and write-protect faults used to batch up I/O operations. Each time the deferred work is done, folio_mkclean() is used to mark the framebuffer page as having had I/O performed on it. However doing so requires the kernel page (perhaps allocated via vmalloc()) to have its page->mapping, index fields set so the rmap can find everything that maps it in order to write-protect. This is problematic as firstly, these fields should not be set for kernel-allocated memory, and secondly these are not folios (it's not user memory) and page->index, mapping fields are now deprecated and soon to be removed. The removal of these fields is imminent, rendering this series more urgent than it might first appear. The implementers cannot be blamed for having used this however, as there is simply no other way of performing this operation correctly. This series fixes this - we provide the mapping_wrprotect_range() function to allow the reverse mapping to be used to look up mappings from the page cache object (i.e. its address_space pointer) at a specific offset. The fb_defio logic already stores this offset, and can simply be expanded to keep track of the page cache object, so the change then becomes straight-forward. This series should have no functional change. This patch (of 3): In order to permit the traversal of the reverse mapping at a specified mapping and offset rather than those specified by an input folio, we need to separate out the portion of the rmap file logic which deals with this traversal from those parts of the logic which interact with the folio. This patch achieves this by adding a new static __rmap_walk_file() function which rmap_walk_file() invokes. This function permits the ability to pass NULL folio, on the assumption that the caller has provided for this correctly in the callbacks specified in the rmap_walk_control object. Though it provides for this, and adds debug asserts to ensure that, should a folio be specified, these are equal to the mapping and offset specified in the folio, there should be no functional change as a result of this patch. The reason for adding this is to enable for future changes to permit users to be able to traverse mappings of userland-mapped kernel memory, write-protecting those mappings to enable page_mkwrite() or pfn_mkwrite() fault handlers to be retriggered on subsequent dirty. Link: https://lkml.kernel.org/r/cover.1739029358.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/0d1acec0cba1e5a12f9b53efcabc397541c90517.1739029358.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Jaya Kumar <jayakumar.lkml@gmail.com> Cc: Kajtar Zsolt <soci@c64.rulez.org> Cc: Maíra Canal <mcanal@igalia.com> Cc: Matthew Wilcox <willy@infradead.org> [English fixes] Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Thomas Zimemrmann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  selftests: mm: fix typo  (Eric Salem)
Fix misspelling. Link: https://lkml.kernel.org/r/77e0e915-36c3-4c95-84b8-0b73aaa17951@gmail.com Signed-off-by: Eric Salem <ericsalem@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm: remove the access_ok() call from gup_fast_fallback()  (David Laight)
Historically the code relied on access_ok() to validate the address range. Commit 26f4c328079d7 added an explicit wrap check before access_ok(). Commit c28b1fc70390d then changed the wrap test to use check_add_overflow(). Commit 6014bc27561f2 relaxed the checks in x86-64's access_ok() and added an explicit check for TASK_SIZE here to make up for it. That left a pointless access_ok() call with its associated 'lfence' that can never actually fail. So just delete the test. Link: https://lkml.kernel.org/r/20250209174711.60889-1-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
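A hedged sketch of the validation that remains in gup_fast_fallback() after the deletion, per the commits cited above (surrounding code elided):

        unsigned long end;

        if (check_add_overflow(start, len, &end))
                return 0;
        if (end > TASK_SIZE_MAX)        /* replaces the access_ok() call */
                return -EFAULT;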
2025-03-16  maple_tree: correct comment for mas_start()  (I Hsin Cheng)
There's no mas->status value of "mas_start"; what the function is checking is whether mas->status equals "ma_start". Correct the comment for the function. Link: https://lkml.kernel.org/r/20250209181023.228856-1-richard120310@gmail.com Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Reviewed-by: Liam R. Howlett <howlett@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  vmscan, cleanup: add for_each_managed_zone_pgdat macro  (Bertrand Wlodarczyk)
The macro is introduced to eliminate redundancy in the repeated iteration over managed zones in the pgdat data structure, reducing the potential for errors. This change doesn't introduce any functional modifications. Because the pattern is concentrated in vmscan.c, the macro is placed locally in that file. Link: https://lkml.kernel.org/r/20250210160818.686-1-bertrand.wlodarczyk@intel.com Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
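A hedged reconstruction of what such a macro looks like (the merged form may differ in detail; the trailing if/else idiom lets the loop body attach while unmanaged zones are skipped):

        #define for_each_managed_zone_pgdat(zone, pgdat, idx, highest_zoneidx) \
                for ((idx) = 0, (zone) = (pgdat)->node_zones;           \
                     (idx) <= (highest_zoneidx);                        \
                     (idx)++, (zone)++)                                 \
                        if (!managed_zone(zone))                        \
                                continue;                               \
                        else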
2025-03-16  mm/damon/core: do damos walking in entire regions granularity  (SeongJae Park)
damos_walk_control can be installed while DAMOS is walking the regions. This means the walk callback function invocations can be started from a region in the middle of the regions list. This makes it hard to use reliably. In particular, the DAMOS tried-regions update for collecting monitoring results can get problematic results. Increase the walk_control_lock critical section to do the walking in entire-regions granularity. Link: https://lkml.kernel.org/r/20250210182737.134994-4-sj@kernel.org Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/damon/core: do not call damos_walk_control->walk() if walk is completed  (SeongJae Park)
damos_walk() invokes callback functions of schemes until every scheme finishes at least one round of walks. If there are multiple DAMOS schemes having different apply_interval, the callback functions for the scheme with the longer apply interval will be called for more than a round of the walk. The behavior is different from the documented one (see the damos_walk() kernel-doc comment), and not useful. Make the behavior match the documented one by stopping invoking the callback once the walk for the given scheme is completed. Link: https://lkml.kernel.org/r/20250210182737.134994-3-sj@kernel.org Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/damon/core: unset damos->walk_completed after confirmed set  (SeongJae Park)
Patch series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors". damos_walk() can finish working earlier or later than expected, and start earlier than practical. First two behaviors are clearly wrong behavior (doesn't follow the documentation) and all three behaviors are only making the feature useless. Fix those. This patch (of 3): damos->walk_completed is only set, not unset. This can cause next damos_walk() finish earlier than expected. Unset it after all walk_completed is confirmed. Link: https://lkml.kernel.org/r/20250210182737.134994-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250210182737.134994-2-sj@kernel.org Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/mm_init.c: use round_up() to align movable range  (Wei Yang)
Since MAX_ORDER_NR_PAGES is a power of 2, let's use a faster version. Link: https://lkml.kernel.org/r/20250207100453.9989-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm: use READ/WRITE_ONCE() for vma->vm_flags on migrate, mprotect  (Lorenzo Stoakes)
According to the syzbot report referenced here, it is possible to encounter a race between mprotect() writing to the vma->vm_flags field and migration checking whether the VMA is locked. There is no real problem with timing here per se, only that torn reads/writes may occur. Therefore, as a proximate fix, ensure both operations use READ_ONCE() and WRITE_ONCE() to avoid this. This race is possible due to the ability to look up VMAs via the rmap, which migration does in this case, and which takes no mmap or VMA lock and therefore does not preclude an operation to modify a VMA. When the final update of VMA flags is performed by mprotect, this will cause the rmap lock to be taken while the VMA is inserted on split/merge. However the means by which we perform splits/merges in the kernel is that we perform the split/merge operation on the VMA, acquiring/releasing locks as needed, and only then, after having done so, modify fields. We should carefully examine and determine whether we can combine the two operations so as to avoid such races, and whether it might be possible to otherwise annotate these rmap field accesses. Link: https://lkml.kernel.org/r/20250207172442.78836-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: syzbot+c2e5712cbb14c95d4847@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67a34e60.050a0220.50516.0040.GAE@google.com/ Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
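In sketch form (hedged; the patch touches the mprotect and migration paths rather than these exact lines, but the pattern is the standard one for avoiding torn accesses):

        /* mprotect (writer) side: a single, untearable store */
        WRITE_ONCE(vma->vm_flags, newflags);

        /* migration (lockless rmap reader) side: a single, untearable load */
        if (READ_ONCE(vma->vm_flags) & VM_LOCKED)
                goto out;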
2025-03-16  mm/damon: avoid applying DAMOS action to same entity multiple times  (SeongJae Park)
The 'paddr' DAMON operations set can apply a DAMOS scheme's action to a large folio multiple times in a single DAMOS regions walk if the folio is laid on multiple DAMON regions. Add a field to the DAMOS scheme object that can be used by the underlying ops to know the last entity to which the scheme's action was applied. The core layer unsets the field when each DAMOS regions walk is done for the given scheme. And update the 'paddr' ops to use the infrastructure to avoid the problem. Link: https://lkml.kernel.org/r/20250207212033.45269-3-sj@kernel.org Fixes: 57223ac29584 ("mm/damon/paddr: support the pageout scheme") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Usama Arif <usamaarif642@gmail.com> Closes: https://lore.kernel.org/20250203225604.44742-3-usamaarif642@gmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/damon/ops: have damon_get_folio return folio even for tail pages  (Usama Arif)
Patch series "mm/damon/paddr: fix large folios access and schemes handling". DAMON operations set for physical address space, namely 'paddr', treats tail pages as unaccessed always. It can also apply DAMOS action to a large folio multiple times within single DAMOS' regions walking. As a result, the monitoring output has poor quality and DAMOS works in unexpected ways when large folios are being used. Fix those. The patches were parts of Usama's hugepage_size DAMOS filter patch series[1]. The first fix has collected from there with a slight commit message change for the subject prefix. The second fix is re-written by SJ and posted as an RFC before this series. The second one also got a slight commit message change for the subject prefix. [1] https://lore.kernel.org/20250203225604.44742-1-usamaarif642@gmail.com [2] https://lore.kernel.org/20250206231103.38298-1-sj@kernel.org This patch (of 2): This effectively adds support for large folios in damon for paddr, as damon_pa_mkold/young won't get a null folio from this function and won't ignore it, hence access will be checked and reported. This also means that larger folios will be considered for different DAMOS actions like pageout, prioritization and migration. As these DAMOS actions will consider larger folios, iterate through the region at folio_size and not PAGE_SIZE intervals. This should not have an affect on vaddr, as damon_young_pmd_entry considers pmd entries. Link: https://lkml.kernel.org/r/20250207212033.45269-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250207212033.45269-2-sj@kernel.org Fixes: a28397beb55b ("mm/damon: implement primitives for physical address space monitoring") Signed-off-by: Usama Arif <usamaarif642@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  samples: kmemleak: print the raw pointers for debugging purposes  (Catalin Marinas)
The kmemleak-test.c module is meant to leak some pointers for debugging the kmemleak detection, pointer information dumping. It's no use if it prints the hashed values of such pointers. Change the printk() format from %p to %px. While at it, also display the raw __percpu pointer rather than this_cpu_ptr() since kmemleak now tracks such pointers independently of the standard allocations. Link: https://lkml.kernel.org/r/20250206114537.2597764-3-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
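The change in sketch form (the allocation shown is hypothetical; the point is %px versus the hashed %p):

        void *obj = kmalloc(32, GFP_KERNEL);

        /* %px prints the raw address so it can be matched against
         * kmemleak's own reports; %p would print a hashed value. */
        pr_info("kmemleak: kmalloc(32) = %px\n", obj);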
2025-03-16  mm: kmemleak: add support for dumping physical and __percpu object info  (Catalin Marinas)
Patch series "mm: kmemleak: Usability improvements". Following a recent false positive tracking that led to commit 488b5b9eca68 ("mm: kmemleak: fix upper boundary check for physical address objects"), I needed kmemleak to give me more debug information about the objects it is tracking. This lead to the first patch of this series. The second patch changes the kmemleak-test module to show the raw pointers for debugging purposes. This patch (of 2): Currently, echo dump=... > /sys/kernel/debug/kmemleak only looks up the main virtual address object tree. However, for debugging, it's useful to dump information about physical address and __percpu objects. Search all three object trees for the dump= command and also print the type of the object if not virtual: "(phys)" or "(percpu)". In addition, allow search by alias (pointer within the object). Link: https://lkml.kernel.org/r/20250206114537.2597764-1-catalin.marinas@arm.com Link: https://lkml.kernel.org/r/20250206114537.2597764-2-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm, percpu: do not consider sleepable allocations atomic  (Michal Hocko)
28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context") has fixed a reclaim recursion for scoped GFP_NOFS context. It has done that by avoiding taking pcpu_alloc_mutex. This is a correct solution as the worker context with full GFP_KERNEL allocation/reclaim power and which is using the same lock cannot block the NOFS pcpu_alloc caller. On the other hand this is a very conservative approach that could lead to failures because pcpu_alloc lockless implementation is quite limited. We have a bug report about premature failures when scsi array of 193 devices is scanned. Sometimes (not consistently) the scanning aborts because the iscsid daemon fails to create the queue for a random scsi device during the scan. iscsid itslef is running with PR_SET_IO_FLUSHER set so all allocations from this process context are GFP_NOIO. This in turn makes any pcpu_alloc lockless (without pcpu_alloc_mutex) which leads to pre-mature failures. It has turned out that iscsid has worked around this by dropping PR_SET_IO_FLUSHER (https://github.com/open-iscsi/open-iscsi/pull/382) when scanning host. But we can do better in this case on the kernel side and use pcpu_alloc_mutex for NOIO resp. NOFS constrained allocation scopes too. We just need the WQ worker to never trigger IO/FS reclaim. Achieve that by enforcing scoped GFP_NOIO for the whole execution of pcpu_balance_workfn (this will imply NOFS constrain as well). This will remove the dependency chain and preserve the full allocation power of the pcpu_alloc call. While at it make is_atomic really test for blockable allocations. Link: https://lkml.kernel.org/r/20250206122633.167896-1-mhocko@kernel.org Fixes: 28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dennis Zhou <dennis@kernel.org> Cc: Filipe David Manana <fdmanana@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swapfile.c: open code cluster_alloc_swap()  (Baoquan He)
cluster_alloc_swap() is only called in scan_swap_map_slots(). Also remove the stale code comment in scan_swap_map_slots(), because it no longer fits the current cluster allocation mechanism. Link: https://lkml.kernel.org/r/20250205092721.9395-13-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swapfile.c: remove the incorrect code comment  (Baoquan He)
Since commit eb085574a752 ("mm, swap: fix race between swapoff and some swap operations"), the non_swap_entry() check has been removed from __swap_duplicate(). Hence, in the kernel-doc comment, the line 'swp_entry is migration entry -> EINVAL' is obsolete. Remove that line to avoid misleading people. Link: https://lkml.kernel.org/r/20250205092721.9395-12-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swap: rename swap_swapcount() to swap_entry_swapped()  (Baoquan He)
The new function name can reflect the real behaviour of the function more clearly and more accurately. And the renaming avoids the confusion between swap_swapcount() and swp_swapcount(). Link: https://lkml.kernel.org/r/20250205092721.9395-11-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swapfile.c: remove the unneeded checking  (Baoquan He)
In free_swap_and_cache_nr(), the invocation of get_swap_device() has already checked whether it's a swap entry. So remove the redundant check here. Link: https://lkml.kernel.org/r/20250205092721.9395-10-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swap_state.c: remove the meaningless code comment  (Baoquan He)
Since commit 8d93b41c09d1 ("mm: Convert add_to_swap_cache to XArray"), -EEXIST is no longer returned, so the code comment doesn't make sense any more. Link: https://lkml.kernel.org/r/20250205092721.9395-9-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swapfile.c: optimize code in setup_clusters()  (Baoquan He)
In the last 'for' loop inside setup_clusters(), using two local variables 'k' and 'j' is obviously redundant. Using 'j' is enough and simpler. And also move the macro SWAP_CLUSTER_COLS close to its only user, setup_clusters(). Link: https://lkml.kernel.org/r/20250205092721.9395-8-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swapfile.c: update the code comment above swap_count_continued()  (Baoquan He)
Now, swap_count_continued() has two callers, __swap_duplicate() and __swap_entry_free_locked(), and the relevant code comment is stale. Update it to reflect the current situation. [bhe@redhat.com: v2] Link: https://lkml.kernel.org/r/Z6V0/UvG1fvkQ4t/@fedora Link: https://lkml.kernel.org/r/20250205092721.9395-7-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swap: rename swap_is_has_cache() to swap_only_has_cache()  (Baoquan He)
There are two predicates in the name of swap_is_has_cache(), which is confusing. Rename it to remove the confusion and better reflect its functionality. Link: https://lkml.kernel.org/r/20250205092721.9395-6-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swap: skip scanning cluster range if it's empty cluster  (Baoquan He)
Since ci->lock has been taken when isolating a cluster from si->free_clusters or taking si->percpu_cluster->next[order], it's unnecessary to scan and check the cluster range availability if it's an empty cluster, and this can accelerate huge page swapping. Link: https://lkml.kernel.org/r/20250205092721.9395-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swap: remove SWAP_FLAG_PRIO_SHIFT  (Baoquan He)
It doesn't make sense to have a zero value of shift. Remove it to avoid confusion. Link: https://lkml.kernel.org/r/20250205092721.9395-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/swap_state.c: optimize the code in clear_shadow_from_swap_cache()  (Baoquan He)
Use ALIGN to achieve the same effect and simplify the code. Link: https://lkml.kernel.org/r/20250205092721.9395-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
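In sketch form (hedged; the constant name follows the swap cache code, and the open-coded version being replaced is reconstructed):

        /* Before: three steps of shift, increment, shift back. */
        curr >>= SWAP_ADDRESS_SPACE_SHIFT;
        curr++;
        curr <<= SWAP_ADDRESS_SPACE_SHIFT;

        /* After: ALIGN() rounds curr up to the next swap address space
         * chunk boundary in one step. */
        curr = ALIGN(curr + 1, SWAP_ADDRESS_SPACE_PAGES);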
2025-03-16  mm/swap_state.c: fix the obsolete code comment  (Baoquan He)
Patch series "Tiny cleanup and improvements about SWAP code". These are all made during review and from reading the patchset "[PATCH v3 00/13] mm, swap: rework of swap allocator locks" from Kairui. This patch (of 12): Since commit 85a1333417a7 ("mm/swap: use dedicated entry for swap in folio"), there's a dedicated field in folio for swap entry. Let's update the code comment above add_to_swap_cache() accordingly. Link: https://lkml.kernel.org/r/20250205092721.9395-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20250205092721.9395-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Baoquan he <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> (Google) Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/memfd: fix spelling and grammatical issues  (Liu Ye)
The comment "If a private mapping then writability is irrelevant" contains a typo. It should be "If a private mapping then writability is irrelevant". The comment "SEAL_EXEC implys SEAL_WRITE, making W^X from the start." contains a typo. It should be "SEAL_EXEC implies SEAL_WRITE, making W^X from the start." Link: https://lkml.kernel.org/r/20250206060958.98010-1-liuye@kylinos.cn Signed-off-by: Liu Ye <liuye@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/madvise: remove redundant mmap_lock operations from process_madvise()  (SeongJae Park)
Optimize redundant mmap lock operations from process_madvise() by directly doing the mmap locking first, and then doing the remaining work for all ranges in the loop. [akpm@linux-foundation.org: update comment, per Lorenzo] Link: https://lkml.kernel.org/r/20250206061517.2958-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Liam R. Howlett <howlett@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
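A hedged sketch of the restructuring (the madvise_lock()/madvise_unlock()/madvise_do_behavior() helper names are assumptions based on the description; iter_iov_addr()/iter_iov_len() are the standard iov_iter accessors):

        /* Lock the mmap lock once, apply the behavior to every iovec
         * range, then unlock once - instead of a lock/unlock cycle per
         * range inside madvise(). */
        ret = madvise_lock(mm, behavior);
        if (ret)
                return ret;

        while (iov_iter_count(iter)) {
                ret = madvise_do_behavior(mm,
                                (unsigned long)iter_iov_addr(iter),
                                iter_iov_len(iter), behavior);
                if (ret < 0)
                        break;
                iov_iter_advance(iter, iter_iov_len(iter));
        }

        madvise_unlock(mm, behavior);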