path: root/mm
Age  Commit message  Author
2025-09-13  mm/zswap: store <PAGE_SIZE compression failed page as-is  (SeongJae Park)
When zswap writeback is enabled and it fails to compress a given page, the page is swapped out to the backing swap device. This behavior breaks zswap's writeback LRU order, and hence users can experience unexpected latency spikes. If the page is compressed without failure but the result is PAGE_SIZE bytes, the LRU order is kept, but the decompression overhead for loading the page back on a later access is unnecessary.

Keep the LRU order and avoid the unnecessary decompression overhead in those cases by storing the original content as-is in the zpool. The length field of zswap_entry will be set appropriately, to PAGE_SIZE. Hence whether a page is saved as-is (and decompression is unnecessary) is identified by 'zswap_entry->length == PAGE_SIZE'. Because the uncompressed data is saved in the zpool, the same as compressed data, this introduces no change in terms of memory management, including movability and migratability of the involved pages.

This change does not increase per-entry zswap metadata overhead. But as the number of incompressible pages increases, the total zswap metadata overhead increases proportionally. The overhead should not be problematic in usual cases, since the zswap metadata for a single zswap entry is much smaller than PAGE_SIZE, and in common zswap use cases there should be a sufficient amount of compressible pages. It can also be mitigated by zswap writeback.

When writeback is disabled, the additional overhead could be problematic. In that case, keep the current behavior of simply returning the failure and letting swap_writeout() put the page back on the active LRU list.

Knowing how many incompressible pages are stored at a given moment will be useful for future investigations. Add a new debugfs file called stored_incompressible_pages for the purpose.

Tests
-----

I tested this patch using a simple self-written microbenchmark that is available on GitHub[1]. You can reproduce the test I did by executing run_tests.sh of the repo on your system. Note that the repo's documentation is not good as of this writing, so you may need to read and use the code.

The basic test scenario is simple. Run a test program that makes artificial accesses to memory with artificial content under a memory.high-set memory limit, and measure how many accesses were made in a given time.

The test program repeatedly and randomly accesses three anonymous memory regions. The regions are each 500 MiB in size and are accessed with the same probability. Two of them are filled with simple content that can easily be compressed, while the remaining one is filled with content read from /dev/urandom, which usually fails to compress to a size smaller than PAGE_SIZE. The program runs for two minutes and prints out the number of accesses made every five seconds.

The test script runs the program under the four configurations below.

- 0: memory.high is set to 2 GiB, zswap is disabled.
- 1-1: memory.high is set to 1350 MiB, zswap is disabled.
- 1-2: On 1-1, zswap is enabled without this patch.
- 1-3: On 1-2, this patch is applied.

For all zswap-enabled cases, the zswap shrinker is enabled. Configuration '0' is for showing the original memory performance. Configurations 1-1, 1-2 and 1-3 are for showing the performance of swap, zswap, and this patch under a level of memory pressure (~10% of the working set). Configurations 0 and 1-1 are not the main focus of this patch, but I'm adding them since their results transparently show how far this microbenchmark is from the real world.
Because the per-5-seconds performance numbers are not very reliable, I measured their average over the last one-minute period of each test program run. I also measured a few vmstat counters, including zswpin, zswpout, zswpwb, pswpin and pswpout, during the test runs.

The measurement results are as below. To save space, I show performance numbers that are normalized to that of configuration '0' (no memory pressure). The averaged accesses per 5 seconds of configuration '0' was 36493417.75.

  config            0        1-1      1-2      1-3
  perf_normalized   1.0000   0.0057   0.0235   0.0367
  perf_stdev_ratio  0.0582   0.0652   0.0167   0.0346
  zswpin            0        0        3548424  1999335
  zswpout           0        0        3588817  2361689
  zswpwb            0        0        10214    340270
  pswpin            0        485806   772038   340967
  pswpout           0        649543   144773   340270

'perf_normalized' is the performance metric, normalized to that of configuration '0' (no pressure). 'perf_stdev_ratio' is the standard deviation of the averaged data points, as a ratio to the averaged metric value. For example, configuration '0' showed a 5.8% stdev, and configurations 1-1 and 1-3 showed about 6.5% and 6.1% stdev. The results were also highly variable between runs, so these are ballpark figures rather than stable measurements. Please keep this in mind when reading them.

Under about 10% working-set memory pressure, performance dropped to about 0.57% of the no-pressure case when normal swap is used (1-1). Note that ~10% working-set pressure is already extreme, at least on this test setup. No one would desire system setups that can degrade performance to 0.57% of the best case.

By turning zswap on (1-2), the performance improved about 4x, to about 2.35% of the no-pressure case. Because of the incompressible pages in the third memory region, a significant number of (non-zswap) swap I/O operations were still made, though.

By applying this patch (1-3), about a 56% performance improvement was made, resulting in about 3.67% of the no-pressure case. The reduced pswpin of 1-3 compared to 1-2 shows where this improvement came from.

Tests without Zswap Shrinker
----------------------------

The zswap shrinker is not enabled by default, so I ran the above test again after disabling the zswap shrinker. The results are as below.

  config            0        1-1      1-2      1-3
  perf_normalized   1.0000   0.0056   0.0185   0.0260
  perf_stdev_ratio  0.0467   0.0348   0.1832   0.3387
  zswpin            0        0        2506765  6049078
  zswpout           0        0        2534357  6115426
  zswpwb            0        0        0        0
  pswpin            0        463694   472978   0
  pswpout           0        686227   612149   0

The overall normalized performance of the different configs is very similar to the zswap-shrinker-enabled case. Adding the memory pressure dropped the performance to 0.56% of the original one. Enabling zswap without the zswap shrinker increased the performance to 1.85% of the original one. Applying this patch on top further increased the performance to 2.6% of the original one.

Even though the zswap shrinker is disabled, 1-2 shows high numbers of pswpin and pswpout because the incompressible pages are directly swapped out. 1-3 shows zero pswpin and pswpout since it keeps incompressible pages in memory, and it shows higher performance.

Note that the performance of 1-2 and 1-3 varies considerably. The standard deviation of the performance for 1-2 was about 18.32% of the mean, while that for 1-3 was about 33.87%. Because the zswap shrinker is disabled and the memory pressure is induced by memory.high, the workload incurred penalty_jiffies sleeps, which destabilized the performance.
Related Works
-------------

This is not an entirely new attempt. Nhat Pham and Takero Funaki tried very similar approaches in October 2023[2] and April 2024[3], respectively. The two approaches didn't get merged mainly due to the metadata overhead concern. At the beginning of this changelog I described why I think that shouldn't be a problem for this change, which is automatically disabled when writeback is disabled.

This patch is not particularly different from those, and is actually built upon them, although I wrote it from scratch again. Hence the Suggested-by tags for them; Nhat actually first suggested this to me off-list.

Historically, writeback disabling was introduced partially as a way to solve the LRU order issue. Yosry pointed out[4] that this is still suboptimal when the incompressible pages are cold, since zswap will repeatedly try to store them, burning CPU cycles on compression attempts that will fail anyway. One imaginable solution for the problem is reusing the swapped-out page and its struct page for storage in the zswap pool. But that's out of the scope of this patch.

[sj@kernel.org: mark zswap_stored_incompressible_pages as static] Link: https://lkml.kernel.org/r/20250821161750.78192-1-sj@kernel.org [sj@kernel.org: v5] Link: https://lkml.kernel.org/r/20250822190817.49287-1-sj@kernel.org [sj@kernel.org: cleanup incompressible pages handling code] Link: https://lkml.kernel.org/r/20250828163913.57957-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250819193404.46680-1-sj@kernel.org Link: https://github.com/sjp38/eval_zswap/blob/master/run.sh [1] Link: https://lore.kernel.org/20231017003519.1426574-3-nphamcs@gmail.com [2] Link: https://lore.kernel.org/20240706022523.1104080-6-flintglass@gmail.com [3] Link: https://lore.kernel.org/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com [4] Signed-off-by: SeongJae Park <sj@kernel.org> Suggested-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Takero Funaki <flintglass@gmail.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: SeongJae Park <sj@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
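The store/load decision described above boils down to a small amount of control flow. A minimal sketch, assuming simplified helper names and placeholders (writeback_enabled, dst_buf, src_buf); the real code in mm/zswap.c is organized differently:

    /* store side: on compression failure or a >= PAGE_SIZE result, keep the
     * original content in the zpool, but only when writeback is enabled */
    if (!zswap_compress(page, entry) || entry->length >= PAGE_SIZE) {
            if (!writeback_enabled)
                    return false;           /* old behavior: let swap_writeout() handle it */
            memcpy_from_page(dst_buf, page, 0, PAGE_SIZE);
            entry->length = PAGE_SIZE;       /* marks "stored as-is" */
    }

    /* load side: PAGE_SIZE-length entries need no decompression */
    if (entry->length == PAGE_SIZE)
            memcpy_to_page(page, 0, src_buf, PAGE_SIZE);
    else
            zswap_decompress(entry, page);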
2025-09-13  mm: remove redundant __GFP_NOWARN  (Qianfeng Rong)
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made GFP_NOWAIT implicitly include __GFP_NOWARN. Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these redundant flags across subsystems. No functional changes. Link: https://lkml.kernel.org/r/20250812135225.274316-1-rongqianfeng@vivo.com Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
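Since GFP_NOWAIT now carries __GFP_NOWARN, the cleanup is purely mechanical. A small before/after illustration; the surrounding allocation is made up for the example:

    /* before: __GFP_NOWARN is redundant, GFP_NOWAIT already includes it */
    entry = kzalloc(sizeof(*entry), GFP_NOWAIT | __GFP_NOWARN);

    /* after: identical behavior, failure is still silent */
    entry = kzalloc(sizeof(*entry), GFP_NOWAIT);
    if (!entry)
            return -ENOMEM;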
2025-09-13  mm: convert core mm to mm_flags_*() accessors  (Lorenzo Stoakes)
As part of the effort to move mm->flags to a bitmap field, convert existing users to the mm_flags_*() accessors, which will, when the conversion is complete, be the only means of accessing mm_struct flags. The debug output becomes bitmap output, which is a minor change here, but since this is for debug only it should have no bearing. Otherwise, no functional changes intended.

[akpm@linux-foundation.org: fix typo in comment] Link: https://lkml.kernel.org/r/1eb2266f4408798a55bda00cb04545a3203aa572.1755012943.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dev Jain <dev.jain@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Mariano Pache <npache@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
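The conversion itself is mechanical: open-coded bitops on mm->flags become accessor calls. An illustrative sketch; the accessor spellings follow the mm_flags_*() naming used by the series and should be checked against the actual helpers:

    /* before: direct bitops on the mm->flags word */
    if (test_bit(MMF_HAS_UPROBES, &mm->flags))
            clear_bit(MMF_HAS_UPROBES, &mm->flags);

    /* after: accessors that keep working once mm->flags becomes a bitmap */
    if (mm_flags_test(MMF_HAS_UPROBES, mm))
            mm_flags_clear(MMF_HAS_UPROBES, mm);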
2025-09-13  mm: add persistent huge zero folio  (Pankaj Raghav)
Many places in the kernel need to zero out larger chunks, but the maximum segment that can be zeroed out at a time by ZERO_PAGE is limited to PAGE_SIZE. This is especially annoying in block devices and filesystems, where multiple ZERO_PAGEs are attached to the bio in different bvecs. With multipage bvec support in the block layer, it is much more efficient to send out larger zero pages as part of a single bvec. This concern was raised during the review of adding Large Block Size support to XFS[1][2].

Usually huge_zero_folio is allocated on demand, and it will be deallocated by the shrinker if there are no users of it left. At the moment, the huge_zero_folio infrastructure's refcount is tied to the lifetime of the process that created it. This might not work for the bio layer, as completions can be async and the process that created the huge_zero_folio might no longer be alive. Also, one of the main points that came up during discussion is to have something bigger than the zero page as a drop-in replacement.

Add a config option PERSISTENT_HUGE_ZERO_FOLIO that allocates the huge zero folio during early init and never frees the memory, by disabling the shrinker. This makes it possible to use the huge_zero_folio without having to pass any mm struct, and does not tie the lifetime of the zero folio to anything, making it a drop-in replacement for ZERO_PAGE.

If the PERSISTENT_HUGE_ZERO_FOLIO config option is enabled, then mm_get_huge_zero_folio() will simply return the allocated folio instead of dynamically allocating a new PMD page. Use this option carefully in resource-constrained systems, as it uses one full PMD-sized page for zeroing purposes.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/ [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/ Link: https://lkml.kernel.org/r/20250811084113.647267-4-kernel@pankajraghav.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Co-developed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Kiryl Shutsemau <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
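A hedged illustration of what the option buys; the bio usage is an assumption about future callers such as blkdev_issue_zero_pages(), not something this patch converts:

    /* with CONFIG_PERSISTENT_HUGE_ZERO_FOLIO=y this returns the folio that was
     * allocated at early init; the shrinker is disabled, so the caller does not
     * need to hold on to the mm (or any refcount) to keep the folio alive,
     * even from an async bio completion */
    struct folio *zero = mm_get_huge_zero_folio(mm);

    if (zero)
            /* one PMD-sized zero bvec instead of HPAGE_PMD_NR ZERO_PAGE bvecs */
            bio_add_folio(bio, zero, HPAGE_PMD_SIZE, 0);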
2025-09-13  mm: rename MMF_HUGE_ZERO_PAGE to MMF_HUGE_ZERO_FOLIO  (Pankaj Raghav)
As all the helper functions have been renamed from *_page to *_folio, rename the MM flag from MMF_HUGE_ZERO_PAGE to MMF_HUGE_ZERO_FOLIO. No functional changes. Link: https://lkml.kernel.org/r/20250811084113.647267-3-kernel@pankajraghav.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kiryl Shutsemau <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm: rename huge_zero_page to huge_zero_folio  (Pankaj Raghav)
Patch series "add persistent huge zero folio support", v3. Many places in the kernel need to zero out larger chunks, but the maximum segment we can zero out at a time by ZERO_PAGE is limited by PAGE_SIZE. This concern was raised during the review of adding Large Block Size support to XFS[1][2]. This is especially annoying in block devices and filesystems where multiple ZERO_PAGEs are attached to the bio in different bvecs. With multipage bvec support in block layer, it is much more efficient to send out larger zero pages as a part of single bvec. Some examples of places in the kernel where this could be useful: - blkdev_issue_zero_pages() - iomap_dio_zero() - vmalloc.c:zero_iter() - rxperf_process_call() - fscrypt_zeroout_range_inline_crypt() - bch2_checksum_update() ... Usually huge_zero_folio is allocated on demand, and it will be deallocated by the shrinker if there are no users of it left. At the moment, huge_zero_folio infrastructure refcount is tied to the process lifetime that created it. This might not work for bio layer as the completions can be async and the process that created the huge_zero_folio might no longer be alive. And, one of the main point that came during discussion is to have something bigger than zero page as a drop-in replacement. Add a config option PERSISTENT_HUGE_ZERO_FOLIO that will always allocate the huge_zero_folio, and disable the shrinker so that huge_zero_folio is never freed. This makes using the huge_zero_folio without having to pass any mm struct and does not tie the lifetime of the zero folio to anything, making it a drop-in replacement for ZERO_PAGE. I have converted blkdev_issue_zero_pages() as an example as a part of this series. I also noticed close to 4% performance improvement just by replacing ZERO_PAGE with persistent huge_zero_folio. I will send patches to individual subsystems using the huge_zero_folio once this gets upstreamed. [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/ [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/ As the transition already happened from exposing huge_zero_page to huge_zero_folio, change the name of the shrinker and the other helper function to reflect that. No functional changes. Link: https://lkml.kernel.org/r/20250811084113.647267-1-kernel@pankajraghav.com Link: https://lkml.kernel.org/r/20250811084113.647267-2-kernel@pankajraghav.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kiryl Shutsemau <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()  (David Hildenbrand)
... and hide it behind a kconfig option. There is really no need for any !xen code to perform this check. The naming is a bit off: we want to find the "normal" page when a PTE was marked "special". So it's really not "finding a special" page. Improve the documentation, and add a comment in the code where XEN ends up performing the pte_mkspecial() through a hypercall. More details can be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as special on x86 PV guests"). Link: https://lkml.kernel.org/r/20250811112631.759341-12-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm: introduce and use vm_normal_page_pud()  (David Hildenbrand)
Let's introduce vm_normal_page_pud(), which ends up being fairly simple because of our new common helpers and there not being a PUD-sized zero folio. Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO, structuring the code like the other (pmd/pte) cases. Defer introducing vm_normal_folio_pud() until really used. Note that we can so far get PUDs with hugetlb, daxfs and PFNMAP entries. Link: https://lkml.kernel.org/r/20250811112631.759341-11-david@redhat.com Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm/memory: factor out common code from vm_normal_page_*()  (David Hildenbrand)
Let's reduce the code duplication and factor out the non-pte/pmd related magic into __vm_normal_page(). To keep it simpler, check the pfn against both zero folios, which shouldn't really make a difference. It's a good question if we can even hit the !CONFIG_ARCH_HAS_PTE_SPECIAL scenario in the PMD case in practice: but doesn't really matter, as it's now all unified in vm_normal_page_pfn(). Add kerneldoc for all involved functions. Note that, as a side product, we now: * Support the find_special_page special thingy also for PMD * Don't check for is_huge_zero_pfn() anymore if we have CONFIG_ARCH_HAS_PTE_SPECIAL and the PMD is not special. The VM_WARN_ON_ONCE would catch any abuse No functional change intended. Link: https://lkml.kernel.org/r/20250811112631.759341-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm/memory: convert print_bad_pte() to print_bad_page_map()  (David Hildenbrand)
print_bad_pte() looks like something that should actually be a WARN or similar, but historically it apparently has proven to be useful to detect corruption of page tables even on production systems -- report the issue and keep the system running to make it easier to actually detect what is going wrong (e.g., multiple such messages might shed a light). As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have to take care of print_bad_pte() as well. Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the implementation and renaming the function to print_bad_page_map(). Provide print_bad_pte() as a simple wrapper. Document the implicit locking requirements for the page table re-walk. To make the function a bit more readable, factor out the ratelimit check into is_bad_page_map_ratelimited() and place the printing of page table content into __print_bad_page_map_pgtable(). We'll now dump information from each level in a single line, and just stop the table walk once we hit something that is not a present page table. The report will now look something like (dumping pgd to pmd values): [ 77.943408] BUG: Bad page map in process XXX pte:80000001233f5867 [ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ... [ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067 Not using pgdp_get(), because that does not work properly on some arm configs where pgd_t is an array. Note that we are dumping all levels even when levels are folded for simplicity. [david@redhat.com: drop warning] Link: https://lkml.kernel.org/r/923b279c-de33-44dd-a923-2959afad8626@redhat.com Link: https://lkml.kernel.org/r/20250811112631.759341-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
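A rough sketch of the wrapper arrangement the commit describes; the print_bad_page_map() prototype here is an assumption for illustration, only the old print_bad_pte() signature is taken from mm/memory.c:

    /* kept for the PTE callers; the generic reporter handles any level */
    static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
                              pte_t pte, struct page *page)
    {
            /* assumed prototype: (vma, addr, raw entry value, mapped page) */
            print_bad_page_map(vma, addr, pte_val(pte), page);
    }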
2025-09-13  mm/rmap: convert "enum rmap_level" to "enum pgtable_level"  (David Hildenbrand)
Let's factor it out, and convert all checks for unsupported levels to BUILD_BUG(). The code is written in a way such that force-inlining will optimize out the levels. [nathan@kernel.org: always inline __folio_rmap_sanity_checks()] Link: https://lkml.kernel.org/r/20250814-rmap-fix-build_bug-conversion-v1-1-fb7b10a0b362@kernel.org Link: https://lkml.kernel.org/r/20250811112631.759341-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
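The BUILD_BUG() pattern relies on the level being a compile-time constant in a force-inlined helper, so unsupported levels fail the build instead of becoming runtime checks. A sketch of the idiom; the enum constant and function names are illustrative, not copied from the series:

    static __always_inline void folio_rmap_sanity(enum pgtable_level level)
    {
            switch (level) {
            case PGTABLE_LEVEL_PTE:
            case PGTABLE_LEVEL_PMD:
            case PGTABLE_LEVEL_PUD:
                    break;
            default:
                    /* level is always a constant here, so an unsupported value
                     * is rejected at compile time and the switch is optimized
                     * away entirely */
                    BUILD_BUG();
            }
    }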
2025-09-13  mm/huge_memory: mark PMD mappings of the huge zero folio special  (David Hildenbrand)
The huge zero folio is refcounted (+mapcounted -- is that a word?) differently than "normal" folios, similarly (but different) to the ordinary shared zeropage. For this reason, we special-case these pages in vm_normal_page*/vm_normal_folio*, and only allow selected callers to still use them (e.g., GUP can still take a reference on them). vm_normal_page_pmd() already filters out the huge zero folio, to indicate it a special (return NULL). However, so far we are not making use of pmd_special() on architectures that support it (CONFIG_ARCH_HAS_PTE_SPECIAL), like we would with the ordinary shared zeropage. Let's mark PMD mappings of the huge zero folio similarly as special, so we can avoid the manual check for the huge zero folio with CONFIG_ARCH_HAS_PTE_SPECIAL next, and only perform the check on !CONFIG_ARCH_HAS_PTE_SPECIAL. In copy_huge_pmd(), where we have a manual pmd_special() check to handle PFNMAP, we have to manually rule out the huge zero folio. That code needs a serious cleanup, but that's something for another day. While at it, update the doc regarding the shared zero folios. No functional change intended: vm_normal_page_pmd() still returns NULL when it encounters the huge zero folio. Link: https://lkml.kernel.org/r/20250811112631.759341-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()  (David Hildenbrand)
Just like we do for vmf_insert_page_mkwrite() -> ... -> insert_page_into_pte_locked() with the shared zeropage, support the huge zero folio in vmf_insert_folio_pmd(). When (un)mapping the huge zero folio in page tables, we neither adjust the refcount nor the mapcount, just like for the shared zeropage. For now, the huge zero folio is not marked as special yet, although vm_normal_page_pmd() really wants to treat it as special. We'll change that next. Link: https://lkml.kernel.org/r/20250811112631.759341-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm/huge_memory: move more common code into insert_pud()  (David Hildenbrand)
Let's clean it all further up. No functional change intended. Link: https://lkml.kernel.org/r/20250811112631.759341-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm/huge_memory: move more common code into insert_pmd()  (David Hildenbrand)
Patch series "mm: vm_normal_page*() improvements", v3. Cleanup and unify vm_normal_page_*() handling, also marking the huge zerofolio as special in the PMD. Add+use vm_normal_page_pud() and cleanup that XEN vm_ops->find_special_page thingy. There are plans of using vm_normal_page_*() more widely soon. This patch (of 11): Let's clean it all further up. No functional change intended. Link: https://lkml.kernel.org/r/20250811112631.759341-1-david@redhat.com Link: https://lkml.kernel.org/r/20250811112631.759341-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  treewide: remove MIGRATEPAGE_SUCCESS  (David Hildenbrand)
At this point MIGRATEPAGE_SUCCESS is misnamed for all folio users, and now that we remove MIGRATEPAGE_UNMAP, it's really the only "success" return value that the code uses and expects. Let's just get rid of MIGRATEPAGE_SUCCESS completely and just use "0" for success. Link: https://lkml.kernel.org/r/20250811143949.1117439-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> [mm] Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> [jfs] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Byungchul Park <byungchul@sk.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
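At a call site, the treewide change is a straight substitution of the symbolic success value with 0. A hedged before/after illustration (the surrounding calls are representative of mm/migrate.c but the exact sites vary per subsystem):

    /* before: callers compared against the symbolic name */
    rc = move_to_new_folio(dst, src, mode);
    if (rc == MIGRATEPAGE_SUCCESS)
            folio_migrate_flags(dst, src);
    return MIGRATEPAGE_SUCCESS;

    /* after: 0 is the only success value, matching the usual kernel convention */
    rc = move_to_new_folio(dst, src, mode);
    if (!rc)
            folio_migrate_flags(dst, src);
    return 0;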
2025-09-13  mm/migrate: remove MIGRATEPAGE_UNMAP  (David Hildenbrand)
migrate_folio_unmap() is the only user of MIGRATEPAGE_UNMAP. We want to remove MIGRATEPAGE_* completely. It's rather weird to have a generic MIGRATEPAGE_UNMAP, documented to be returned from address-space callbacks, when it's only used for an internal helper. Let's start by having only a single "success" return value for migrate_folio_unmap() -- 0 -- by moving the "folio was already freed" check into the single caller. There is a remaining comment for PG_isolated, which we renamed to PG_movable_ops_isolated recently and forgot to update. While we might still run into that case with zsmalloc, it's something we want to get rid of soon. So let's just focus that optimization on real folios only for now by excluding movable_ops pages. Note that concurrent freeing can happen at any time and this "already freed" check is not relevant for correctness. [david@redhat.com: no need to pass "reason" to migrate_folio_unmap(), per Lance] Link: https://lkml.kernel.org/r/3bb725f8-28d7-4aa2-b75f-af40d5cab280@redhat.com Link: https://lkml.kernel.org/r/20250811143949.1117439-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm/mincore: use a helper for checking the swap cache  (Kairui Song)
Introduce a mincore_swap() helper for checking swap entries. Move all swap-related logic and the sanity debug check into it, and separate them from the page cache checking. The performance is better after this commit.

mincore_page() is never called on a swap cache space now, so the logic can be simpler. The sanity check also covers more potential cases now: previously the WARN_ON only caught a potentially corrupted page table; now, if shmem contains a swap entry with !CONFIG_SWAP, a WARN will be triggered. This changes the mincore value when the WARN is triggered, but this shouldn't matter. The WARN_ON means the data is already corrupted or something is very wrong, so it really should not happen.

Before this series:
mincore on a swapped out 16G anon mmap range: Took 488220 us
mincore on 16G shmem mmap range: Took 530272 us

After this commit:
mincore on a swapped out 16G anon mmap range: Took 446763 us
mincore on 16G shmem mmap range: Took 460496 us

About ~10% faster.

Link: https://lkml.kernel.org/r/20250811172018.48901-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
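A rough sketch of the shape of such a helper, assuming a simplified signature and return convention; the real mincore_swap() in mm/mincore.c differs in details such as shmem handling:

    /* returns 1 if the swap entry's data is effectively "in core" */
    static unsigned char mincore_swap(swp_entry_t entry)
    {
            struct folio *folio;

            if (!IS_ENABLED(CONFIG_SWAP)) {
                    WARN_ON(1);     /* a swap entry without CONFIG_SWAP means corruption */
                    return 0;
            }
            if (non_swap_entry(entry))
                    return 1;       /* migration/poison entries: treat as resident here */

            /* in core only if the entry currently sits in the swap cache */
            folio = filemap_get_folio(swap_address_space(entry),
                                      swap_cache_index(entry));
            if (IS_ERR(folio))
                    return 0;
            folio_put(folio);
            return 1;
    }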
2025-09-13  mm/mincore, swap: consolidate swap cache checking for mincore  (Kairui Song)
Patch series "mm/mincore: minor clean up for swap cache checking". This series cleans up a swap cache helper only used by mincore, move it back into mincore code. Also separate the swap cache related logics out of shmem / page cache logics in mincore. With this series we have less lines of code and better performance. Before this series: mincore on a swaped out 16G anon mmap range: Took 488220 us mincore on 16G shmem mmap range: Took 530272 us. After this series: mincore on a swaped out 16G anon mmap range: Took 446763 us mincore on 16G shmem mmap range: Took 460496 us. About ~10% faster. This patch (of 2): The filemap_get_incore_folio (previously find_get_incore_page) helper was introduced by commit 61ef18655704 ("mm: factor find_get_incore_page out of mincore_page") to be used by later commit f5df8635c5a3 ("mm: use find_get_incore_page in memcontrol"), so memory cgroup charge move code can be simplified. But commit 6b611388b626 ("memcg-v1: remove charge move code") removed that user completely, it's only used by mincore now. So this commit basically reverts commit 61ef18655704 ("mm: factor find_get_incore_page out of mincore_page"). Move it back to mincore side to simplify the code. Link: https://lkml.kernel.org/r/20250811172018.48901-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250811172018.48901-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm/kasan/init.c: remove unnecessary pointer variables  (Xichao Zhao)
Simplify the code to enhance readability and maintain a consistent coding style. Link: https://lkml.kernel.org/r/20250811034257.154862-1-zhao.xichao@vivo.com Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13  mm/damon/vaddr: support stat-purpose DAMOS filters  (Yueyang Pan)
This patch extends DAMOS_STAT handling of the DAMON operations set for virtual address spaces to support ops-level DAMOS filters. It leverages walk_page_range() to walk the page tables and gets the folio from the page table entries. The last folio scanned is stored in damos->last_applied to prevent double counting.

Link: https://lkml.kernel.org/r/264a4b5ea202fd73c01b349c9694d8bf9978c441.1754135312.git.pyyjason@gmail.com Signed-off-by: Yueyang Pan <pyyjason@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
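A simplified sketch of the walk, reduced to the PTE-mapped case; the callback, filter-helper and stat-field names are illustrative rather than the exact ones in mm/damon/vaddr.c, and the real code also provides a pmd_entry callback for PMD-mapped huge pages:

    static int damos_va_stat_pte(pte_t *pte, unsigned long addr,
                                 unsigned long next, struct mm_walk *walk)
    {
            struct damos *s = walk->private;
            struct folio *folio = vm_normal_folio(walk->vma, addr, ptep_get(pte));

            /* skip holes, and skip the folio we just counted so a large folio
             * spanning several PTEs is only counted once */
            if (!folio || folio == s->last_applied)
                    return 0;
            if (!damos_va_filter_out(s, folio))         /* hypothetical filter check */
                    s->stat.sz_applied += folio_size(folio);
            s->last_applied = folio;
            return 0;
    }

    static const struct mm_walk_ops damos_va_stat_ops = {
            .pte_entry = damos_va_stat_pte,
    };

    /* called with the target mm's mmap lock held for read;
     * region_start/region_end come from the DAMON region being processed */
    walk_page_range(mm, region_start, region_end, &damos_va_stat_ops, s);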
2025-09-13  mm/damon/paddr: move filters existence check function to ops-common  (Yueyang Pan)
Patch series "mm/damon/vaddr: support stat-purpose DAMOS filters", v4. Extend DAMOS_STAT handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters. Functionality Test ================== I wrote a small test program which allocates 10GB of DRAM, use madvise(MADV_HUGEPAGE) to convert the base pages to 2MB huge pages Then my program does the following things in order: 1. Write sequentially to the whole 10GB region 2. Read the first 5GB region sequentially for 10 times 3. Sleep 5s 4. Read the second 5GB region sequentially for 10 times With a proper damon setting, we are expected to see df-passed to be 10GB and hot region move around with the read $ # Start DAMON $ sudo ./damo/damo start "./my_test/test" --monitoring_intervals 100ms\ 1s 2s $ # Show DAMON-generated access pattern snapshot $ sudo ./damo/damo report access --snapshot_damos_filter allow \ hugepage_size 2MiB 2MiB heatmap: # min/max temperatures: -600,000,000, 100,001,000, column size: 137.352 MiB intervals: sample 100 ms aggr 1 s (max access hz 10) # damos filters (df): reject none hugepage_size [2.000 MiB, 2.000 MiB] df-pass: # min/max temperatures: -400,000,000, 100,001,000, column size: 128.031 MiB 0 addr 85.373 TiB size 745.555 MiB access 0 hz age 6 s df-passed 0 B 1 addr 127.608 TiB size 877.664 MiB access 3.000 hz age 0 ns df-passed 878.000 MiB 2 addr 127.609 TiB size 219.418 MiB access 2.000 hz age 0 ns df-passed 220.000 MiB 3 addr 127.609 TiB size 316.613 MiB access 1.000 hz age 1 s df-passed 316.000 MiB 4 addr 127.609 TiB size 474.922 MiB access 1.000 hz age 1 s df-passed 476.000 MiB 5 addr 127.610 TiB size 407.188 MiB access 1.000 hz age 0 ns df-passed 406.000 MiB 6 addr 127.610 TiB size 610.781 MiB access 1.000 hz age 0 ns df-passed 612.000 MiB 7 addr 127.611 TiB size 697.309 MiB access 0 hz age 0 ns df-passed 696.000 MiB 8 addr 127.611 TiB size 77.480 MiB access 1.000 hz age 0 ns df-passed 78.000 MiB 9 addr 127.611 TiB size 573.102 MiB access 1.000 hz age 0 ns df-passed 574.000 MiB 10 addr 127.612 TiB size 245.617 MiB access 2.000 hz age 0 ns df-passed 246.000 MiB 11 addr 127.612 TiB size 295.102 MiB access 1.000 hz age 1 s df-passed 294.000 MiB 12 addr 127.612 TiB size 295.105 MiB access 1.000 hz age 1 s df-passed 296.000 MiB 13 addr 127.613 TiB size 67.172 MiB access 1.000 hz age 1 s df-passed 66.000 MiB 14 addr 127.613 TiB size 604.570 MiB access 0 hz age 1 s df-passed 606.000 MiB 15 addr 127.613 TiB size 389.578 MiB access 0 hz age 4 s df-passed 388.000 MiB 16 addr 127.614 TiB size 259.719 MiB access 0 hz age 4 s df-passed 260.000 MiB 17 addr 127.614 TiB size 817.941 MiB access 0 hz age 4 s df-passed 818.000 MiB 18 addr 127.615 TiB size 204.488 MiB access 0 hz age 4 s df-passed 204.000 MiB 19 addr 127.615 TiB size 730.902 MiB access 0 hz age 4 s df-passed 732.000 MiB 20 addr 127.616 TiB size 182.727 MiB access 0 hz age 4 s df-passed 182.000 MiB 21 addr 127.616 TiB size 926.824 MiB access 0 hz age 2 s df-passed 928.000 MiB 22 addr 127.617 TiB size 102.984 MiB access 0 hz age 2 s df-passed 102.000 MiB 23 addr 127.617 TiB size 86.527 MiB access 0 hz age 2 s df-passed 86.000 MiB 24 addr 127.617 TiB size 778.777 MiB access 0 hz age 2 s df-passed 776.000 MiB 25 addr 127.999 TiB size 132.000 KiB access 0 hz age 6 s df-passed 0 B memory bw estimate: 6.524 GiB per second df-passed: 6.527 GiB per second total size: 10.731 GiB df-passed 10.000 GiB record DAMON intervals: sample 100 ms, aggr 1 s $ # Show DAMON-generated access pattern snapshot again $ sudo ./damo/damo report access 
--snapshot_damos_filter allow \ hugepage_size 2MiB 2MiB heatmap: # min/max temperatures: -1,100,000,000, 2,000, column size: 137.352 MiB intervals: sample 100 ms aggr 1 s (max access hz 10) # damos filters (df): reject none hugepage_size [2.000 MiB, 2.000 MiB] df-pass: # min/max temperatures: -900,000,000, 2,000, column size: 128.031 MiB 0 addr 85.373 TiB size 745.555 MiB access 0 hz age 11 s df-passed 0 B 1 addr 127.608 TiB size 579.715 MiB access 2.000 hz age 0 ns df-passed 580.000 MiB 2 addr 127.608 TiB size 144.930 MiB access 2.000 hz age 0 ns df-passed 146.000 MiB 3 addr 127.608 TiB size 452.453 MiB access 2.000 hz age 0 ns df-passed 452.000 MiB 4 addr 127.609 TiB size 113.117 MiB access 1.000 hz age 0 ns df-passed 114.000 MiB 5 addr 127.609 TiB size 182.367 MiB access 2.000 hz age 0 ns df-passed 182.000 MiB 6 addr 127.609 TiB size 182.371 MiB access 2.000 hz age 0 ns df-passed 182.000 MiB 7 addr 127.609 TiB size 350.488 MiB access 1.000 hz age 0 ns df-passed 350.000 MiB 8 addr 127.610 TiB size 525.738 MiB access 1.000 hz age 0 ns df-passed 526.000 MiB 9 addr 127.610 TiB size 401.352 MiB access 1.000 hz age 0 ns df-passed 402.000 MiB 10 addr 127.611 TiB size 100.340 MiB access 1.000 hz age 0 ns df-passed 100.000 MiB 11 addr 127.611 TiB size 19.523 MiB access 0 hz age 0 ns df-passed 20.000 MiB 12 addr 127.611 TiB size 175.727 MiB access 0 hz age 0 ns df-passed 176.000 MiB 13 addr 127.611 TiB size 106.629 MiB access 0 hz age 0 ns df-passed 106.000 MiB 14 addr 127.611 TiB size 959.676 MiB access 0 hz age 0 ns df-passed 960.000 MiB 15 addr 127.612 TiB size 424.469 MiB access 1.000 hz age 0 ns df-passed 424.000 MiB 16 addr 127.612 TiB size 424.469 MiB access 1.000 hz age 0 ns df-passed 424.000 MiB 17 addr 127.613 TiB size 201.648 MiB access 0 hz age 6 s df-passed 202.000 MiB 18 addr 127.613 TiB size 806.609 MiB access 0 hz age 6 s df-passed 806.000 MiB 19 addr 127.614 TiB size 862.125 MiB access 0 hz age 9 s df-passed 862.000 MiB 20 addr 127.614 TiB size 215.535 MiB access 0 hz age 9 s df-passed 216.000 MiB 21 addr 127.615 TiB size 104.500 MiB access 0 hz age 9 s df-passed 104.000 MiB 22 addr 127.615 TiB size 940.523 MiB access 0 hz age 9 s df-passed 942.000 MiB 23 addr 127.616 TiB size 640.281 MiB access 0 hz age 7 s df-passed 640.000 MiB 24 addr 127.616 TiB size 426.855 MiB access 0 hz age 7 s df-passed 426.000 MiB 25 addr 127.617 TiB size 90.105 MiB access 0 hz age 7 s df-passed 90.000 MiB 26 addr 127.617 TiB size 810.965 MiB access 0 hz age 7 s df-passed 808.000 MiB 27 addr 127.999 TiB size 132.000 KiB access 0 hz age 11 s df-passed 0 B memory bw estimate: 5.297 GiB per second df-passed: 5.297 GiB per second total size: 10.731 GiB df-passed 10.000 GiB record DAMON intervals: sample 100 ms, aggr 1 s As you can see the total df-passed region is 10GiB and the hot region moves as the seq read keeps going This patch (of 2): This patch moves damon_pa_scheme_has_filter to ops-common. renaming to damos_ops_has_filter. Doing so allows us to reuse its logic in the vaddr version of DAMOS_STAT. Link: https://lkml.kernel.org/r/cover.1754135312.git.pyyjason@gmail.com Link: https://lkml.kernel.org/r/cbe01740f7ac5ac7c9fd1ca367d297c3d7f2a69d.1754135312.git.pyyjason@gmail.com Signed-off-by: Yueyang Pan <pyyjason@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm/damon/core: skip needless update of damon_attrs in damon_commit_ctx()Bijan Tabatabai
Currently, damon_commit_ctx() always calls damon_set_attrs() even if the attributes have not been changed. This can be problematic when the DAMON state is committed relatively frequently, because damon_set_attrs() resets ctx->next_{aggregation,ops_update}_sis, causing aggregation and ops update operations to be needlessly delayed. Avoid this by only calling damon_set_attrs() in damon_commit_ctx() when the attributes have actually been changed. [akpm@linux-foundation.org: Link: https://lkml.kernel.org/r/20250807001924.76275-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250806234254.10572-1-bijan311@gmail.com Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Bijan Tabatabai <bijan311@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
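The shape of the fix is roughly the following; this is an illustrative sketch, not the exact diff (the real patch may compare the attributes with a dedicated helper rather than memcmp()):

/* in damon_commit_ctx(): only commit attrs when they actually changed,
 * so ctx->next_{aggregation,ops_update}_sis are not reset needlessly */
if (memcmp(&dst->attrs, &src->attrs, sizeof(src->attrs))) {
	err = damon_set_attrs(dst, &src->attrs);
	if (err)
		return err;
}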
2025-09-13mm/rmap: do __folio_mod_stat() in __folio_add_rmap()Wei Yang
It is required to modify folio statistic after rmap changes, so it looks reasonable to do it in __folio_add_rmap(), which is the current behavior of __folio_remove_rmap() and folio_add_new_anon_rmap(). Call __folio_mod_stat() in __folio_add_rmap(), so that rmap adjustment family shares the same pattern. Link: https://lkml.kernel.org/r/20250804064106.21269-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm/nommu: convert kobjsize() to foliosSidhartha Kumar
Simple folio conversion to remove a user of PageSlab() and PageCompound(). Link: https://lkml.kernel.org/r/20250804145117.3857308-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm/slub: allow to set node and align in k[v]reallocVitaly Wool
Reimplement k[v]realloc_node() to be able to set node and alignment should a user need to do so. In order to do that while retaining maximal backward compatibility, add k[v]realloc_node_align() functions and redefine the rest of the API using these new ones. While doing that, we also keep the number of _noprof variants to a minimum, which implies some changes to the existing users of the older _noprof functions, which in practice means bcachefs. With that change we also provide the ability for the Rust part of the kernel to set node and alignment in its K[v]xxx [re]allocations. Link: https://lkml.kernel.org/r/20250806124147.1724658-1-vitaly.wool@konsulko.se Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jann Horn <jannh@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
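A usage sketch of the new API follows; the exact parameter order of k[v]realloc_node_align() is assumed here rather than taken from the patch, so treat it as illustrative.

/* grow a buffer while pinning it to a NUMA node with a given alignment */
buf = kvrealloc_node_align(buf, new_size, PAGE_SIZE, GFP_KERNEL, nid);
if (!buf)
	return -ENOMEM;

/* the pre-existing entry points keep working unchanged */
buf = kvrealloc(buf, new_size, GFP_KERNEL);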
2025-09-13mm/vmalloc: allow to set node and align in vreallocVitaly Wool
Patch series "support large align and nid in Rust allocators", v15. The series provides the ability for Rust allocators to set NUMA node and large alignment. This patch (of 4): Reimplement vrealloc() to be able to set node and alignment should a user need to do so. Rename the function to vrealloc_node_align() to better match what it actually does now and introduce macros for vrealloc() and friends for backward compatibility. With that change we also provide the ability for the Rust part of the kernel to set node and alignment in its allocations. Link: https://lkml.kernel.org/r/20250806124034.1724515-1-vitaly.wool@konsulko.se Link: https://lkml.kernel.org/r/20250806124108.1724561-1-vitaly.wool@konsulko.se Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jann Horn <jannh@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm, swap: prefer nonfull over free clustersKairui Song
We prefer a free cluster over a nonfull cluster whenever a CPU local cluster is drained, to respect the SSD discard behavior [1]. This is not a good fit for non-discarding devices, and it causes a higher fragmentation rate. So for a non-discarding device, prefer nonfull over free clusters. This reduces the fragmentation issue by a lot.

Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
Before: sys time: 6176.34s 64kB/swpout: 1659757 64kB/swpout_fallback: 139503
After:  sys time: 6194.11s 64kB/swpout: 1689470 64kB/swpout_fallback: 56147

Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:
Before: sys time: 5531.49s 64kB/swpout: 1791142 64kB/swpout_fallback: 17676
After:  sys time: 5587.53s 64kB/swpout: 1811598 64kB/swpout_fallback: 0

Performance is basically unchanged, and the large allocation failure rate is lower. Enabling all mTHP sizes showed a more significant result. Using the same test setup with 10G ZRAM and enabling all mTHP sizes:

128kB swap failure rate:
Before: swpout:451599 swpout_fallback:54525
After:  swpout:502710 swpout_fallback:870

256kB swap failure rate:
Before: swpout:63652 swpout_fallback:2708
After:  swpout:65913 swpout_fallback:20

512kB swap failure rate:
Before: swpout:11663 swpout_fallback:1767
After:  swpout:14480 swpout_fallback:6

2M swap failure rate:
Before: swpout:24 swpout_fallback:1442
After:  swpout:1329 swpout_fallback:7

The success rate of large allocations is much higher.

Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1] Link: https://lkml.kernel.org/r/20250806161748.76651-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
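In other words, the cluster selection order becomes roughly the following; the helper names here are made up for illustration and the discard check is approximate, so this is a sketch rather than the actual swapfile.c code.

/* refilling the per-CPU cluster for a given allocation order */
if (si->flags & SWP_PAGE_DISCARD) {
	/* discarding device: keep preferring free clusters so whole
	 * clusters can still be discarded later */
	ci = take_free_cluster(si, order);
	if (!ci)
		ci = take_nonfull_cluster(si, order);
} else {
	/* non-discarding device: reuse partially filled clusters first
	 * to keep fragmentation down */
	ci = take_nonfull_cluster(si, order);
	if (!ci)
		ci = take_free_cluster(si, order);
}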
2025-09-13mm, swap: remove fragment clusters counterKairui Song
It was used for calculating the iteration number when the swap allocator wants to scan the whole fragment list. Now the allocator only scans one fragment cluster at a time, so no one uses this counter anymore. Remove it as a cleanup; the performance change is marginal:

Build linux kernel using 10G ZRAM, make -j96, defconfig with 2G cgroup memory limit, on top of tmpfs, 64kB mTHP enabled:
Before: sys time: 6278.45s
After:  sys time: 6176.34s

Change to 8G ZRAM:
Before: sys time: 5572.85s
After:  sys time: 5531.49s

Link: https://lkml.kernel.org/r/20250806161748.76651-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm, swap: only scan one cluster in fragment listKairui Song
Patch series "mm, swap: improve cluster scan strategy", v2. This series improves the large allocation performance and reduces the failure rate. Some design of the cluster alloactor was later found to be improvable after thorough testing. The allocator spent too much effort scanning the fragment list, which is not helpful in most setups, but causes serious contention of the list lock (si->lock). Besides, the allocator prefers free clusters when searching for a new cluster due to historical reasons, which causes fragmentation issues. So make the allocator only scan one cluster for high order allocation, and prefer nonfull cluster. This both improves the performance and reduces fragmentation. For example, build kernel test with make -j96 and 10G ZRAM with 64kB mTHP enabled shows better performance and a lower failure rate: Before: sys time: 11609.69s 64kB/swpout: 1787051 64kB/swpout_fallback: 20917 After: sys time: 5587.53s 64kB/swpout: 1811598 64kB/swpout_fallback: 0 System time is cut in half, and the failure rate drops to zero. Larger allocations in a hybrid workload also showed a major improvement: 512kB swap failure rate: Before: swpout:11663 swpout_fallback:1767 After: swpout:14480 swpout_fallback:6 2M swap failure rate: Before: swpout:24 swpout_fallback:1442 After: swpout:1329 swpout_fallback:7 This patch (of 3): Fragment clusters were mostly failing high order allocation already. The reason we scan it through now is that a swap slot may get freed without releasing the swap cache, so a swap map entry will end up in HAS_CACHE only status, and the cluster won't be moved back to non-full or free cluster list. This may cause a higher allocation failure rate. Usually only !SWP_SYNCHRONOUS_IO devices may have a large number of slots stuck in HAS_CACHE only status. Because when a !SWP_SYNCHRONOUS_IO device's usage is low (!vm_swap_full()), it will try to lazy free the swap cache. But this fragment list scan out is a bit overkill. Fragmentation is only an issue for the allocator when the device is getting full, and by that time, swap will be releasing the swap cache aggressively already. Only scanning one fragment cluster at a time is good enough to reclaim already pinned slots, and move the cluster back to nonfull. And besides, only high order allocation requires iterating over the list, order 0 allocation will succeed on the first attempt. And high order allocation failure isn't a serious problem. So the iteration of fragment clusters is trivial, but it will slow down large allocation by a lot when the fragment cluster list is long. So it's better to drop this fragment cluster iteration design. 
Test on a 48c96t system, build linux kernel using 10G ZRAM, make -j48, defconfig with 768M cgroup memory limit, on top of tmpfs, 4K folio only: Before: sys time: 4432.56s After: sys time: 4430.18s Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM: Before: sys time: 11609.69s 64kB/swpout: 1787051 64kB/swpout_fallback: 20917 After: sys time: 5572.85s 64kB/swpout: 1797612 64kB/swpout_fallback: 19254 Change to 8G ZRAM: Before: sys time: 21524.35s 64kB/swpout: 1687142 64kB/swpout_fallback: 128496 After: sys time: 6278.45s 64kB/swpout: 1679127 64kB/swpout_fallback: 130942 Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed: Before: sys time: 7393.50s 64kB/swpout:1788246 swpout_fallback: 0 After: sys time: 7399.88s 64kB/swpout:1784257 swpout_fallback: 0 Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed: Before: sys time: 26292.26s 64kB/swpout:1645236 swpout_fallback: 138945 After: sys time: 9463.16s 64kB/swpout:1581376 swpout_fallback: 259979 The performance is a lot better for large folios, and the large order allocation failure rate is only very slightly higher or unchanged even for !SWP_SYNCHRONOUS_IO devices high pressure. Link: https://lkml.kernel.org/r/20250806161748.76651-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250806161748.76651-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: change vma_start_read() to drop RCU lock on failureSuren Baghdasaryan
vma_start_read() can drop and reacquire the RCU lock in certain failure cases. It's not apparent to the caller that the RCU session it started might be interrupted when vma_start_read() fails to lock the vma. This might become a source of subtle bugs, so to prevent that we change the locking rules for vma_start_read() to drop the RCU read lock upon failure. This way it's more obvious that RCU-protected objects are unsafe after vma locking fails. Link: https://lkml.kernel.org/r/20250804233349.1278678-2-surenb@google.com Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: limit the scope of vma_start_read()Suren Baghdasaryan
Limit the scope of vma_start_read() as it is used only as a helper for higher-level locking functions implemented inside mmap_lock.c and we are about to introduce more complex RCU rules for this function. The change is pure code refactoring and has no functional changes. Link: https://lkml.kernel.org/r/20250804233349.1278678-1-surenb@google.com Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: replace (20 - PAGE_SHIFT) with common macros for pages<->MB conversionYe Liu
Replace repeated (20 - PAGE_SHIFT) calculations with standard macros:

- MB_TO_PAGES(mb) converts MB to page count
- PAGES_TO_MB(pages) converts pages to MB

No functional change.

[akpm@linux-foundation.org: remove arc's private PAGES_TO_MB, remove its unused PAGES_TO_KB] [akpm@linux-foundation.org: don't include mm.h due to include file ordering mess] Link: https://lkml.kernel.org/r/20250718024134.1304745-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
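The macros presumably wrap exactly the shift this patch removes from the call sites, along these lines:

/* pages <-> MiB; 20 is log2(1 MiB) and PAGE_SHIFT is log2(PAGE_SIZE) */
#define MB_TO_PAGES(mb)		((mb) << (20 - PAGE_SHIFT))
#define PAGES_TO_MB(pages)	((pages) >> (20 - PAGE_SHIFT))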
2025-09-13mm: memory-tiering: fix PGPROMOTE_CANDIDATE countingRuan Shiyang
Goto-san reported confusing pgpromote statistics where the pgpromote_success count significantly exceeded pgpromote_candidate. On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):

# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid # terminate the 1st memhog
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing

After a few seconds, we observed `pgpromote_candidate < pgpromote_success`:

$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 0

In this scenario, after terminating the first memhog, the conditions for pgdat_free_space_enough() are quickly met, which triggers promotion. However, these migrated pages are only counted in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE. To resolve these confusing statistics, introduce PGPROMOTE_CANDIDATE_NRL to count the missed promotion pages. These pages are deliberately not counted into PGPROMOTE_CANDIDATE, to avoid changing the existing algorithm or performance of the promotion rate limit.

Link: https://lkml.kernel.org/r/20250901090122.124262-1-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/20250729035101.1601407-1-ruansy.fnst@fujitsu.com Fixes: c6833e10008f ("memory tiering: rate limit NUMA migration throughput") Co-developed-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm/mglru: update MG-LRU proactive reclaim statistics only to memcgHao Jia
Users can use /sys/kernel/debug/lru_gen to trigger proactive memory reclaim of a specified memcg. Currently, statistics such as pgrefill, pgscan and pgsteal from such reclaim are also added to the system-wide /proc/vmstat counters. This confuses some system memory pressure monitoring tools, making it difficult to determine whether pgscan and pgsteal are caused by system-level pressure or by proactive memory reclaim of some specific memory cgroup. Therefore, make this interface behave similarly to memory.reclaim: update proactive memory reclaim statistics only in the target memory cgroup. Link: https://lkml.kernel.org/r/20250717082845.34673-1-jiahao.kernel@gmail.com Signed-off-by: Hao Jia <jiahao1@lixiang.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kinsey Ho <kinseyho@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13kasan: add test for SLAB_TYPESAFE_BY_RCU quarantine skippingJann Horn
Verify that KASAN does not quarantine objects in SLAB_TYPESAFE_BY_RCU slabs if CONFIG_SLUB_RCU_DEBUG is off. [jannh@google.com: v2] Link: https://lkml.kernel.org/r/20250729-kasan-tsbrcu-noquarantine-test-v2-1-d16bd99309c9@google.com [jannh@google.com: make comment more verbose] Link: https://lkml.kernel.org/r/20250814-kasan-tsbrcu-noquarantine-test-v3-1-9e9110009b4e@google.com Link: https://lkml.kernel.org/r/20250728-kasan-tsbrcu-noquarantine-test-v1-1-fa24d9ab7f41@google.com Signed-off-by: Jann Horn <jannh@google.com> Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
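Conceptually the new test checks something like the following; this is an illustrative sketch, not the actual KUnit test code.

struct kmem_cache *cache;
char *p;

cache = kmem_cache_create("test-tsbrcu", 32, 0, SLAB_TYPESAFE_BY_RCU, NULL);
p = kmem_cache_alloc(cache, GFP_KERNEL);

kmem_cache_free(cache, p);
/* With CONFIG_SLUB_RCU_DEBUG=n the freed object must not be put in the
 * KASAN quarantine: SLAB_TYPESAFE_BY_RCU keeps the memory type-stable
 * until a grace period, so this access must not be reported. */
READ_ONCE(*p);

rcu_barrier();		/* let RCU-deferred freeing finish */
kmem_cache_destroy(cache);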
2025-09-13mm/damon/sysfs: use dynamically allocated repeat mode damon_call_controlSeongJae Park
DAMON sysfs interface is using a single global repeat mode damon_call_control variable for refresh_ms handling, for all DAMON contexts. As a result, when there are more than one context, the single global damon_call_control is unexpectedly over-written (corrupted). Particularly the ->link field is overwritten by the multiple contexts and this can cause a user hangup, and/or a kernel crash. Fix it by using dynamically allocated damon_call_control object per DAMON context. Link: https://lkml.kernel.org/r/20250908201513.60802-3-sj@kernel.org Link: https://lore.kernel.org/20250904011738.930-1-yunjeong.mun@sk.com [1] Link: https://lore.kernel.org/20250905035411.39501-1-sj@kernel.org [2] Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Yunjeong Mun <yunjeong.mun@sk.com> Closes: https://lore.kernel.org/20250904011738.930-1-yunjeong.mun@sk.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm/damon/core: introduce damon_call_control->dealloc_on_cancelSeongJae Park
Patch series "mm/damon/sysfs: fix refresh_ms control overwriting on multi-kdamonds usages". Automatic esssential DAMON/DAMOS status update feature of DAMON sysfs interface (refresh_ms) is broken [1] for multiple DAMON contexts (kdamonds) use case, since it uses a global single damon_call_control object for all created DAMON contexts. The fields of the object, particularly the list field is over-written for the contexts and it makes unexpected results including user-space hangup and kernel crashes [2]. Fix it by extending damon_call_control for the use case and updating the usage on DAMON sysfs interface to use per-context dynamically allocated damon_call_control object. This patch (of 2): When damon_call_control->repeat is set, damon_call() is executed asynchronously, and is eventually canceled when kdamond finishes. If the damon_call_control object is dynamically allocated, finding the place to deallocate the object is difficult. Introduce a new damon_call_control field, namely dealloc_on_cancel, to ask the kdamond deallocates those dynamically allocated objects when those are canceled. Link: https://lkml.kernel.org/r/20250908201513.60802-3-sj@kernel.org Link: https://lkml.kernel.org/r/20250908201513.60802-2-sj@kernel.org Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: folio_may_be_lru_cached() unless folio_test_large()Hugh Dickins
mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as a large folio is added: so collect_longterm_unpinnable_folios() just wastes effort when calling lru_add_drain[_all]() on a large folio. But although there is good reason not to batch up PMD-sized folios, we might well benefit from batching a small number of low-order mTHPs (though unclear how that "small number" limitation will be implemented). So ask if folio_may_be_lru_cached() rather than !folio_test_large(), to insulate those particular checks from future change. Name preferred to "folio_is_batchable" because large folios can well be put on a batch: it's just the per-CPU LRU caches, drained much later, which need care. Marked for stable, to counter the increase in lru_add_drain_all()s from "mm/gup: check ref_count instead of lru before migration". Link: https://lkml.kernel.org/r/57d2eaf8-3607-f318-e0c5-be02dce61ad0@google.com Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region") Signed-off-by: Hugh Dickins <hughd@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Keir Fraser <keirf@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: yangge <yangge1116@126.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
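From the description, the new helper is essentially a named wrapper around today's check, leaving room for a future low-order mTHP batching limit; a minimal sketch:

static inline bool folio_may_be_lru_cached(struct folio *folio)
{
	/* Today only small folios go through the per-CPU LRU caches; a
	 * future "small number of low-order mTHPs" rule can change this
	 * in one place without touching the callers. */
	return !folio_test_large(folio);
}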
2025-09-13mm: revert "mm: vmscan.c: fix OOM on swap stress test"Hugh Dickins
This reverts commit 0885ef470560: that was a fix to the reverted 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9. Link: https://lkml.kernel.org/r/aa0e9d67-fbcd-9d79-88a1-641dfbe1d9d1@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Keir Fraser <keirf@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: yangge <yangge1116@126.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"Hugh Dickins
This reverts commit 33dfe9204f29: now that collect_longterm_unpinnable_folios() is checking ref_count instead of lru, and mlock/munlock do not participate in the revised LRU flag clearing, those changes are misleading, and enlarge the window during which mlock/munlock may miss an mlock_count update. It is possible (I'd hesitate to claim probable) that the greater likelihood of missed mlock_count updates would explain the "Realtime threads delayed due to kcompactd0" observed on 6.12 in the Link below. If that is the case, this reversion will help; but a complete solution needs also a further patch, beyond the scope of this series. Included some 80-column cleanup around folio_batch_add_and_move(). The role of folio_test_clear_lru() (before taking per-memcg lru_lock) is questionable since 6.13 removed mem_cgroup_move_account() etc; but perhaps there are still some races which need it - not examined here. Link: https://lore.kernel.org/linux-mm/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/ Link: https://lkml.kernel.org/r/05905d7b-ed14-68b1-79d8-bdec30367eba@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Keir Fraser <keirf@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: yangge <yangge1116@126.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm/gup: local lru_add_drain() to avoid lru_add_drain_all()Hugh Dickins
In many cases, if collect_longterm_unpinnable_folios() does need to drain the LRU cache to release a reference, the cache in question is on this same CPU, and much more efficiently drained by a preliminary local lru_add_drain(), than the later cross-CPU lru_add_drain_all(). Marked for stable, to counter the increase in lru_add_drain_all()s from "mm/gup: check ref_count instead of lru before migration". Note for clean backports: can take 6.16 commit a03db236aebf ("gup: optimize longterm pin_user_pages() for large folio") first. Link: https://lkml.kernel.org/r/66f2751f-283e-816d-9530-765db7edc465@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Keir Fraser <keirf@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: yangge <yangge1116@126.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm/gup: check ref_count instead of lru before migrationHugh Dickins
Patch series "mm: better GUP pin lru_add_drain_all()", v2. Series of lru_add_drain_all()-related patches, arising from recent mm/gup migration report from Will Deacon. This patch (of 5): Will Deacon reports:- When taking a longterm GUP pin via pin_user_pages(), __gup_longterm_locked() tries to migrate target folios that should not be longterm pinned, for example because they reside in a CMA region or movable zone. This is done by first pinning all of the target folios anyway, collecting all of the longterm-unpinnable target folios into a list, dropping the pins that were just taken and finally handing the list off to migrate_pages() for the actual migration. It is critically important that no unexpected references are held on the folios being migrated, otherwise the migration will fail and pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is relatively easy to observe migration failures when running pKVM (which uses pin_user_pages() on crosvm's virtual address space to resolve stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and this results in the VM terminating prematurely. In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its mapping of guest memory prior to the pinning. Subsequently, when pin_user_pages() walks the page-table, the relevant 'pte' is not present and so the faulting logic allocates a new folio, mlocks it with mlock_folio() and maps it in the page-table. Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch by pagevec"), mlock/munlock operations on a folio (formerly page), are deferred. For example, mlock_folio() takes an additional reference on the target folio before placing it into a per-cpu 'folio_batch' for later processing by mlock_folio_batch(), which drops the refcount once the operation is complete. Processing of the batches is coupled with the LRU batch logic and can be forcefully drained with lru_add_drain_all() but as long as a folio remains unprocessed on the batch, its refcount will be elevated. This deferred batching therefore interacts poorly with the pKVM pinning scenario as we can find ourselves in a situation where the migration code fails to migrate a folio due to the elevated refcount from the pending mlock operation. Hugh Dickins adds:- !folio_test_lru() has never been a very reliable way to tell if an lru_add_drain_all() is worth calling, to remove LRU cache references to make the folio migratable: the LRU flag may be set even while the folio is held with an extra reference in a per-CPU LRU cache. 5.18 commit 2fbb0c10d1e8 may have made it more unreliable. Then 6.11 commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch") tried to make it reliable, by moving LRU flag clearing; but missed the mlock/munlock batches, so still unreliable as reported. And it turns out to be difficult to extend 33dfe9204f29's LRU flag clearing to the mlock/munlock batches: if they do benefit from batching, mlock/munlock cannot be so effective when easily suppressed while !LRU. Instead, switch to an expected ref_count check, which was more reliable all along: some more false positives (unhelpful drains) than before, and never a guarantee that the folio will prove migratable, but better. Note on PG_private_2: ceph and nfs are still using the deprecated PG_private_2 flag, with the aid of netfs and filemap support functions. 
Although it is consistently matched by an increment of folio ref_count, folio_expected_ref_count() intentionally does not recognize it, and ceph folio migration currently depends on that for PG_private_2 folios to be rejected. New references to the deprecated flag are discouraged, so do not add it into the collect_longterm_unpinnable_folios() calculation: but longterm pinning of transiently PG_private_2 ceph and nfs folios (an uncommon case) may invoke a redundant lru_add_drain_all(). And this makes easy the backport to earlier releases: up to and including 6.12, btrfs also used PG_private_2, but without a ref_count increment. Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation"). Link: https://lkml.kernel.org/r/41395944-b0e3-c3ac-d648-8ddd70451d28@google.com Link: https://lkml.kernel.org/r/bd1f314a-fca1-8f19-cac0-b936c9614557@google.com Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region") Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Will Deacon <will@kernel.org> Closes: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/ Acked-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Keir Fraser <keirf@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: yangge <yangge1116@126.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
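The switch described above amounts to a check of this shape in collect_longterm_unpinnable_folios(); the exact accounting for the pin just taken may differ, so read it as a sketch.

/* old heuristic was "!folio_test_lru(folio)", which is unreliable */
if (drain_allow &&
    folio_ref_count(folio) != folio_expected_ref_count(folio) + 1) {
	/* an extra reference is probably sitting in a per-CPU batch */
	lru_add_drain_all();
	drain_allow = false;
}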
2025-09-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc6).

Conflicts:

net/netfilter/nft_set_pipapo.c
net/netfilter/nft_set_pipapo_avx2.c
  c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups")
  84c1da7b38d9 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too")

Only trivial adjacent changes (in a doc and a Makefile).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-10Merge tag 'mm-hotfixes-stable-2025-09-10-20-00' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "20 hotfixes. 15 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 14 of these fixes are for MM.

  This includes

   - kexec fixes from Breno for a recently introduced use-uninitialized bug

   - DAMON fixes from Quanmin Yan to avoid div-by-zero crashes which can occur if the operator uses poorly-chosen insmod parameters

  and misc singleton fixes"

* tag 'mm-hotfixes-stable-2025-09-10-20-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add tree entry to numa memblocks and emulation block
  mm/damon/sysfs: fix use-after-free in state_show()
  proc: fix type confusion in pde_set_flags()
  compiler-clang.h: define __SANITIZE_*__ macros only when undefined
  mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc()
  ocfs2: fix recursive semaphore deadlock in fiemap call
  mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory
  mm/mremap: fix regression in vrm->new_addr check
  percpu: fix race on alloc failed warning limit
  mm/memory-failure: fix redundant updates for already poisoned pages
  s390: kexec: initialize kexec_buf struct
  riscv: kexec: initialize kexec_buf struct
  arm64: kexec: initialize kexec_buf struct in load_other_segments()
  mm/damon/reclaim: avoid divide-by-zero in damon_reclaim_apply_parameters()
  mm/damon/lru_sort: avoid divide-by-zero in damon_lru_sort_apply_parameters()
  mm/damon/core: set quota->charged_from to jiffies at first charge window
  mm/hugetlb: add missing hugetlb_lock in __unmap_hugepage_range()
  init/main.c: fix boot time tracing crash
  mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range()
  mm/khugepaged: fix the address passed to notifier on testing young
2025-09-10mm/slub: Refactor note_cmpxchg_failure for better readabilityYe Liu
Use IS_ENABLED() and standard if-else to make the code clearer. Signed-off-by: Ye Liu <liuye@kylinos.cn> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-10mm/slub: Replace sort_r() with sort() for debugfs stack trace sortingKuan-Wei Chiu
The comparison function used to sort stack trace locations in debugfs never relied on the third argument. Therefore, sort_r() is unnecessary. Switch to sort() with a two-argument comparison function to keep the code simple and aligned with the intended usage. Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-10mm/slub: Fix cmp_loc_by_count() to return 0 when counts are equalKuan-Wei Chiu
The comparison function cmp_loc_by_count() used for sorting stack trace locations in debugfs currently returns -1 if a->count > b->count and 1 otherwise. This breaks the antisymmetry property required by sort(), because when two counts are equal, both cmp(a, b) and cmp(b, a) return 1. This can lead to undefined or incorrect ordering results. Fix it by updating the comparison logic to explicitly handle the case when counts are equal, and use cmp_int() to ensure the comparison function adheres to the required mathematical properties of antisymmetry. Fixes: 553c0369b3e1 ("mm/slub: sort debugfs output by frequency of stack traces") Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
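With both changes in place, the debugfs sorting ends up looking roughly like this (a sketch based on the two descriptions above; field and variable names are approximate):

/* descending by count; returns 0 for equal counts so that cmp(a, b)
 * and cmp(b, a) stay antisymmetric, as sort() requires */
static int cmp_loc_by_count(const void *a, const void *b)
{
	const struct location *loc_a = a;
	const struct location *loc_b = b;

	return cmp_int(loc_b->count, loc_a->count);
}

/* a two-argument comparator, so plain sort() suffices (no sort_r()) */
sort(t->loc, t->count, sizeof(struct location), cmp_loc_by_count, NULL);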
2025-09-08mm/damon/sysfs: fix use-after-free in state_show()Stanislav Fort
state_show() reads kdamond->damon_ctx without holding damon_sysfs_lock. This allows a use-after-free race:

CPU 0                                   CPU 1
-----                                   -----
state_show()                            damon_sysfs_turn_damon_on()
ctx = kdamond->damon_ctx;               mutex_lock(&damon_sysfs_lock);
                                        damon_destroy_ctx(kdamond->damon_ctx);
                                        kdamond->damon_ctx = NULL;
                                        mutex_unlock(&damon_sysfs_lock);
damon_is_running(ctx); /* ctx is freed */
mutex_lock(&ctx->kdamond_lock); /* UAF */

(The race can also occur with damon_sysfs_kdamonds_rm_dirs() and damon_sysfs_kdamond_release(), which free or replace the context under damon_sysfs_lock.)

Fix by taking damon_sysfs_lock before dereferencing the context, mirroring the locking used in pid_show().

The bug has existed since state_show() first accessed kdamond->damon_ctx.

Link: https://lkml.kernel.org/r/20250905101046.2288-1-disclosure@aisle.com Fixes: a61ea561c871 ("mm/damon/sysfs: link DAMON for virtual address spaces monitoring") Signed-off-by: Stanislav Fort <disclosure@aisle.com> Reported-by: Stanislav Fort <disclosure@aisle.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
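The fix is essentially to take the lock around the dereference, the same way pid_show() does; a sketch of the fixed reader (simplified: the real function may print the sysfs command strings rather than literal "on"/"off"):

static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr,
		char *buf)
{
	struct damon_sysfs_kdamond *kdamond = container_of(kobj,
			struct damon_sysfs_kdamond, kobj);
	struct damon_ctx *ctx;
	bool running;

	if (!mutex_trylock(&damon_sysfs_lock))
		return -EBUSY;
	ctx = kdamond->damon_ctx;
	running = ctx && damon_is_running(ctx);
	mutex_unlock(&damon_sysfs_lock);

	return sysfs_emit(buf, "%s\n", running ? "on" : "off");
}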
2025-09-08mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc()Uladzislau Rezki (Sony)
kasan_populate_vmalloc() and its helpers ignore the caller's gfp_mask and always allocate memory using the hardcoded GFP_KERNEL flag. This makes them inconsistent with vmalloc(), which was recently extended to support GFP_NOFS and GFP_NOIO allocations. Page table allocations performed during shadow population also ignore the external gfp_mask. To preserve the intended semantics of GFP_NOFS and GFP_NOIO, wrap the apply_to_page_range() calls into the appropriate memalloc scope.

xfs calls vmalloc with GFP_NOFS, so this bug could lead to deadlock. There was a report here https://lkml.kernel.org/r/686ea951.050a0220.385921.0016.GAE@google.com

This patch:

- Extends kasan_populate_vmalloc() and helpers to take gfp_mask;
- Passes gfp_mask down to alloc_pages_bulk() and __get_free_page();
- Enforces GFP_NOFS/NOIO semantics with memalloc_*_save()/restore() around apply_to_page_range();
- Updates vmalloc.c and percpu allocator call sites accordingly.

Link: https://lkml.kernel.org/r/20250831121058.92971-1-urezki@gmail.com Fixes: 451769ebb7e7 ("mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reported-by: syzbot+3470c9ffee63e4abafeb@syzkaller.appspotmail.com Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
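The GFP_NOFS/GFP_NOIO handling described above boils down to wrapping the page-table population in the matching memalloc scope. The sketch below is illustrative: the function name and the pte callback's data argument are assumptions, not the exact patch.

static int kasan_populate_vmalloc_shadow(unsigned long start,
					 unsigned long size, gfp_t gfp_mask)
{
	bool noio = !(gfp_mask & __GFP_IO);
	bool nofs = !noio && !(gfp_mask & __GFP_FS);
	unsigned int flags = 0;
	int ret;

	/* GFP_NOFS/GFP_NOIO callers must not see FS/IO reclaim from the
	 * page-table allocations done inside apply_to_page_range() */
	if (noio)
		flags = memalloc_noio_save();
	else if (nofs)
		flags = memalloc_nofs_save();

	ret = apply_to_page_range(&init_mm, start, size,
				  kasan_populate_vmalloc_pte, NULL);

	if (noio)
		memalloc_noio_restore(flags);
	else if (nofs)
		memalloc_nofs_restore(flags);

	return ret;
}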