summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2025-07-13mm/madvise: eliminate very confusing manipulation of prev VMALorenzo Stoakes
The madvise code has for the longest time had very confusing code around the 'prev' VMA pointer passed around various functions which, in all cases except madvise_update_vma(), is unused and instead simply updated as soon as the function is invoked. To compound the confusion, the prev pointer is also used to indicate to the caller that the mmap lock has been dropped and that we can therefore not safely access the end of the current VMA (which might have been updated by madvise_update_vma()). Clear up this confusion by not setting prev = vma anywhere except in madvise_walk_vmas(), update all references to prev which will always be equal to vma after madvise_vma_behavior() is invoked, and adding a flag to indicate that the lock has been dropped to make this explicit. Additionally, drop a redundant BUG_ON() from madvise_collapse(), which is simply reiterating the BUG_ON(mmap_locked) above it (note that BUG_ON() is not appropriate here, but we leave existing code as-is). We finally adjust the madvise_walk_vmas() logic to be a little clearer - delaying the assignment of the end of the range to the start of the new range until the last moment and handling the lock being dropped scenario immediately. Additionally add some explanatory comments. [lorenzo.stoakes@oracle.com: fix very subtle bug] Link: https://lkml.kernel.org/r/dca94cde-8afb-4eab-8e57-3f508624d670@lucifer.local Link: https://lkml.kernel.org/r/63d281c5df2e64225ab5b4bda398b45e22818701.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13mm/madvise: thread all madvise state through madv_behaviorLorenzo Stoakes
Doing so means we can get rid of all the weird struct vm_area_struct **prev stuff, everything becomes consistent and in future if we want to make change to behaviour there's a single place where all relevant state is stored. This also allows us to update try_vma_read_lock() to be a little more succinct and set up state for us, as well as cleaning up madvise_update_vma(). We also update the debug assertion prior to madvise_update_vma() to assert that this is a write operation as correctly pointed out by Barry in the relevant thread. We can't reasonably update the madvise functions that live outside of mm/madvise.c so we leave those as-is. Link: https://lkml.kernel.org/r/7b345ab82ef51e551f8bc0c4f7be25712871629d.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13mm/madvise: thread VMA range state through madvise_behaviorLorenzo Stoakes
Rather than updating start and a confusing local parameter 'tmp' in madvise_walk_vmas(), instead store the current range being operated upon in the struct madvise_behavior helper object in a range pair and use this consistently in all operations. This makes it clearer what is going on and opens the door to further cleanup now we store state regarding what is currently being operated upon here. Link: https://lkml.kernel.org/r/518480ceb48553d3c280bc2b0b5e77bbad840147.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13mm/madvise: thread mm_struct through madvise_behaviorLorenzo Stoakes
There's no need to thread a pointer to the mm_struct nor have different functions signatures for each behaviour, instead store state in the struct madvise_behavior object consistently and use it for all madvise() actions. Link: https://lkml.kernel.org/r/a47d850b0111735e026d438c3300c0e3b7f439f4.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13mm/madvise: remove the visitor pattern and thread anon_vma stateLorenzo Stoakes
Patch series "madvise cleanup", v2. This is a series of patches that helps address a number of historic problems in the madvise() implementation: * Eliminate the visitor pattern and having the code which is implemented for both the anon_vma_name implementation and ordinary madvise() operations use the same madvise_vma_behavior() implementation. * Thread state through the madvise_behavior state object - this object, very usefully introduced by SJ, is already used to transmit state through operations. This series extends this by having all madvise() operations use this, including anon_vma_name. * Thread range, VMA state through madvise_behavior - This helps avoid a lot of the confusing code around range and VMA state and again keeps things consistent and with a single 'source of truth'. * Addressing the very strange behaviour around the passed around struct vm_area_struct **prev pointer - all read-only users do absolutely nothing with the prev pointer. The only function that uses it is madvise_update_vma(), and in all cases prev is always reset to VMA. Fix this by no longer having aything but madvise_update_vma() reference prev, and having madvise_walk_vmas() update prev in each instance. Additionally make it clear that the meaningful change in vma state is when madvise_update_vma() potentially merges a VMA, so explicitly retrieve the VMA in this case. * Update and clarify the madvise_walk_vmas() function - this is a source of a great deal of confusion, so simplify, stop using prev = NULL to signify that the mmap lock has been dropped (!) and make that explicit, and add some comments to explain what's going on. This patch (of 5): Now we have the madvise_behavior helper struct we no longer need to mess around with void* pointers in order to propagate anon_vma_name, and this means we can get rid of the confusing and inconsistent visitor pattern implementation in madvise_vma_anon_name(). This means we now have a single state object that threads through most of madvise()'s logic and a single code path which executes the majority of madvise() behaviour (we maintain separate logic for failure injection and memory population for the time being). We are able to remove the visitor pattern by handling the anon_vma_name setting logic via an internal madvise flag - __MADV_SET_ANON_VMA_NAME. This uses a negative value so it isn't reasonable that we will ever add this as a UAPI flag. Additionally, the madvise_behavior_valid() check ensures that user-specified behaviours are strictly only those we permit which, of course, this flag will be excluded from. We are able to propagate the anon_vma_name object through use of the madvise_behavior helper struct. Doing this results in a can_modify_vma_madv() check for anonymous VMA name changes, however this will cause no issues as this operation is not prohibited. We can also then reuse more code and drop the redundant madvise_vma_anon_name() function altogether. Additionally separate out behaviours that update VMAs from those that do not. Link: https://lkml.kernel.org/r/cover.1750433500.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/c5094bfccb41ecd19d4e9bcaa1c4a11e00158bba.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-12Merge branch 'mm-hotfixes-stable' into mm-stable to pick up changes whichAndrew Morton
are required for a merge of the series "mm: folio_pte_batch() improvements".
2025-07-09mm: unexport globally copy_to_kernel_nofaultSabyrzhan Tasbolatov
copy_to_kernel_nofault() is an internal helper which should not be visible to loadable modules – exporting it would give exploit code a cheap oracle to probe kernel addresses. Instead, keep the helper un-exported and compile the kunit case that exercises it only when mm/kasan/kasan_test.o is linked into vmlinux. [snovitoll@gmail.com: add a brief comment to `#ifndef MODULE`] Link: https://lkml.kernel.org/r/20250622141142.79332-1-snovitoll@gmail.com Link: https://lkml.kernel.org/r/20250622051906.67374-1-snovitoll@gmail.com Fixes: ca79a00bb9a8 ("kasan: migrate copy_user_test to kunit") Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Marco Elver <elver@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon/sysfs: decouple from damon_ops_idSeongJae Park
Decouple DAMON sysfs interface from damon_ops_id. For this, define and use new mm/damon/sysfs.c internal data structure that maps the user-space keywords and damon_ops_id, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon/sysfs-schemes: decouple from damos_filter_typeSeongJae Park
Decouple DAMOS sysfs interface from damos_filter_type. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_filter_type, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon/sysfs-schemes: decouple from damos_wmark_metricSeongJae Park
Decouple DAMOS sysfs interface from damos_wmark_metric. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_wmark_metric, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon/sysfs-schemes: decouple from damos_actionSeongJae Park
Decouple DAMOS sysfs interface from damos_action. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_action, instead of having the implicit and unflexible array index rule. [akpm@linux-foundation.org: make damos_sysfs_action_names static] Closes: https://lore.kernel.org/oe-kbuild-all/202506271655.b8yfEZIT-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250622213759.50930-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon/sysfs-schemes: decouple from damos_quota_goal_metricSeongJae Park
Patch series "mm/damon: decouple sysfs from core". DAMON sysfs interface is coupled with core layer. It maintains some of its keywords arrays be synchronized with matching DAMON core API enums. It is unnecessary coupling that makes separated changes for different layers difficult. Decouple the layers by introducing new data structure for the mappings on DAMON sysfs interface. This patch (of 5): Decouple DAMOS sysfs interface from damos_quota_goal_metric. For this, define and use new sysfs-schemes internal data structure that maps the user-space keywords and damos_quota_goal_metric, instead of having the implicit and unflexible array index rule. Link: https://lkml.kernel.org/r/20250622213759.50930-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250622213759.50930-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd()Anshuman Khandual
Memory hot remove unmaps and tears down various kernel page table regions as required. The ptdump code can race with concurrent modifications of the kernel page tables. When leaf entries are modified concurrently, the dump code may log stale or inconsistent information for a VA range, but this is otherwise not harmful. But when intermediate levels of kernel page table are freed, the dump code will continue to use memory that has been freed and potentially reallocated for another purpose. In such cases, the ptdump code may dereference bogus addresses, leading to a number of potential problems. To avoid the above mentioned race condition, platforms such as arm64, riscv and s390 take memory hotplug lock, while dumping kernel page table via the sysfs interface /sys/kernel/debug/kernel_page_tables. Similar race condition exists while checking for pages that might have been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages which in turn calls ptdump_check_wx(). Instead of solving this race condition again, let's just move the memory hotplug lock inside generic ptdump_check_wx() which will benefit both the scenarios. Drop get_online_mems() and put_online_mems() combination from all existing platform ptdump code paths. Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove") Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/memremap: remove unused devmap_managed_keyAlistair Popple
It's no longer used so remove it. Link: https://lkml.kernel.org/r/11516e39f33f809292ffccab1d46062f9bc248b3.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: remove callers of pfn_t functionalityAlistair Popple
All PFN_* pfn_t flags have been removed. Therefore there is no longer a need for the pfn_t type and all uses can be replaced with normal pfns. Link: https://lkml.kernel.org/r/bbedfa576c9822f8032494efbe43544628698b1f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: remove PFN_DEV, PFN_MAP, PFN_SPECIAL, PFN_SG_CHAIN and PFN_SG_LASTAlistair Popple
The PFN_MAP flag is no longer used for anything, so remove it. The PFN_SG_CHAIN and PFN_SG_LAST flags never appear to have been used so also remove them. The last user of PFN_SPECIAL was removed by 653d7825c149 ("dcssblk: mark DAX broken, remove FS_DAX_LIMITED support"). Users of PFN_DEV were removed earlier in this series by "mm: Remove remaining uses of PFN_DEV". Link: https://lkml.kernel.org/r/670b3950d70b4d97b905bb597dadfd3633de4314.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: remove devmap related functions and page table bitsAlistair Popple
Now that DAX and all other reference counts to ZONE_DEVICE pages are managed normally there is no need for the special devmap PTE/PMD/PUD page table bits. So drop all references to these, freeing up a software defined page table bit on architectures supporting it. Link: https://lkml.kernel.org/r/6389398c32cc9daa3dfcaa9f79c7972525d310ce.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Will Deacon <will@kernel.org> # arm64 Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09fs/dax: remove FS_DAX_LIMITED config optionAlistair Popple
The dcssblk driver was the last user of FS_DAX_LIMITED. That was marked broken by 653d7825c149 ("dcssblk: mark DAX broken, remove FS_DAX_LIMITED support") to allow removal of PFN_SPECIAL. However the FS_DAX_LIMITED config option itself was not removed, so do that now. Link: https://lkml.kernel.org/r/b47bf164b4a1013d736fa1a3d501bc8b8e71311f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/khugepaged: remove redundant pmd_devmap() checkAlistair Popple
The pmd_devmap() check in check_pmd_state() is redundant because the only user of pmd_devmap were device dax and fs dax. However all callers of check_pmd_state() first call thp_vma_allowable_order() to check if the vma should be scanned. Except when called from a page fault this always returns 0 for dax vma's, hence we would never expect to see a pmd_devmap entry. Link: https://lkml.kernel.org/r/a68175fd3a37e9b72cc82c1d63fd8b69691a85b5.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: remove redundant pXd_devmap callsAlistair Popple
DAX was the only thing that created pmd_devmap and pud_devmap entries however it no longer does as DAX pages are now refcounted normally and pXd_trans_huge() returns true for those. Therefore checking both pXd_devmap and pXd_trans_huge() is redundant and the former can be removed without changing behaviour as it will always be false. Link: https://lkml.kernel.org/r/d58f089dc16b7feb7c6728164f37dea65d64a0d3.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/huge_memory: remove pXd_devmap usage from insert_pXd_pfn()Alistair Popple
Nothing uses PFN_DEV anymore so no need to create devmap pXd's when mapping a PFN. Instead special mappings will be created which ensures vm_normal_page_pXd() will not return pages which don't have an associated page. This could change behaviour slightly on architectures where pXd_devmap() does not imply pXd_special() as the normal page checks would have fallen through to checking VM_PFNMAP/MIXEDMAP instead, which in theory at least could have returned a page. However vm_normal_page_pXd() should never have been returning pages for pXd_devmap() entries anyway, so anything relying on that would have been a bug. Link: https://lkml.kernel.org/r/cd8658f9ff10afcfffd8b145a39d98bf1c595ffa.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/gup: remove pXX_devmap usage from get_user_pages()Alistair Popple
GUP uses pXX_devmap() calls to see if it needs to a get a reference on the associated pgmap data structure to ensure the pages won't go away. However it's a driver responsibility to ensure that if pages are mapped (ie. discoverable by GUP) that they are not offlined or removed from the memmap so there is no need to hold a reference on the pgmap data structure to ensure this. Furthermore mappings with PFN_DEV are no longer created, hence this effectively dead code anyway so can be removed. Link: https://lkml.kernel.org/r/708b2be76876659ec5261fe5d059b07268b98b36.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: convert vmf_insert_mixed() from using pte_devmap to pte_specialAlistair Popple
DAX no longer requires device PTEs as it always has a ZONE_DEVICE page associated with the PTE that can be reference counted normally. Other users of pte_devmap are drivers that set PFN_DEV when calling vmf_insert_mixed() which ensures vm_normal_page() returns NULL for these entries. There is no reason to distinguish these pte_devmap users so in order to free up a PTE bit use pte_special instead for entries created with vmf_insert_mixed(). This will ensure vm_normal_page() will continue to return NULL for these pages. Architectures that don't support pte_special also don't support pte_devmap so those will continue to rely on pfn_valid() to determine if the page can be mapped. Link: https://lkml.kernel.org/r/93086bd446e7bf8e4c85345613ac18f706b01f60.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: remove remaining uses of PFN_DEVAlistair Popple
PFN_DEV was used by callers of dax_direct_access() to figure out if the returned PFN is associated with a page using pfn_t_has_page() or not. However all DAX PFNs now require an assoicated ZONE_DEVICE page so can assume a page exists. Other users of PFN_DEV were setting it before calling vmf_insert_mixed(). This is unnecessary as it is no longer checked, instead relying on pfn_valid() to determine if there is an associated page or not. Link: https://lkml.kernel.org/r/74b293aebc21b941090bc3e7aeafa91b57c821a5.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: convert pXd_devmap checks to vma_is_daxAlistair Popple
Patch series "mm: Remove pXX_devmap page table bit and pfn_t type", v3. All users of dax now require a ZONE_DEVICE page which is properly refcounted. This means there is no longer any need for the PFN_DEV, PFN_MAP and PFN_SPECIAL flags. Furthermore the PFN_SG_CHAIN and PFN_SG_LAST flags never appear to have been used. It is therefore possible to remove the pfn_t type and replace any usage with raw pfns. The remaining users of PFN_DEV have simply passed this to vmf_insert_mixed() to create pte_devmap() mappings. It is unclear why this was the case but presumably to ensure vm_normal_page() does not return these pages. These users can be trivially converted to raw pfns and creating a pXX_special() mapping to ensure vm_normal_page() still doesn't return these pages. Now that there are no users of PFN_DEV we can remove the devmap page table bit and all associated functions and macros, freeing up a software page table bit. This patch (of 14): Currently dax is the only user of pmd and pud mapped ZONE_DEVICE pages. Therefore page walkers that want to exclude DAX pages can check pmd_devmap or pud_devmap. However soon dax will no longer set PFN_DEV, meaning dax pages are mapped as normal pages. Ensure page walkers that currently use pXd_devmap to skip DAX pages continue to do so by adding explicit checks of the VMA instead. Remove vma_is_dax() check from mm/userfaultfd.c as validate_move_areas() will already skip DAX VMA's on account of them not being anonymous. The check in userfaultfd_must_wait() is also redundant as vma_can_userfault() should have already filtered out dax vma's. For HMM the pud_devmap check seems unnecessary as there is no reason we shouldn't be able to handle any leaf PUD here so remove it also. Link: https://lkml.kernel.org/r/cover.176965585864cb8d2cf41464b44dcc0471e643a0.1750323463.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/f0611f6f475f48fcdf34c65084a359aefef4cccc.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/percpu: conditionally define _shared_alloc_tag via ↵Hao Ge
CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU Recently discovered this entry while checking kallsyms on ARM64: ffff800083e509c0 D _shared_alloc_tag If ARCH_NEEDS_WEAK_PER_CPU is not defined(it is only defined for s390 and alpha architectures), there's no need to statically define the percpu variable _shared_alloc_tag. Therefore, we need to implement isolation for this purpose. When building the core kernel code for s390 or alpha architectures, ARCH_NEEDS_WEAK_PER_CPU remains undefined (as it is gated by #if defined(MODULE)). However, when building modules for these architectures, the macro is explicitly defined. Therefore, we remove all instances of ARCH_NEEDS_WEAK_PER_CPU from the code and introduced CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU to replace the relevant logic. We can now conditionally define the perpcu variable _shared_alloc_tag based on CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU. This allows architectures (such as s390/alpha) that require weak definitions for percpu variables in modules to include the definition, while others can omit it via compile-time exclusion. Link: https://lkml.kernel.org/r/20250618015809.1235761-1-hao.ge@linux.dev Signed-off-by: Hao Ge <gehao@kylinos.cn> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Chistoph Lameter <cl@linux.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/memfd: reserve hugetlb folios before allocationVivek Kasireddy
When we try to allocate a folio via alloc_hugetlb_folio_reserve(), we need to ensure that there is an active reservation associated with the allocation. Otherwise, our allocation request would fail if there are no active reservations made at that moment against any other allocations. This is because alloc_hugetlb_folio_reserve() checks h->resv_huge_pages before proceeding with the allocation. Therefore, to address this issue, we just need to make a reservation (by calling hugetlb_reserve_pages()) before we try to allocate the folio. This will also ensure that proper region/subpool accounting is done associated with our allocation. Link: https://lkml.kernel.org/r/20250618053415.1036185-3-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/hugetlb: make hugetlb_reserve_pages() return nr of entries updatedVivek Kasireddy
Patch series "mm/memfd: Reserve hugetlb folios before allocation", v4. There are cases when we try to pin a folio but discover that it has not been faulted-in. So, we try to allocate it in memfd_alloc_folio() but the allocation request may not succeed if there are no active reservations in the system at that instant. Therefore, making a reservation (by calling hugetlb_reserve_pages()) associated with the allocation will ensure that our request would not fail due to lack of reservations. This will also ensure that proper region/subpool accounting is done with our allocation. This patch (of 3): Currently, hugetlb_reserve_pages() returns a bool to indicate whether the reservation map update for the range [from, to] was successful or not. This is not sufficient for the case where the caller needs to determine how many entries were updated for the range. Therefore, have hugetlb_reserve_pages() return the number of entries updated in the reservation map associated with the range [from, to]. Also, update the callers of hugetlb_reserve_pages() to handle the new return value. Link: https://lkml.kernel.org/r/20250618053415.1036185-1-vivek.kasireddy@intel.com Link: https://lkml.kernel.org/r/20250618053415.1036185-2-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: update core kernel code to use vm_flags_t consistentlyLorenzo Stoakes
The core kernel code is currently very inconsistent in its use of vm_flags_t vs. unsigned long. This prevents us from changing the type of vm_flags_t in the future and is simply not correct, so correct this. While this results in rather a lot of churn, it is a critical pre-requisite for a future planned change to VMA flag type. Additionally, update VMA userland tests to account for the changes. To make review easier and to break things into smaller parts, driver and architecture-specific changes is left for a subsequent commit. The code has been adjusted to cascade the changes across all calling code as far as is needed. We will adjust architecture-specific and driver code in a subsequent patch. Overall, this patch does not introduce any functional change. Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Kees Cook <kees@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09Revert "mm: make alloc_demote_folio externally invokable for migration"SeongJae Park
This reverts commit a00ce85af2a1be494d3b0c9457e8e81cdcce2a89. Commit a00ce85af2a1 ("mm: make alloc_demote_folio externally invokable for migration") was made to let DAMOS_MIGRATE_{HOT,COLD} call the function. But a previous commit made DAMOS_MIGRATE_{HOT,COLD} call alloc_migration_target() instead. Hence there are no more callers of the function outside of vmscan.c. Revert the commit to make the function static again. Link: https://lkml.kernel.org/r/20250616172346.67659-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09Revert "mm: rename alloc_demote_folio to alloc_migrate_folio"SeongJae Park
This reverts commit 8f75267d22bdf8e3baf70f2fa7092d8c2f58da71. Commit 8f75267d22bd ("mm: rename alloc_demote_folio to alloc_migrate_folio") was to reflect the fact the function is called for not only demotion, but also general migrations from DAMOS_MIGRATE_{HOT,COLD}. The previous commit made the DAMOS actions to not use alloc_migrate_folio(), though. So, demote_folio_list() is the only caller of alloc_migrate_folio(), and the name could now be rather confusing. Revert the renaming commit. Link: https://lkml.kernel.org/r/20250616172346.67659-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon/paddr: use alloc_migartion_target() with no migration fallback nodemaskSeongJae Park
Patch series "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD}". DAMOS_MIGRATE_{HOT,COLD} implementation resembles that for demotion, and hence the behavior is also similar to that. But, since those are not only for demotion but general migrations, it would be better to match with that for move_pages() system call. Make the implementation and the behavior more similar to move_pages() by not setting migration fallback nodes, and using alloc_migration_target() instead of alloc_migrate_folio(). alloc_migrate_folio() was renamed from alloc_demote_folio() and been non-static function, to let DAMOS_MIGRATE_{HOT,COLD} call it. As alloc_migration_target() is called instead, the renaming and de-static changes are no more required but could only make future code readers be confused. Revert the changes, too. This patch (of 3): DAMOS_MIGRATE_{HOT,COLD} implementation resembles that for demote_folio_list(). Because those are not only for demotion but general folio migrations, it makes more sense to behave similarly to move_pages() system call. Make the behavior more similar to move_pages(), by using alloc_migration_target() instead of alloc_migrate_folio(), without fallback nodemask. Link: https://lkml.kernel.org/r/20250616172346.67659-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: huge_memory: fix the check for allowed huge orders in shmemBaolin Wang
Shmem already supports mTHP, and shmem_allowable_huge_orders() will return the huge orders allowed by shmem. However, there is no check against the 'orders' parameter passed by __thp_vma_allowable_orders(), which can lead to incorrect check results for __thp_vma_allowable_orders(). For example, when a user wants to check if shmem supports PMD-sized THP by thp_vma_allowable_order(), if shmem only enables 64K mTHP, the current logic would cause thp_vma_allowable_order() to return true, implying that shmem allows PMD-sized THP allocation, which it actually does not. I don't think this will cause a significant impact on users, and this will only have some impact on the shmem THP collapse. That is to say, even though the shmem sysfs setting does not enable the PMD-sized THP, the thp_vma_allowable_order() still indicates that shmem allows PMD-sized collapse, meaning it might successfully collapse into THP, or it might not (for example, thp_vma_suitable_order() check failed in the collapse process). However, this still does not align with the shmem sysfs configuration, fix it. Link: https://lkml.kernel.org/r/529affb3220153d0d5a542960b535cdfc33f51d7.1749804835.git.baolin.wang@linux.alibaba.com Fixes: 26c7d8413aaf ("mm: thp: support "THPeligible" semantics for mTHP with anonymous shmem") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/vma: use vmg->target to specify target VMA for new VMA mergeLorenzo Stoakes
In commit 3a75ccba047b ("mm: simplify vma merge structure and expand comments") we introduced the vmg->target field to make the merging of existing VMAs simpler - clarifying precisely which VMA would eventually become the merged VMA once the merge operation was complete. New VMA merging did not get quite the same treatment, retaining the rather confusing convention of storing the target VMA in vmg->middle. This patch corrects this state of affairs, utilising vmg->target for this purpose for both vma_merge_new_range() and also for vma_expand(). We retain the WARN_ON for vmg->middle being specified in vma_merge_new_range() as doing so would make no sense, but add an additional debug assert for setting vmg->target. This patch additionally updates VMA userland testing to account for this change. [lorenzo.stoakes@oracle.com: make comment consistent in vma_expand()] Link: https://lkml.kernel.org/r/c54f45e3-a6ac-4749-93c0-cc9e3080ee37@lucifer.local Link: https://lkml.kernel.org/r/20250613184807.108089-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09secretmem: remove uses of struct pageMatthew Wilcox (Oracle)
Use filemap_lock_folio() instead of find_lock_page() to retrieve a folio from the page cache. [lorenzo.stoakes@oracle.com: fix check of filemap_lock_folio() return value] Link: https://lkml.kernel.org/r/fdbca1d0-01a3-4653-85ed-cf257bb848be@lucifer.local Link: https://lkml.kernel.org/r/20250613194744.3175157-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud()David Hildenbrand
Marking PUDs that map a "normal" refcounted folios as special is against our rules documented for vm_normal_page(). normal (refcounted) folios shall never have the page table mapping marked as special. Fortunately, there are not that many pud_special() check that can be mislead and are right now rather harmless: e.g., none so far bases decisions whether to grab a folio reference on that decision. Well, and GUP-fast will fallback to GUP-slow. All in all, so far no big implications as it seems. Getting this right will get more important as we introduce folio_normal_page_pud() and start using it in more place where we currently special-case based on other VMA flags. Fix it just like we fixed vmf_insert_folio_pmd(). Add folio_mk_pud() to mimic what we do with folio_mk_pmd(). Link: https://lkml.kernel.org/r/20250613092702.1943533-4-david@redhat.com Fixes: dbe54153296d ("mm/huge_memory: add vmf_insert_folio_pud()") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd()David Hildenbrand
Marking PMDs that map a "normal" refcounted folios as special is against our rules documented for vm_normal_page(): normal (refcounted) folios shall never have the page table mapping marked as special. Fortunately, there are not that many pmd_special() check that can be mislead, and most vm_normal_page_pmd()/vm_normal_folio_pmd() users that would get this wrong right now are rather harmless: e.g., none so far bases decisions whether to grab a folio reference on that decision. Well, and GUP-fast will fallback to GUP-slow. All in all, so far no big implications as it seems. Getting this right will get more important as we use folio_normal_page_pmd() in more places. Fix it by teaching insert_pfn_pmd() to properly handle folios and pfns -- moving refcount/mapcount/etc handling in there, renaming it to insert_pmd(), and distinguishing between both cases using a new simple "struct folio_or_pfn" structure. Use folio_mk_pmd() to create a pmd for a folio cleanly. Link: https://lkml.kernel.org/r/20250613092702.1943533-3-david@redhat.com Fixes: 6c88f72691f8 ("mm/huge_memory: add vmf_insert_folio_pmd()") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()David Hildenbrand
Patch series "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes", v3. While working on improving vm_normal_page() and friends, I stumbled over this issues: refcounted "normal" folios must not be marked using pmd_special() / pud_special(). Otherwise, we're effectively telling the system that these folios are no "normal", violating the rules we documented for vm_normal_page(). Fortunately, there are not many pmd_special()/pud_special() users yet. So far there doesn't seem to be serious damage. Tested using the ndctl tests ("ndctl:dax" suite). This patch (of 3): We set up the cache mode but ... don't forward the updated pgprot to insert_pfn_pud(). Only a problem on x86-64 PAT when mapping PFNs using PUDs that require a special cachemode. Fix it by using the proper pgprot where the cachemode was setup. It is unclear in which configurations we would get the cachemode wrong: through vfio seems possible. Getting cachemodes wrong is usually ... bad. As the fix is easy, let's backport it to stable. Identified by code inspection. Link: https://lkml.kernel.org/r/20250613092702.1943533-1-david@redhat.com Link: https://lkml.kernel.org/r/20250613092702.1943533-2-david@redhat.com Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Dan Williams <dan.j.williams@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: use folio_expected_ref_count() helper for reference countingShivank Garg
Replace open-coded folio reference count calculations with the folio_expected_ref_count(). No functional changes intended. Link: https://lkml.kernel.org/r/20250611052706.515408-2-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: madvise: use per_vma lock for MADV_FREEBarry Song
MADV_FREE is another option, besides MADV_DONTNEED, for dynamic memory freeing in user-space native or Java heap memory management. For example, jemalloc can be configured to use MADV_FREE, and recent versions of the Android Java heap have also increasingly adopted MADV_FREE. Supporting per-VMA locking for MADV_FREE thus appears increasingly necessary. We have replaced walk_page_range() with walk_page_range_vma(). Along with the proposed madvise_lock_mode by Lorenzo, the necessary infrastructure is now in place to begin exploring per-VMA locking support for MADV_FREE and potentially other madvise using walk_page_range_vma(). This patch adds support for the PGWALK_VMA_RDLOCK walk_lock mode in walk_page_range_vma(), and leverages madvise_lock_mode from madv_behavior to select the appropriate walk_lock—either mmap_lock or per-VMA lock—based on the context. Because we now dynamically update the walk_ops->walk_lock field, we must ensure this is thread-safe. The madvise_free_walk_ops is now defined as a stack variable instead of a global constant. Link: https://lkml.kernel.org/r/20250611104745.57405-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: SeongJae Park <sj@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: optimize mremap() by PTE batchingDev Jain
Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes are painted with the contig bit, then ptep_get() will iterate through all 16 entries to collect a/d bits. Hence this optimization will result in a 16x reduction in the number of ptep_get() calls. Next, ptep_get_and_clear() will eventually call contpte_try_unfold() on every contig block, thus flushing the TLB for the complete large folio range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only do them on the starting and ending contig block. For split folios, there will be no pte batching; nr_ptes will be 1. For pagetable splitting, the ptes will still point to the same large folio; for arm64, this results in the optimization described above, and for other arches (including the general case), a minor improvement is expected due to a reduction in the number of function calls. Link: https://lkml.kernel.org/r/20250610035043.75448-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bang Li <libang.li@antgroup.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: bibo mao <maobibo@loongson.cn> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: call pointers to ptes as ptepDev Jain
Patch series "Optimize mremap() for large folios", v4. Currently move_ptes() iterates through ptes one by one. If the underlying folio mapped by the ptes is large, we can process those ptes in a batch using folio_pte_batch(), thus clearing and setting the PTEs in one go. For arm64 specifically, this results in a 16x reduction in the number of ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate through all 16 entries to collect a/d bits), and we also elide extra TLBIs through get_and_clear_full_ptes, replacing ptep_get_and_clear. Mapping 1M of memory with 64K folios, memsetting it, remapping it to src + 1M, and munmapping it 10,000 times, the average execution time reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization, on Apple M3 (arm64). No regression is observed for small folios. Test program for reference: #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <string.h> #include <errno.h> #define SIZE (1UL << 20) // 1M int main(void) { void *new_addr, *addr; for (int i = 0; i < 10000; ++i) { addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (addr == MAP_FAILED) { perror("mmap"); return 1; } memset(addr, 0xAA, SIZE); new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE); if (new_addr != (addr + SIZE)) { perror("mremap"); return 1; } munmap(new_addr, SIZE); } } This patch (of 2): Avoid confusion between pte_t* and pte_t data types by suffixing pointer type variables with p. No functional change. Link: https://lkml.kernel.org/r/20250610035043.75448-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20250610035043.75448-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Bang Li <libang.li@antgroup.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: bibo mao <maobibo@loongson.cn> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09gup: optimize longterm pin_user_pages() for large folioLi Zhe
In the current implementation of longterm pin_user_pages(), we invoke collect_longterm_unpinnable_folios(). This function iterates through the list to check whether each folio belongs to the "longterm_unpinnabled" category. The folios in this list essentially correspond to a contiguous region of userspace addresses, with each folio representing a physical address in increments of PAGESIZE. If this userspace address range is mapped with large folio, we can optimize the performance of function collect_longterm_unpinnable_folios() by reducing the using of READ_ONCE() invoked in pofs_get_folio()->page_folio()->_compound_head(). Also, we can simplify the logic of collect_longterm_unpinnable_folios(). Instead of comparing with prev_folio after calling pofs_get_folio(), we can check whether the next page is within the same folio. The performance test results, based on v6.15, obtained through the gup_test tool from the kernel source tree are as follows. We achieve an improvement of over 66% for large folio with pagesize=2M. For small folio, we have only observed a very slight degradation in performance. Without this patch: [root@localhost ~] ./gup_test -HL -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:14391 put:10858 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 [root@localhost ~]# ./gup_test -LT -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:130538 put:31676 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 With this patch: [root@localhost ~] ./gup_test -HL -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:4867 put:10516 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 [root@localhost ~]# ./gup_test -LT -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:131798 put:31328 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 [lizhe.67@bytedance.com: whitespace fix, per David] Link: https://lkml.kernel.org/r/20250606091917.91384-1-lizhe.67@bytedance.com Link: https://lkml.kernel.org/r/20250606023742.58344-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/pagewalk: split walk_page_range_novma() into kernel/user partsLorenzo Stoakes
walk_page_range_novma() is rather confusing - it supports two modes, one used often, the other used only for debugging. The first mode is the common case of traversal of kernel page tables, which is what nearly all callers use this for. Secondly it provides an unusual debugging interface that allows for the traversal of page tables in a userland range of memory even for that memory which is not described by a VMA. It is far from certain that such page tables should even exist, but perhaps this is precisely why it is useful as a debugging mechanism. As a result, this is utilised by ptdump only. Historically, things were reversed - ptdump was the only user, and other parts of the kernel evolved to use the kernel page table walking here. Since we have some complicated and confusing locking rules for the novma case, it makes sense to separate the two usages into their own functions. Doing this also provide self-documentation as to the intent of the caller - are they doing something rather unusual or are they simply doing a standard kernel page table walk? We therefore establish two separate functions - walk_page_range_debug() for this single usage, and walk_kernel_page_table_range() for general kernel page table walking. The walk_page_range_debug() function is currently used to traverse both userland and kernel mappings, so we maintain this and in the case of kernel mappings being traversed, we have walk_page_range_debug() invoke walk_kernel_page_table_range() internally. We additionally make walk_page_range_debug() internal to mm. Link: https://lkml.kernel.org/r/20250605135104.90720-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Barry Song <baohua@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/memfd: clarify error handling labels in memfd_create()Ye Liu
err_name --> err_free_name (fd failure case) err_fd --> err_free_fd (file failure case) Link: https://lkml.kernel.org/r/20250610083730.527619-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/filemap: allow arch to request folio size for exec memoryRyan Roberts
Change the readahead config so that if it is being requested for an executable mapping, do a synchronous read into a set of folios with an arch-specified order and in a naturally aligned manner. We no longer center the read on the faulting page but simply align it down to the previous natural boundary. Additionally, we don't bother with an asynchronous part. On arm64 if memory is physically contiguous and naturally aligned to the "contpte" size, we can use contpte mappings, which improves utilization of the TLB. When paired with the "multi-size THP" feature, this works well to reduce dTLB pressure. However iTLB pressure is still high due to executable mappings having a low likelihood of being in the required folio size and mapping alignment, even when the filesystem supports readahead into large folios (e.g. XFS). The reason for the low likelihood is that the current readahead algorithm starts with an order-0 folio and increases the folio order by 2 every time the readahead mark is hit. But most executable memory tends to be accessed randomly and so the readahead mark is rarely hit and most executable folios remain order-0. So let's special-case the read(ahead) logic for executable mappings. The trade-off is performance improvement (due to more efficient storage of the translations in iTLB) vs potential for making reclaim more difficult (due to the folios being larger so if a part of the folio is hot the whole thing is considered hot). But executable memory is a small portion of the overall system memory so I doubt this will even register from a reclaim perspective. I've chosen 64K folio size for arm64 which benefits both the 4K and 16K base page size configs. Crucially the same amount of data is still read (usually 128K) so I'm not expecting any read amplification issues. I don't anticipate any write amplification because text is always RO. Note that the text region of an ELF file could be populated into the page cache for other reasons than taking a fault in a mmapped area. The most common case is due to the loader read()ing the header which can be shared with the beginning of text. So some text will still remain in small folios, but this simple, best effort change provides good performance improvements as is. Confine this special-case approach to the bounds of the VMA. This prevents wasting memory for any padding that might exist in the file between sections. Previously the padding would have been contained in order-0 folios and would be easy to reclaim. But now it would be part of a larger folio so more difficult to reclaim. Solve this by simply not reading it into memory in the first place. Benchmarking ============ The below shows pgbench and redis benchmarks on Graviton3 arm64 system. First, confirmation that this patch causes more text to be contained in 64K folios: +----------------------+---------------+---------------+---------------+ | File-backed folios by| system boot | pgbench | redis | | size as percentage of+-------+-------+-------+-------+-------+-------+ | all mapped text mem |before | after |before | after |before | after | +======================+=======+=======+=======+=======+=======+=======+ | base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% | | thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% | | thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% | | thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% | | thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% | | thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% | | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% | | thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% | | thp-partial | 0% | 0% | 0% | 1% | 0% | 1% | +----------------------+-------+-------+-------+-------+-------+-------+ | cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% | +----------------------+-------+-------+-------+-------+-------+-------+ The above shows that for both workloads (each isolated with cgroups) as well as the general system state after boot, the amount of text backed by 4K and 16K folios reduces and the amount backed by 64K folios increases significantly. And the amount of text that is contpte-mapped significantly increases (see last row). And this is reflected in performance improvement. "(I)" indicates a statistically significant improvement. Note TPS and Reqs/sec are rates so bigger is better, ms is time so smaller is better: +-------------+-------------------------------------------+------------+ | Benchmark | Result Class | Improvemnt | +=============+===========================================+============+ | pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% | | | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% | | | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% | | | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% | | | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% | | | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% | | | Scale: 100 Clients: 1 RO (TPS) | 2.51% | | | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% | | | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% | | | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% | | | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% | | | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% | +-------------+-------------------------------------------+------------+ | pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% | | | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% | | | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% | | | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% | | | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% | | | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% | | | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% | | | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% | | | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% | | | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% | +-------------+-------------------------------------------+------------+ [ryan.roberts@arm.com: fix use-after-free] Link: https://lkml.kernel.org/r/ea7f9da7-9a9f-4b85-9d0a-35b320f5ed25@arm.com [ryan.roberts@arm.com: use the vma_pages() helper instead of open-coding] Link: https://lkml.kernel.org/r/0e0f674b-3b7e-494f-ae7a-fc9dbb98dad4@arm.com Link: https://lkml.kernel.org/r/20250609092729.274960-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Will Deacon <will@kernel.org> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: store folio order in struct file_ra_stateRyan Roberts
Previously the folio order of the previous readahead request was inferred from the folio who's readahead marker was hit. But due to the way we have to round to non-natural boundaries sometimes, this first folio in the readahead block is often smaller than the preferred order for that request. This means that for cases where the initial sync readahead is poorly aligned, the folio order will ramp up much more slowly. So instead, let's store the order in struct file_ra_state so we are not affected by any required alignment. We previously made enough room in the struct for a 16 order field. This should be plenty big enough since we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never larger than ~20. Since we now pass order in struct file_ra_state, page_cache_ra_order() no longer needs it's new_order parameter, so let's remove that. Worked example: Here we are touching pages 17-256 sequentially just as we did in the previous commit, but now that we are remembering the preferred order explicitly, we no longer have the slow ramp up problem. Note specifically that we no longer have 2 rounds (2x ~128K) of order-2 folios: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00050000 65536 64 80 16 4 FOLIO 0x00050000 0x00060000 65536 80 96 16 4 FOLIO 0x00060000 0x00080000 131072 96 128 32 5 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5 FOLIO 0x000e0000 0x00100000 131072 224 256 32 5 FOLIO 0x00100000 0x00120000 131072 256 288 32 5 FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y HOLE 0x00140000 0x00800000 7077888 320 2048 1728 Link: https://lkml.kernel.org/r/20250609092729.274960-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: make space in struct file_ra_stateRyan Roberts
We need to be able to store the preferred folio order associated with a readahead request in the struct file_ra_state so that we can more accurately increase the order across subsequent readahead requests. But struct file_ra_state is per-struct file, so we don't really want to increase it's size. mmap_miss is currently 32 bits but it is only counted up to 10 * MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be plenty. Redefine it to unsigned short, making room for order as unsigned short in follow up commit. Link: https://lkml.kernel.org/r/20250609092729.274960-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: terminate async readahead on natural boundaryRyan Roberts
Previously asynchonous readahead would read ra_pages (usually 128K) directly after the end of the synchonous readahead and given the synchronous readahead portion had no alignment guarantees (beyond page boundaries) it is possible (and likely) that the end of the initial 128K region would not fall on a natural boundary for the folio size being used. Therefore smaller folios were used to align down to the required boundary, both at the end of the previous readahead block and at the start of the new one. In the worst cases, this can result in never properly ramping up the folio size, and instead getting stuck oscillating between order-0, -1 and -2 folios. The next readahead will try to use folios whose order is +2 bigger than the folio that had the readahead marker. But because of the alignment requirements, that folio (the first one in the readahead block) can end up being order-0 in some cases. There will be 2 modifications to solve this issue: 1) Calculate the readahead size so the end is aligned to a folio boundary. This prevents needing to allocate small folios to align down at the end of the window and fixes the oscillation problem. 2) Remember the "preferred folio order" in the ra state instead of inferring it from the folio with the readahead marker. This solves the slow ramp up problem (discussed in a subsequent patch). This patch addresses (1) only. A subsequent patch will address (2). Worked example: The following shows the previous pathalogical behaviour when the initial synchronous readahead is unaligned. We start reading at page 17 in the file and read sequentially from there. I'm showing a dump of the pages in the page cache just after we read the first page of the folio with the readahead marker. Initially there are no pages in the page cache: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00800000 8388608 0 2048 2048 Then we access page 17, causing synchonous read-around of 128K with a readahead marker set up at page 25. So far, all as expected: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 Y FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 HOLE 0x00021000 0x00800000 8253440 33 2048 2015 Now access pages 18-25 inclusive. This causes an asynchronous 128K readahead starting at page 33. But since we are unaligned, even though the preferred folio order is 2, the first folio in this batch (the one with the new readahead marker) is order-0: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 Y FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00041000 4096 64 65 1 0 HOLE 0x00041000 0x00800000 8122368 65 2048 1983 Which means that when we now read pages 26-33 and readahead is kicked off again, the new preferred order is 2 (0 + 2), not 4 as we intended: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00041000 4096 64 65 1 0 FOLIO 0x00041000 0x00042000 4096 65 66 1 0 Y FOLIO 0x00042000 0x00044000 8192 66 68 2 1 FOLIO 0x00044000 0x00048000 16384 68 72 4 2 FOLIO 0x00048000 0x0004c000 16384 72 76 4 2 FOLIO 0x0004c000 0x00050000 16384 76 80 4 2 FOLIO 0x00050000 0x00054000 16384 80 84 4 2 FOLIO 0x00054000 0x00058000 16384 84 88 4 2 FOLIO 0x00058000 0x0005c000 16384 88 92 4 2 FOLIO 0x0005c000 0x00060000 16384 92 96 4 2 FOLIO 0x00060000 0x00061000 4096 96 97 1 0 HOLE 0x00061000 0x00800000 7991296 97 2048 1951 This ramp up from order-0 with smaller orders at the edges for alignment cycle continues all the way to the end of the file (not shown). After the change, we round down the end boundary to the order boundary so we no longer get stuck in the cycle and can ramp up the order over time. Note that the rate of the ramp up is still not as we would expect it. We will fix that next. Here we are touching pages 17-256 sequentially: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00044000 16384 64 68 4 2 FOLIO 0x00044000 0x00048000 16384 68 72 4 2 FOLIO 0x00048000 0x0004c000 16384 72 76 4 2 FOLIO 0x0004c000 0x00050000 16384 76 80 4 2 FOLIO 0x00050000 0x00054000 16384 80 84 4 2 FOLIO 0x00054000 0x00058000 16384 84 88 4 2 FOLIO 0x00058000 0x0005c000 16384 88 92 4 2 FOLIO 0x0005c000 0x00060000 16384 92 96 4 2 FOLIO 0x00060000 0x00070000 65536 96 112 16 4 FOLIO 0x00070000 0x00080000 65536 112 128 16 4 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5 FOLIO 0x000e0000 0x00100000 131072 224 256 32 5 FOLIO 0x00100000 0x00120000 131072 256 288 32 5 FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y HOLE 0x00140000 0x00800000 7077888 320 2048 1728 Link: https://lkml.kernel.org/r/20250609092729.274960-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: honour new_order in page_cache_ra_order()Ryan Roberts
Patch series "Readahead tweaks for larger folios", v5. This series adds some tweaks to readahead so that it does a better job of ramping up folio sizes as readahead extends further into the file. And it additionally special-cases executable mappings to allow the arch to request a preferred folio size for text. This patch (of 5): page_cache_ra_order() takes a parameter called new_order, which is intended to express the preferred order of the folios that will be allocated for the readahead operation. Most callers indeed call this with their preferred new order. But page_cache_async_ra() calls it with the preferred order of the previous readahead request (actually the order of the folio that had the readahead marker, which may be smaller when alignment comes into play). And despite the parameter name, page_cache_ra_order() always treats it at the old order, adding 2 to it on entry. As a result, a cold readahead always starts with order-2 folios. Let's fix this behaviour by always passing in the *new* order. Worked example: Prior to the change, mmaping an 8MB file and touching each page sequentially, resulted in the following, where we start with order-2 folios for the first 128K then ramp up to order-4 for the next 128K, then get clamped to order-5 for the rest of the file because pa_pages is limited to 128K: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER ----- ---------- ---------- --------- ------- ------- ----- ----- FOLIO 0x00000000 0x00004000 16384 0 4 4 2 FOLIO 0x00004000 0x00008000 16384 4 8 4 2 FOLIO 0x00008000 0x0000c000 16384 8 12 4 2 FOLIO 0x0000c000 0x00010000 16384 12 16 4 2 FOLIO 0x00010000 0x00014000 16384 16 20 4 2 FOLIO 0x00014000 0x00018000 16384 20 24 4 2 FOLIO 0x00018000 0x0001c000 16384 24 28 4 2 FOLIO 0x0001c000 0x00020000 16384 28 32 4 2 FOLIO 0x00020000 0x00030000 65536 32 48 16 4 FOLIO 0x00030000 0x00040000 65536 48 64 16 4 FOLIO 0x00040000 0x00060000 131072 64 96 32 5 FOLIO 0x00060000 0x00080000 131072 96 128 32 5 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 ... After the change, the same operation results in the first 128K being order-0, then we start ramping up to order-2, -4, and finally get clamped at order-5: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER ----- ---------- ---------- --------- ------- ------- ----- ----- FOLIO 0x00000000 0x00001000 4096 0 1 1 0 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00024000 16384 32 36 4 2 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00050000 65536 64 80 16 4 FOLIO 0x00050000 0x00060000 65536 80 96 16 4 FOLIO 0x00060000 0x00080000 131072 96 128 32 5 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5 ... Link: https://lkml.kernel.org/r/20250609092729.274960-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20250609092729.274960-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>