summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-07-09mm: convert pXd_devmap checks to vma_is_daxAlistair Popple
Patch series "mm: Remove pXX_devmap page table bit and pfn_t type", v3. All users of dax now require a ZONE_DEVICE page which is properly refcounted. This means there is no longer any need for the PFN_DEV, PFN_MAP and PFN_SPECIAL flags. Furthermore the PFN_SG_CHAIN and PFN_SG_LAST flags never appear to have been used. It is therefore possible to remove the pfn_t type and replace any usage with raw pfns. The remaining users of PFN_DEV have simply passed this to vmf_insert_mixed() to create pte_devmap() mappings. It is unclear why this was the case but presumably to ensure vm_normal_page() does not return these pages. These users can be trivially converted to raw pfns and creating a pXX_special() mapping to ensure vm_normal_page() still doesn't return these pages. Now that there are no users of PFN_DEV we can remove the devmap page table bit and all associated functions and macros, freeing up a software page table bit. This patch (of 14): Currently dax is the only user of pmd and pud mapped ZONE_DEVICE pages. Therefore page walkers that want to exclude DAX pages can check pmd_devmap or pud_devmap. However soon dax will no longer set PFN_DEV, meaning dax pages are mapped as normal pages. Ensure page walkers that currently use pXd_devmap to skip DAX pages continue to do so by adding explicit checks of the VMA instead. Remove vma_is_dax() check from mm/userfaultfd.c as validate_move_areas() will already skip DAX VMA's on account of them not being anonymous. The check in userfaultfd_must_wait() is also redundant as vma_can_userfault() should have already filtered out dax vma's. For HMM the pud_devmap check seems unnecessary as there is no reason we shouldn't be able to handle any leaf PUD here so remove it also. Link: https://lkml.kernel.org/r/cover.176965585864cb8d2cf41464b44dcc0471e643a0.1750323463.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/f0611f6f475f48fcdf34c65084a359aefef4cccc.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/percpu: conditionally define _shared_alloc_tag via ↵Hao Ge
CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU Recently discovered this entry while checking kallsyms on ARM64: ffff800083e509c0 D _shared_alloc_tag If ARCH_NEEDS_WEAK_PER_CPU is not defined(it is only defined for s390 and alpha architectures), there's no need to statically define the percpu variable _shared_alloc_tag. Therefore, we need to implement isolation for this purpose. When building the core kernel code for s390 or alpha architectures, ARCH_NEEDS_WEAK_PER_CPU remains undefined (as it is gated by #if defined(MODULE)). However, when building modules for these architectures, the macro is explicitly defined. Therefore, we remove all instances of ARCH_NEEDS_WEAK_PER_CPU from the code and introduced CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU to replace the relevant logic. We can now conditionally define the perpcu variable _shared_alloc_tag based on CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU. This allows architectures (such as s390/alpha) that require weak definitions for percpu variables in modules to include the definition, while others can omit it via compile-time exclusion. Link: https://lkml.kernel.org/r/20250618015809.1235761-1-hao.ge@linux.dev Signed-off-by: Hao Ge <gehao@kylinos.cn> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Chistoph Lameter <cl@linux.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09selftests/udmabuf: add a test to pin first before writing to memfdVivek Kasireddy
Unlike the existing tests, this new test will create a memfd (backed by hugetlb) and pin the folios in it (a small subset) before writing/ populating it with data. This is a valid use-case that invokes the memfd_alloc_folio() kernel API and is expected to work unless there aren't enough hugetlb folios to satisfy the allocation needs. Link: https://lkml.kernel.org/r/20250618053415.1036185-4-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/memfd: reserve hugetlb folios before allocationVivek Kasireddy
When we try to allocate a folio via alloc_hugetlb_folio_reserve(), we need to ensure that there is an active reservation associated with the allocation. Otherwise, our allocation request would fail if there are no active reservations made at that moment against any other allocations. This is because alloc_hugetlb_folio_reserve() checks h->resv_huge_pages before proceeding with the allocation. Therefore, to address this issue, we just need to make a reservation (by calling hugetlb_reserve_pages()) before we try to allocate the folio. This will also ensure that proper region/subpool accounting is done associated with our allocation. Link: https://lkml.kernel.org/r/20250618053415.1036185-3-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/hugetlb: make hugetlb_reserve_pages() return nr of entries updatedVivek Kasireddy
Patch series "mm/memfd: Reserve hugetlb folios before allocation", v4. There are cases when we try to pin a folio but discover that it has not been faulted-in. So, we try to allocate it in memfd_alloc_folio() but the allocation request may not succeed if there are no active reservations in the system at that instant. Therefore, making a reservation (by calling hugetlb_reserve_pages()) associated with the allocation will ensure that our request would not fail due to lack of reservations. This will also ensure that proper region/subpool accounting is done with our allocation. This patch (of 3): Currently, hugetlb_reserve_pages() returns a bool to indicate whether the reservation map update for the range [from, to] was successful or not. This is not sufficient for the case where the caller needs to determine how many entries were updated for the range. Therefore, have hugetlb_reserve_pages() return the number of entries updated in the reservation map associated with the range [from, to]. Also, update the callers of hugetlb_reserve_pages() to handle the new return value. Link: https://lkml.kernel.org/r/20250618053415.1036185-1-vivek.kasireddy@intel.com Link: https://lkml.kernel.org/r/20250618053415.1036185-2-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09codetag: avoid unused alloc_tags sections/symbolsPetr Pavlu
With CONFIG_MEM_ALLOC_PROFILING=n, vmlinux and all modules unnecessarily contain the symbols __start_alloc_tags and __stop_alloc_tags, which define an empty range. In the case of modules, the presence of these symbols also forces the linker to create an empty .codetag.alloc_tags section. Update codetag.lds.h to make the data conditional on CONFIG_MEM_ALLOC_PROFILING. Link: https://lkml.kernel.org/r/20250618125037.53182-1-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Casey Chen <cachen@purestorage.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon: fix minor typos in damon headerNathan Gao
Fix typos in include/linux/damon.h. Link: https://lkml.kernel.org/r/20250618163331.54910-1-sj@kernel.org Signed-off-by: Nathan Gao <zcgao@amazon.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: update architecture and driver code to use vm_flags_tLorenzo Stoakes
In future we intend to change the vm_flags_t type, so it isn't correct for architecture and driver code to assume it is unsigned long. Correct this assumption across the board. Overall, this patch does not introduce any functional change. Link: https://lkml.kernel.org/r/b6eb1894abc5555ece80bb08af5c022ef780c8bc.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: update core kernel code to use vm_flags_t consistentlyLorenzo Stoakes
The core kernel code is currently very inconsistent in its use of vm_flags_t vs. unsigned long. This prevents us from changing the type of vm_flags_t in the future and is simply not correct, so correct this. While this results in rather a lot of churn, it is a critical pre-requisite for a future planned change to VMA flag type. Additionally, update VMA userland tests to account for the changes. To make review easier and to break things into smaller parts, driver and architecture-specific changes is left for a subsequent commit. The code has been adjusted to cascade the changes across all calling code as far as is needed. We will adjust architecture-specific and driver code in a subsequent patch. Overall, this patch does not introduce any functional change. Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Kees Cook <kees@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: change vm_get_page_prot() to accept vm_flags_t argumentLorenzo Stoakes
Patch series "use vm_flags_t consistently". The VMA flags field vma->vm_flags is of type vm_flags_t. Right now this is exactly equivalent to unsigned long, but it should not be assumed to be. Much code that references vma->vm_flags already correctly uses vm_flags_t, but a fairly large chunk of code simply uses unsigned long and assumes that the two are equivalent. This series corrects that and has us use vm_flags_t consistently. This series is motivated by the desire to, in a future series, adjust vm_flags_t to be a u64 regardless of whether the kernel is 32-bit or 64-bit in order to deal with the VMA flag exhaustion issue and avoid all the various problems that arise from it (being unable to use certain features in 32-bit, being unable to add new flags except for 64-bit, etc.) This is therefore a critical first step towards that goal. At any rate, using the correct type is of value regardless. We additionally take the opportunity to refer to VMA flags as vm_flags where possible to make clear what we're referring to. Overall, this series does not introduce any functional change. This patch (of 3): We abstract the type of the VMA flags to vm_flags_t, however in may places it is simply assumed this is unsigned long, which is simply incorrect. At the moment this is simply an incongruity, however in future we plan to change this type and therefore this change is a critical requirement for doing so. Overall, this patch does not introduce any functional change. [lorenzo.stoakes@oracle.com: add missing vm_get_page_prot() instance, remove include] Link: https://lkml.kernel.org/r/552f88e1-2df8-4e95-92b8-812f7c8db829@lucifer.local Link: https://lkml.kernel.org/r/cover.1750274467.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/a12769720a2743f235643b158c4f4f0a9911daf0.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09Revert "mm: make alloc_demote_folio externally invokable for migration"SeongJae Park
This reverts commit a00ce85af2a1be494d3b0c9457e8e81cdcce2a89. Commit a00ce85af2a1 ("mm: make alloc_demote_folio externally invokable for migration") was made to let DAMOS_MIGRATE_{HOT,COLD} call the function. But a previous commit made DAMOS_MIGRATE_{HOT,COLD} call alloc_migration_target() instead. Hence there are no more callers of the function outside of vmscan.c. Revert the commit to make the function static again. Link: https://lkml.kernel.org/r/20250616172346.67659-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09Revert "mm: rename alloc_demote_folio to alloc_migrate_folio"SeongJae Park
This reverts commit 8f75267d22bdf8e3baf70f2fa7092d8c2f58da71. Commit 8f75267d22bd ("mm: rename alloc_demote_folio to alloc_migrate_folio") was to reflect the fact the function is called for not only demotion, but also general migrations from DAMOS_MIGRATE_{HOT,COLD}. The previous commit made the DAMOS actions to not use alloc_migrate_folio(), though. So, demote_folio_list() is the only caller of alloc_migrate_folio(), and the name could now be rather confusing. Revert the renaming commit. Link: https://lkml.kernel.org/r/20250616172346.67659-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon/paddr: use alloc_migartion_target() with no migration fallback nodemaskSeongJae Park
Patch series "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD}". DAMOS_MIGRATE_{HOT,COLD} implementation resembles that for demotion, and hence the behavior is also similar to that. But, since those are not only for demotion but general migrations, it would be better to match with that for move_pages() system call. Make the implementation and the behavior more similar to move_pages() by not setting migration fallback nodes, and using alloc_migration_target() instead of alloc_migrate_folio(). alloc_migrate_folio() was renamed from alloc_demote_folio() and been non-static function, to let DAMOS_MIGRATE_{HOT,COLD} call it. As alloc_migration_target() is called instead, the renaming and de-static changes are no more required but could only make future code readers be confused. Revert the changes, too. This patch (of 3): DAMOS_MIGRATE_{HOT,COLD} implementation resembles that for demote_folio_list(). Because those are not only for demotion but general folio migrations, it makes more sense to behave similarly to move_pages() system call. Make the behavior more similar to move_pages(), by using alloc_migration_target() instead of alloc_migrate_folio(), without fallback nodemask. Link: https://lkml.kernel.org/r/20250616172346.67659-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09tools/testing/radix-tree: test maple tree chaining mas_preallocate() callsLiam R. Howlett
Testing calling multiple mas_preallocate() calls in a row after adjusting the maple state. Ensures new calls to mas_preallocate() will change the number of allocated nodes. Link: https://lkml.kernel.org/r/20250616184521.3382795-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Hailong Liu <hailong.liu@oppo.com> Cc: "Liam R. Howlett" <howlett@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09testing/radix-tree/maple: increase readers and reduce delay for faster machinesLiam R. Howlett
Faster machines may not see the initial or updated value in the race condition. Reduce the delay so that faster machines are less likely to fail testing of the race conditions. Link: https://lkml.kernel.org/r/20250616184521.3382795-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <howlett@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Hailong Liu <hailong.liu@oppo.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: huge_memory: fix the check for allowed huge orders in shmemBaolin Wang
Shmem already supports mTHP, and shmem_allowable_huge_orders() will return the huge orders allowed by shmem. However, there is no check against the 'orders' parameter passed by __thp_vma_allowable_orders(), which can lead to incorrect check results for __thp_vma_allowable_orders(). For example, when a user wants to check if shmem supports PMD-sized THP by thp_vma_allowable_order(), if shmem only enables 64K mTHP, the current logic would cause thp_vma_allowable_order() to return true, implying that shmem allows PMD-sized THP allocation, which it actually does not. I don't think this will cause a significant impact on users, and this will only have some impact on the shmem THP collapse. That is to say, even though the shmem sysfs setting does not enable the PMD-sized THP, the thp_vma_allowable_order() still indicates that shmem allows PMD-sized collapse, meaning it might successfully collapse into THP, or it might not (for example, thp_vma_suitable_order() check failed in the collapse process). However, this still does not align with the shmem sysfs configuration, fix it. Link: https://lkml.kernel.org/r/529affb3220153d0d5a542960b535cdfc33f51d7.1749804835.git.baolin.wang@linux.alibaba.com Fixes: 26c7d8413aaf ("mm: thp: support "THPeligible" semantics for mTHP with anonymous shmem") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09selftest/mm: skip if fallocate() is unsupported in gup_longtermMark Brown
Currently gup_longterm assumes that filesystems support fallocate() and uses that to allocate space in files, however this is an optional feature and is in particular not implemented by NFSv3 which is commonly used in CI systems leading to spurious failures. Check for lack of support and report a skip instead for that case. Link: https://lkml.kernel.org/r/20250613-selftest-mm-gup-longterm-fallocate-nfs-v1-1-758a104c175f@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/vma: use vmg->target to specify target VMA for new VMA mergeLorenzo Stoakes
In commit 3a75ccba047b ("mm: simplify vma merge structure and expand comments") we introduced the vmg->target field to make the merging of existing VMAs simpler - clarifying precisely which VMA would eventually become the merged VMA once the merge operation was complete. New VMA merging did not get quite the same treatment, retaining the rather confusing convention of storing the target VMA in vmg->middle. This patch corrects this state of affairs, utilising vmg->target for this purpose for both vma_merge_new_range() and also for vma_expand(). We retain the WARN_ON for vmg->middle being specified in vma_merge_new_range() as doing so would make no sense, but add an additional debug assert for setting vmg->target. This patch additionally updates VMA userland testing to account for this change. [lorenzo.stoakes@oracle.com: make comment consistent in vma_expand()] Link: https://lkml.kernel.org/r/c54f45e3-a6ac-4749-93c0-cc9e3080ee37@lucifer.local Link: https://lkml.kernel.org/r/20250613184807.108089-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09highmem: remove a use of folio->pageMatthew Wilcox (Oracle)
Call folio_address() instead of page_address(). Link: https://lkml.kernel.org/r/20250613194825.3175276-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09secretmem: remove uses of struct pageMatthew Wilcox (Oracle)
Use filemap_lock_folio() instead of find_lock_page() to retrieve a folio from the page cache. [lorenzo.stoakes@oracle.com: fix check of filemap_lock_folio() return value] Link: https://lkml.kernel.org/r/fdbca1d0-01a3-4653-85ed-cf257bb848be@lucifer.local Link: https://lkml.kernel.org/r/20250613194744.3175157-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud()David Hildenbrand
Marking PUDs that map a "normal" refcounted folios as special is against our rules documented for vm_normal_page(). normal (refcounted) folios shall never have the page table mapping marked as special. Fortunately, there are not that many pud_special() check that can be mislead and are right now rather harmless: e.g., none so far bases decisions whether to grab a folio reference on that decision. Well, and GUP-fast will fallback to GUP-slow. All in all, so far no big implications as it seems. Getting this right will get more important as we introduce folio_normal_page_pud() and start using it in more place where we currently special-case based on other VMA flags. Fix it just like we fixed vmf_insert_folio_pmd(). Add folio_mk_pud() to mimic what we do with folio_mk_pmd(). Link: https://lkml.kernel.org/r/20250613092702.1943533-4-david@redhat.com Fixes: dbe54153296d ("mm/huge_memory: add vmf_insert_folio_pud()") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd()David Hildenbrand
Marking PMDs that map a "normal" refcounted folios as special is against our rules documented for vm_normal_page(): normal (refcounted) folios shall never have the page table mapping marked as special. Fortunately, there are not that many pmd_special() check that can be mislead, and most vm_normal_page_pmd()/vm_normal_folio_pmd() users that would get this wrong right now are rather harmless: e.g., none so far bases decisions whether to grab a folio reference on that decision. Well, and GUP-fast will fallback to GUP-slow. All in all, so far no big implications as it seems. Getting this right will get more important as we use folio_normal_page_pmd() in more places. Fix it by teaching insert_pfn_pmd() to properly handle folios and pfns -- moving refcount/mapcount/etc handling in there, renaming it to insert_pmd(), and distinguishing between both cases using a new simple "struct folio_or_pfn" structure. Use folio_mk_pmd() to create a pmd for a folio cleanly. Link: https://lkml.kernel.org/r/20250613092702.1943533-3-david@redhat.com Fixes: 6c88f72691f8 ("mm/huge_memory: add vmf_insert_folio_pmd()") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()David Hildenbrand
Patch series "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes", v3. While working on improving vm_normal_page() and friends, I stumbled over this issues: refcounted "normal" folios must not be marked using pmd_special() / pud_special(). Otherwise, we're effectively telling the system that these folios are no "normal", violating the rules we documented for vm_normal_page(). Fortunately, there are not many pmd_special()/pud_special() users yet. So far there doesn't seem to be serious damage. Tested using the ndctl tests ("ndctl:dax" suite). This patch (of 3): We set up the cache mode but ... don't forward the updated pgprot to insert_pfn_pud(). Only a problem on x86-64 PAT when mapping PFNs using PUDs that require a special cachemode. Fix it by using the proper pgprot where the cachemode was setup. It is unclear in which configurations we would get the cachemode wrong: through vfio seems possible. Getting cachemodes wrong is usually ... bad. As the fix is easy, let's backport it to stable. Identified by code inspection. Link: https://lkml.kernel.org/r/20250613092702.1943533-1-david@redhat.com Link: https://lkml.kernel.org/r/20250613092702.1943533-2-david@redhat.com Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Dan Williams <dan.j.williams@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09selftests: mm: add shmem collapse as a default test itemBaolin Wang
Currently, we only test anonymous memory collapse by default. We should also add shmem collapse as a default test item to catch issues that could break the test cases. Link: https://lkml.kernel.org/r/a30b1529b399f2e649b5a05c3d352f41a68faeae.1749779183.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Tested-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09selftests: khugepaged: fix the shmem collapse failureBaolin Wang
When running the khugepaged selftest for shmem (./khugepaged all:shmem), I encountered the following test failures: : Run test: collapse_full (khugepaged:shmem) : Collapse multiple fully populated PTE table.... Fail : ... : Run test: collapse_single_pte_entry (khugepaged:shmem) : Collapse PTE table with single PTE entry present.... Fail : ... : Run test: collapse_full_of_compound (khugepaged:shmem) : Allocate huge page... OK : Split huge page leaving single PTE page table full of compound pages... OK : Collapse PTE table full of compound pages.... Fail The reason for the failure is that it will set MADV_NOHUGEPAGE to prevent khugepaged from continuing to scan shmem VMA after khugepaged finishes scanning in the wait_for_scan() function. Moreover, shmem requires a refault to establish PMD mappings. However, after commit 2b0f922323cc ("mm: don't install PMD mappings when THPs are disabled by the hw/process/vma"), PMD mappings are prevented if the VMA is set with MADV_NOHUGEPAGE flag, so shmem cannot establish PMD mappings during refault. One way to fix this issue is to move the MADV_NOHUGEPAGE setting after the shmem refault. After shmem refault and check huge, the test case will unmap the shmem immediately. So it seems unnecessary to set the MADV_NOHUGEPAGE. Then we can simply drop the MADV_NOHUGEPAGE setting, and all khugepaged test cases passed. Link: https://lkml.kernel.org/r/d8502fc50d0304c2afd27ced062b1d636b7a872e.1749779183.git.baolin.wang@linux.alibaba.com Fixes: 2b0f922323cc ("mm: don't install PMD mappings when THPs are disabled by the hw/process/vma") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Tested-by: Dev Jain <dev.jain@arm.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: remove zero_user()Matthew Wilcox (Oracle)
All users have now been converted to either memzero_page() or folio_zero_range(). Link: https://lkml.kernel.org/r/20250612143443.2848197-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Xiubo Li <xiubli@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09ceph: convert ceph_zero_partial_page() to use a folioMatthew Wilcox (Oracle)
Retrieve a folio from the pagecache instead of a page and operate on it. Removes several hidden calls to compound_head() along with calls to deprecated functions like wait_on_page_writeback() and find_lock_page(). [dan.carpenter@linaro.org: fix NULL vs IS_ERR() bug in ceph_zero_partial_page()] Link: https://lkml.kernel.org/r/685c1424.050a0220.baa8.d6a1@mx.google.com Link: https://lkml.kernel.org/r/20250612143443.2848197-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Xiubo Li <xiubli@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Alex Markuze <amarkuze@redhat.com> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09direct-io: use memzero_page()Matthew Wilcox (Oracle)
memzero_page() is the new name for zero_user(). Link: https://lkml.kernel.org/r/20250612143443.2848197-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Xiubo Li <xiubli@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09null_blk: use memzero_page()Matthew Wilcox (Oracle)
memzero_page() is the new name for zero_user(). Link: https://lkml.kernel.org/r/20250612143443.2848197-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Xiubo Li <xiubli@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09bio: use memzero_page() in bio_truncate()Matthew Wilcox (Oracle)
Patch series "Remove zero_user()". The zero_user() API is almost unused these days. Finish the job of removing it. This patch (of 5): memzero_page() is the new name for zero_user(). Link: https://lkml.kernel.org/r/20250612143443.2848197-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250612143443.2848197-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: use folio_expected_ref_count() helper for reference countingShivank Garg
Replace open-coded folio reference count calculations with the folio_expected_ref_count(). No functional changes intended. Link: https://lkml.kernel.org/r/20250611052706.515408-2-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09selftests/mm: use generic read_sysfs in thuge-gen testPu Lehui
As generic read_sysfs is available in vm_utils, let's use is in thuge-gen test. Link: https://lkml.kernel.org/r/20250611100106.1331197-1-pulehui@huaweicloud.com Signed-off-by: Pu Lehui <pulehui@huawei.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: madvise: use per_vma lock for MADV_FREEBarry Song
MADV_FREE is another option, besides MADV_DONTNEED, for dynamic memory freeing in user-space native or Java heap memory management. For example, jemalloc can be configured to use MADV_FREE, and recent versions of the Android Java heap have also increasingly adopted MADV_FREE. Supporting per-VMA locking for MADV_FREE thus appears increasingly necessary. We have replaced walk_page_range() with walk_page_range_vma(). Along with the proposed madvise_lock_mode by Lorenzo, the necessary infrastructure is now in place to begin exploring per-VMA locking support for MADV_FREE and potentially other madvise using walk_page_range_vma(). This patch adds support for the PGWALK_VMA_RDLOCK walk_lock mode in walk_page_range_vma(), and leverages madvise_lock_mode from madv_behavior to select the appropriate walk_lock—either mmap_lock or per-VMA lock—based on the context. Because we now dynamically update the walk_ops->walk_lock field, we must ensure this is thread-safe. The madvise_free_walk_ops is now defined as a stack variable instead of a global constant. Link: https://lkml.kernel.org/r/20250611104745.57405-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: SeongJae Park <sj@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: optimize mremap() by PTE batchingDev Jain
Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes are painted with the contig bit, then ptep_get() will iterate through all 16 entries to collect a/d bits. Hence this optimization will result in a 16x reduction in the number of ptep_get() calls. Next, ptep_get_and_clear() will eventually call contpte_try_unfold() on every contig block, thus flushing the TLB for the complete large folio range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only do them on the starting and ending contig block. For split folios, there will be no pte batching; nr_ptes will be 1. For pagetable splitting, the ptes will still point to the same large folio; for arm64, this results in the optimization described above, and for other arches (including the general case), a minor improvement is expected due to a reduction in the number of function calls. Link: https://lkml.kernel.org/r/20250610035043.75448-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bang Li <libang.li@antgroup.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: bibo mao <maobibo@loongson.cn> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: call pointers to ptes as ptepDev Jain
Patch series "Optimize mremap() for large folios", v4. Currently move_ptes() iterates through ptes one by one. If the underlying folio mapped by the ptes is large, we can process those ptes in a batch using folio_pte_batch(), thus clearing and setting the PTEs in one go. For arm64 specifically, this results in a 16x reduction in the number of ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate through all 16 entries to collect a/d bits), and we also elide extra TLBIs through get_and_clear_full_ptes, replacing ptep_get_and_clear. Mapping 1M of memory with 64K folios, memsetting it, remapping it to src + 1M, and munmapping it 10,000 times, the average execution time reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization, on Apple M3 (arm64). No regression is observed for small folios. Test program for reference: #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <string.h> #include <errno.h> #define SIZE (1UL << 20) // 1M int main(void) { void *new_addr, *addr; for (int i = 0; i < 10000; ++i) { addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (addr == MAP_FAILED) { perror("mmap"); return 1; } memset(addr, 0xAA, SIZE); new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE); if (new_addr != (addr + SIZE)) { perror("mremap"); return 1; } munmap(new_addr, SIZE); } } This patch (of 2): Avoid confusion between pte_t* and pte_t data types by suffixing pointer type variables with p. No functional change. Link: https://lkml.kernel.org/r/20250610035043.75448-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20250610035043.75448-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Bang Li <libang.li@antgroup.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: bibo mao <maobibo@loongson.cn> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/memory-tier: fix abstract distance calculation overflowLi Zhijian
In mt_perf_to_adistance(), the calculation of abstract distance (adist) involves multiplying several int values including MEMTIER_ADISTANCE_DRAM. *adist = MEMTIER_ADISTANCE_DRAM * (perf->read_latency + perf->write_latency) / (default_dram_perf.read_latency + default_dram_perf.write_latency) * (default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) / (perf->read_bandwidth + perf->write_bandwidth); Since these values can be large, the multiplication may exceed the maximum value of an int (INT_MAX) and overflow (Our platform did), leading to an incorrect adist. User-visible impact: The memory tiering subsystem will misinterpret slow memory (like CXL) as faster than DRAM, causing inappropriate demotion of pages from CXL (slow memory) to DRAM (fast memory). For example, we will see the following demotion chains from the dmesg, where Node0,1 are DRAM, and Node2,3 are CXL node: Demotion targets for Node 0: null Demotion targets for Node 1: null Demotion targets for Node 2: preferred: 0-1, fallback: 0-1 Demotion targets for Node 3: preferred: 0-1, fallback: 0-1 Change MEMTIER_ADISTANCE_DRAM to be a long constant by writing it with the 'L' suffix. This prevents the overflow because the multiplication will then be done in the long type which has a larger range. Link: https://lkml.kernel.org/r/20250611023439.2845785-1-lizhijian@fujitsu.com Link: https://lkml.kernel.org/r/20250610062751.2365436-1-lizhijian@fujitsu.com Fixes: 3718c02dbd4c ("acpi, hmat: calculate abstract distance with HMAT") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09alloc_tag: keep codetag iterator active between read()David Wang
When reading /proc/allocinfo, for each read syscall, seq_file would invoke start/stop callbacks. In start callback, a memory is alloced to store iterator and the iterator would start from beginning to walk linearly to current read position. seq_file read() takes at most 4096 bytes, even if read with a larger user space buffer, meaning read out all of /proc/allocinfo, tens of read syscalls are needed. For example, a 306036 bytes allocinfo files need 76 reads: $ sudo cat /proc/allocinfo | wc 3964 16678 306036 $ sudo strace -T -e read cat /proc/allocinfo ... read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062> ... read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150> ... For those n=3964 lines, each read takes about m=3964/76=52 lines, since iterator restart from beginning for each read(), it would move forward m steps on 1st read 2*m steps on 2nd read 3*m steps on 3rd read ... n steps on last read As read() along, those linear seek steps make read() calls slower and slower. Adding those up, codetag iterator moves about O(n*n/m) steps, making data structure traversal take significant part of the whole reading. Profiling when stress reading /proc/allocinfo confirms it: vfs_read(99.959% 1677299/1677995) proc_reg_read_iter(99.856% 1674881/1677299) seq_read_iter(99.959% 1674191/1674881) allocinfo_start(75.664% 1266755/1674191) codetag_next_ct(79.217% 1003487/1266755) <--- srso_return_thunk(1.264% 16011/1266755) __kmalloc_cache_noprof(0.102% 1296/1266755) ... allocinfo_show(21.287% 356378/1674191) allocinfo_next(1.530% 25621/1674191) codetag_next_ct() takes major part. A private data alloced at open() time can be used to carry iterator alive across read() calls, and avoid the memory allocation and iterator reset for each read(). This way, only O(1) memory allocation and O(n) steps iterating, and `time` shows performance improvement from ~7ms to ~4ms. Profiling with the change: vfs_read(99.865% 1581073/1583214) proc_reg_read_iter(99.485% 1572934/1581073) seq_read_iter(99.846% 1570519/1572934) allocinfo_show(87.428% 1373074/1570519) seq_buf_printf(83.695% 1149196/1373074) seq_buf_putc(1.917% 26321/1373074) _find_next_bit(1.531% 21023/1373074) ... codetag_to_text(0.490% 6727/1373074) ... allocinfo_next(6.275% 98543/1570519) ... allocinfo_start(0.369% 5790/1570519) ... Now seq_buf_printf() takes major part. Link: https://lkml.kernel.org/r/20250609064408.112783-1-00107082@163.com Signed-off-by: David Wang <00107082@163.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09alloc_tag: add sequence number for module and iteratorDavid Wang
Codetag iterator use <id,address> pair to guarantee the validness. But both id and address can be reused, there is theoretical possibility when module inserted right after another module removed, kmalloc returns an address same as the address kfree by previous module and IDR key reuses the key recently removed. Add a sequence number to codetag_module and code_iterator, the sequence number is strickly incremented whenever a module is loaded. An iterator is valid if and only if its sequence number match codetag_module's. Link: https://lkml.kernel.org/r/20250609064200.112639-1-00107082@163.com Signed-off-by: David Wang <00107082@163.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09gup: optimize longterm pin_user_pages() for large folioLi Zhe
In the current implementation of longterm pin_user_pages(), we invoke collect_longterm_unpinnable_folios(). This function iterates through the list to check whether each folio belongs to the "longterm_unpinnabled" category. The folios in this list essentially correspond to a contiguous region of userspace addresses, with each folio representing a physical address in increments of PAGESIZE. If this userspace address range is mapped with large folio, we can optimize the performance of function collect_longterm_unpinnable_folios() by reducing the using of READ_ONCE() invoked in pofs_get_folio()->page_folio()->_compound_head(). Also, we can simplify the logic of collect_longterm_unpinnable_folios(). Instead of comparing with prev_folio after calling pofs_get_folio(), we can check whether the next page is within the same folio. The performance test results, based on v6.15, obtained through the gup_test tool from the kernel source tree are as follows. We achieve an improvement of over 66% for large folio with pagesize=2M. For small folio, we have only observed a very slight degradation in performance. Without this patch: [root@localhost ~] ./gup_test -HL -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:14391 put:10858 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 [root@localhost ~]# ./gup_test -LT -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:130538 put:31676 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 With this patch: [root@localhost ~] ./gup_test -HL -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:4867 put:10516 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 [root@localhost ~]# ./gup_test -LT -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:131798 put:31328 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 [lizhe.67@bytedance.com: whitespace fix, per David] Link: https://lkml.kernel.org/r/20250606091917.91384-1-lizhe.67@bytedance.com Link: https://lkml.kernel.org/r/20250606023742.58344-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/pagewalk: split walk_page_range_novma() into kernel/user partsLorenzo Stoakes
walk_page_range_novma() is rather confusing - it supports two modes, one used often, the other used only for debugging. The first mode is the common case of traversal of kernel page tables, which is what nearly all callers use this for. Secondly it provides an unusual debugging interface that allows for the traversal of page tables in a userland range of memory even for that memory which is not described by a VMA. It is far from certain that such page tables should even exist, but perhaps this is precisely why it is useful as a debugging mechanism. As a result, this is utilised by ptdump only. Historically, things were reversed - ptdump was the only user, and other parts of the kernel evolved to use the kernel page table walking here. Since we have some complicated and confusing locking rules for the novma case, it makes sense to separate the two usages into their own functions. Doing this also provide self-documentation as to the intent of the caller - are they doing something rather unusual or are they simply doing a standard kernel page table walk? We therefore establish two separate functions - walk_page_range_debug() for this single usage, and walk_kernel_page_table_range() for general kernel page table walking. The walk_page_range_debug() function is currently used to traverse both userland and kernel mappings, so we maintain this and in the case of kernel mappings being traversed, we have walk_page_range_debug() invoke walk_kernel_page_table_range() internally. We additionally make walk_page_range_debug() internal to mm. Link: https://lkml.kernel.org/r/20250605135104.90720-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Barry Song <baohua@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/memfd: clarify error handling labels in memfd_create()Ye Liu
err_name --> err_free_name (fd failure case) err_fd --> err_free_fd (file failure case) Link: https://lkml.kernel.org/r/20250610083730.527619-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09lib/test_hmm: reduce stack usageArnd Bergmann
The various test ioctl handlers use arrays of 64 integers that add up to 1KiB of stack data, which in turn leads to exceeding the warning limit in some configurations: lib/test_hmm.c:935:12: error: stack frame size (1408) exceeds limit (1280) in 'dmirror_migrate_to_device' [-Werror,-Wframe-larger-than] Use half the size for these arrays, in order to stay under the warning limits. The code can already deal with arbitrary lengths, but this may be a little less efficient. Link: https://lkml.kernel.org/r/20250610092159.2639515-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeff Johnson <jeff.johnson@oss.qualcomm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09selftests/mm: check for YAMA ptrace_scope configuraiton before modifying itMark Brown
When running the memfd_secret test run_vmtests.sh unconditionally tries to confgiure the YAMA LSM's ptrace_scope configuration, leading to an error if YAMA is not in the running kernel: # ./run_vmtests.sh: line 432: /proc/sys/kernel/yama/ptrace_scope: No such file or directory # # ---------------------- # # running ./memfd_secret # # ---------------------- Check that this file is present before trying to write to it. The indentation here is a bit odd, and it doesn't seem great that we configure but don't restore ptrace_scope. Link: https://lkml.kernel.org/r/20250610-selftest-mm-enable-yama-v1-1-0097b6713116@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09selftests/mm: add messages about test errors to the cow testsMark Brown
It is not sufficiently clear what the individual tests in the cow test program are checking so add messages for the failure cases. Link: https://lkml.kernel.org/r/20250610-selftest-mm-cow-tweaks-v1-4-43cd7457500f@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09selftests/mm: don't compare return values to in cowMark Brown
Tweak the coding style for checking for non-zero return values. While we're at it also remove a now redundant oring of the madvise() return code. Link: https://lkml.kernel.org/r/20250610-selftest-mm-cow-tweaks-v1-3-43cd7457500f@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09selftests/mm: convert some cow error reports to ksft_perror()Mark Brown
This prints the errno and a string decode of it. Link: https://lkml.kernel.org/r/20250610-selftest-mm-cow-tweaks-v1-2-43cd7457500f@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09kselftest/mm: clarify errors for pipe()Mark Brown
Patch series "selftests/mm: Tweaks to the cow test". A collection of non-functional updates from David Hildenbrand's review. This patch (of 4): Specify that errors reported from pipe() failures are the result of failures. Link: https://lkml.kernel.org/r/20250610-selftest-mm-cow-tweaks-v1-0-43cd7457500f@kernel.org Link: https://lkml.kernel.org/r/20250610-selftest-mm-cow-tweaks-v1-1-43cd7457500f@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Mark Brown <broonie@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09alloc_tag: remove empty module tag sectionCasey Chen
The empty MOD_CODETAG_SECTIONS() macro added an incomplete .data section in module linker script, which caused symbol lookup tools like gdb to misinterpret symbol addresses e.g., __ib_process_cq incorrectly mapping to unrelated functions like below. (gdb) disas __ib_process_cq Dump of assembler code for function trace_event_fields_cq_schedule: Removing the empty section restores proper symbol resolution and layout, ensuring .data placement behaves as expected. Link: https://lkml.kernel.org/r/20250610162258.324645-1-cachen@purestorage.com Fixes: 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous memory") 22d407b164ff ("lib: add allocation tagging support for memory allocation profiling") Signed-off-by: Casey Chen <cachen@purestorage.com> Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/filemap: allow arch to request folio size for exec memoryRyan Roberts
Change the readahead config so that if it is being requested for an executable mapping, do a synchronous read into a set of folios with an arch-specified order and in a naturally aligned manner. We no longer center the read on the faulting page but simply align it down to the previous natural boundary. Additionally, we don't bother with an asynchronous part. On arm64 if memory is physically contiguous and naturally aligned to the "contpte" size, we can use contpte mappings, which improves utilization of the TLB. When paired with the "multi-size THP" feature, this works well to reduce dTLB pressure. However iTLB pressure is still high due to executable mappings having a low likelihood of being in the required folio size and mapping alignment, even when the filesystem supports readahead into large folios (e.g. XFS). The reason for the low likelihood is that the current readahead algorithm starts with an order-0 folio and increases the folio order by 2 every time the readahead mark is hit. But most executable memory tends to be accessed randomly and so the readahead mark is rarely hit and most executable folios remain order-0. So let's special-case the read(ahead) logic for executable mappings. The trade-off is performance improvement (due to more efficient storage of the translations in iTLB) vs potential for making reclaim more difficult (due to the folios being larger so if a part of the folio is hot the whole thing is considered hot). But executable memory is a small portion of the overall system memory so I doubt this will even register from a reclaim perspective. I've chosen 64K folio size for arm64 which benefits both the 4K and 16K base page size configs. Crucially the same amount of data is still read (usually 128K) so I'm not expecting any read amplification issues. I don't anticipate any write amplification because text is always RO. Note that the text region of an ELF file could be populated into the page cache for other reasons than taking a fault in a mmapped area. The most common case is due to the loader read()ing the header which can be shared with the beginning of text. So some text will still remain in small folios, but this simple, best effort change provides good performance improvements as is. Confine this special-case approach to the bounds of the VMA. This prevents wasting memory for any padding that might exist in the file between sections. Previously the padding would have been contained in order-0 folios and would be easy to reclaim. But now it would be part of a larger folio so more difficult to reclaim. Solve this by simply not reading it into memory in the first place. Benchmarking ============ The below shows pgbench and redis benchmarks on Graviton3 arm64 system. First, confirmation that this patch causes more text to be contained in 64K folios: +----------------------+---------------+---------------+---------------+ | File-backed folios by| system boot | pgbench | redis | | size as percentage of+-------+-------+-------+-------+-------+-------+ | all mapped text mem |before | after |before | after |before | after | +======================+=======+=======+=======+=======+=======+=======+ | base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% | | thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% | | thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% | | thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% | | thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% | | thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% | | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% | | thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% | | thp-partial | 0% | 0% | 0% | 1% | 0% | 1% | +----------------------+-------+-------+-------+-------+-------+-------+ | cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% | +----------------------+-------+-------+-------+-------+-------+-------+ The above shows that for both workloads (each isolated with cgroups) as well as the general system state after boot, the amount of text backed by 4K and 16K folios reduces and the amount backed by 64K folios increases significantly. And the amount of text that is contpte-mapped significantly increases (see last row). And this is reflected in performance improvement. "(I)" indicates a statistically significant improvement. Note TPS and Reqs/sec are rates so bigger is better, ms is time so smaller is better: +-------------+-------------------------------------------+------------+ | Benchmark | Result Class | Improvemnt | +=============+===========================================+============+ | pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% | | | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% | | | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% | | | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% | | | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% | | | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% | | | Scale: 100 Clients: 1 RO (TPS) | 2.51% | | | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% | | | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% | | | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% | | | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% | | | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% | +-------------+-------------------------------------------+------------+ | pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% | | | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% | | | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% | | | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% | | | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% | | | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% | | | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% | | | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% | | | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% | | | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% | +-------------+-------------------------------------------+------------+ [ryan.roberts@arm.com: fix use-after-free] Link: https://lkml.kernel.org/r/ea7f9da7-9a9f-4b85-9d0a-35b320f5ed25@arm.com [ryan.roberts@arm.com: use the vma_pages() helper instead of open-coding] Link: https://lkml.kernel.org/r/0e0f674b-3b7e-494f-ae7a-fc9dbb98dad4@arm.com Link: https://lkml.kernel.org/r/20250609092729.274960-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Will Deacon <will@kernel.org> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: store folio order in struct file_ra_stateRyan Roberts
Previously the folio order of the previous readahead request was inferred from the folio who's readahead marker was hit. But due to the way we have to round to non-natural boundaries sometimes, this first folio in the readahead block is often smaller than the preferred order for that request. This means that for cases where the initial sync readahead is poorly aligned, the folio order will ramp up much more slowly. So instead, let's store the order in struct file_ra_state so we are not affected by any required alignment. We previously made enough room in the struct for a 16 order field. This should be plenty big enough since we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never larger than ~20. Since we now pass order in struct file_ra_state, page_cache_ra_order() no longer needs it's new_order parameter, so let's remove that. Worked example: Here we are touching pages 17-256 sequentially just as we did in the previous commit, but now that we are remembering the preferred order explicitly, we no longer have the slow ramp up problem. Note specifically that we no longer have 2 rounds (2x ~128K) of order-2 folios: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00050000 65536 64 80 16 4 FOLIO 0x00050000 0x00060000 65536 80 96 16 4 FOLIO 0x00060000 0x00080000 131072 96 128 32 5 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5 FOLIO 0x000e0000 0x00100000 131072 224 256 32 5 FOLIO 0x00100000 0x00120000 131072 256 288 32 5 FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y HOLE 0x00140000 0x00800000 7077888 320 2048 1728 Link: https://lkml.kernel.org/r/20250609092729.274960-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>