summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2025-05-11mm,hugetlb: allocate frozen pages in alloc_buddy_hugetlb_folioOscar Salvador
alloc_buddy_hugetlb_folio() allocates a rmappable folio, then strips the rmappable part and freezes it. We can simplify all that by allocating frozen pages directly. Link: https://lkml.kernel.org/r/20250411132359.312708-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11vmalloc: use atomic_long_add_return_relaxed()Uladzislau Rezki (Sony)
Switch from the atomic_long_add_return() to its relaxed version. We do not need a full memory barrier or any memory ordering during increasing the "vmap_lazy_nr" variable. What we only need is to do it atomically. This is what atomic_long_add_return_relaxed() guarantees. AARCH64: <snip> Default: 40ec: d34cfe94 lsr x20, x20, #12 40f0: 14000044 b 4200 <free_vmap_area_noflush+0x19c> 40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc> 40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area> 40fc: 91000000 add x0, x0, #0x0 4100: f8f40016 ldaddal x20, x22, [x0] 4104: 8b160296 add x22, x20, x22 Relaxed: 40ec: d34cfe94 lsr x20, x20, #12 40f0: 14000044 b 4200 <free_vmap_area_noflush+0x19c> 40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc> 40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area> 40fc: 91000000 add x0, x0, #0x0 4100: f8340016 ldadd x20, x22, [x0] 4104: 8b160296 add x22, x20, x22 <snip> Link: https://lkml.kernel.org/r/20250415112646.113091-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christop Hellwig <hch@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm, hugetlb: avoid passing a null nodemask when there is mbind policyOscar Salvador
Before trying to allocate a page, gather_surplus_pages() sets up a nodemask for the nodes we can allocate from, but instead of passing the nodemask down the road to the page allocator, it iterates over the nodes within that nodemask right there, meaning that the page allocator will receive a preferred_nid and a null nodemask. This is a problem when using a memory policy, because it might be that the page allocator ends up using a node as a fallback which is not represented in the policy. Avoid that by passing the nodemask directly to the page allocator, so it can filter out fallback nodes that are not part of the nodemask. Link: https://lkml.kernel.org/r/20250415121503.376811-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: optimize memcg_rstat_updatedShakeel Butt
Currently the kernel maintains the stats updates per-memcg which is needed to implement stats flushing threshold. On the update side, the update is added to the per-cpu per-memcg update of the given memcg and all of its ancestors. However when the given memcg has passed the flushing threshold, all of its ancestors should have passed the threshold as well. There is no need to traverse up the memcg tree to maintain the stats updates. Perf profile collected from our fleet shows that memcg_rstat_updated is one of the most expensive memcg function i.e. a lot of cumulative CPU is being spent on it. So, even small micro optimizations matter a lot. This patch is microbenchmarked with multiple instances of netperf on a single machine with locally running netserver and we see couple of percentage of improvement. Link: https://lkml.kernel.org/r/20250410025752.92159-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]SeongJae Park
MADV_DONTNEED[_LOCKED] handling for [process_]madvise() flushes tlb for each vma of each address range. Update the logic to do tlb flushes in a batched way. Initialize an mmu_gather object from do_madvise() and vector_madvise(), which are the entry level functions for [process_]madvise(), respectively. And pass those objects to the function for per-vma work, via madvise_behavior struct. Make the per-vma logic not flushes tlb on their own but just saves the tlb entries to the received mmu_gather object. For this internal logic change, make zap_page_range_single_batched() non-static and use it directly from madvise_dontneed_single_vma(). Finally, the entry level functions flush the tlb entries that gathered for the entire user request, at once. Link: https://lkml.kernel.org/r/20250410000022.1901-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/memory: split non-tlb flushing part from zap_page_range_single()SeongJae Park
Some of zap_page_range_single() callers such as [process_]madvise() with MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because zap_page_range_single() flushes tlb for each invocation. Split out the body of zap_page_range_single() except mmu_gather object initialization and gathered tlb entries flushing for such batched tlb flushing usage. To avoid hugetlb pages allocation failures from concurrent page faults, the tlb flush should be done before hugetlb faults unlocking, though. Do the flush and the unlock inside the split out function in the order for hugetlb vma case. Refer to commit 2820b0f09be9 ("hugetlbfs: close race between MADV_DONTNEED and page fault") for more details about the concurrent faults' page allocation failure problem. Link: https://lkml.kernel.org/r/20250410000022.1901-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/madvise: batch tlb flushes for MADV_FREESeongJae Park
MADV_FREE handling for [process_]madvise() flushes tlb for each vma of each address range. Update the logic to do tlb flushes in a batched way. Initialize an mmu_gather object from do_madvise() and vector_madvise(), which are the entry level functions for [process_]madvise(), respectively. And pass those objects to the function for per-vma work, via madvise_behavior struct. Make the per-vma logic not flushes tlb on their own but just saves the tlb entries to the received mmu_gather object. Finally, the entry level functions flush the tlb entries that gathered for the entire user request, at once. Link: https://lkml.kernel.org/r/20250410000022.1901-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()SeongJae Park
Patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE", v3. When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or MADV_FREE with multiple address ranges, tlb flushes happen for each of the given address ranges. Because such tlb flushes are for the same process, doing those in a batch is more efficient while still being safe. Modify process_madvise() entry level code path to do such batched tlb flushes, while the internal unmap logic do only gathering of the tlb entries to flush. In more detail, modify the entry functions to initialize an mmu_gather object and pass it to the internal logic. And make the internal logic do only gathering of the tlb entries to flush into the received mmu_gather object. After all internal function calls are done, the entry functions flush the gathered tlb entries at once. Because process_madvise() and madvise() share the internal unmap logic, make same change to madvise() entry code together, to make code consistent and cleaner. It is only for keeping the code clean, and shouldn't degrade madvise(). It could rather provide a potential tlb flushes reduction benefit for a case that there are multiple vmas for the given address range. It is only a side effect from an effort to keep code clean, so we don't measure it separately. Similar optimizations might be applicable to other madvise behavior such as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope of this patch series, though. Patches Sequence ================ The first patch defines a new data structure for managing information that is required for batched tlb flushes (mmu_gather and behavior), and update code paths for MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal logic to receive it. The second patch batches tlb flushes for MADV_FREE handling for both madvise() and process_madvise(). Remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes batching. The third patch splits zap_page_range_single() for batching of MADV_DONTNEED[_LOCKED] handling. The fourth patch batches tlb flushes for the hint using the sub-logic that the third patch split out, and the helpers for batched tlb flushes that introduced for the MADV_FREE case, by the second patch. Test Results ============ I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory using multiple process_madvise() calls. I apply the advice in 4 KiB sized regions granularity, but with varying batch size per process_madvise() call (vlen) from 1 to 1024. The source code for the measurement is available at GitHub[1]. To reduce measurement errors, I did the measurement five times. The measurement results are as below. 'sz_batch' column shows the batch size of process_madvise() calls. 'Before' and 'After' columns show the average of latencies in nanoseconds that measured five times on kernels that built without and with the tlb flushes batching of this series (patches 3 and 4), respectively. For the baseline, mm-new tree of 2025-04-09[2] has been used, after reverting the second version of this patch series and adding a temporal fix for !CONFIG_DEBUG_VM build failure[3]. 'B-stdev' and 'A-stdev' columns show ratios of latency measurements standard deviation to average in percent for 'Before' and 'After', respectively. 'Latency_reduction' shows the reduction of the latency that the 'After' has achieved compared to 'Before', in percent. Higher 'Latency_reduction' values mean more efficiency improvements. sz_batch Before B-stdev After A-stdev Latency_reduction 1 146386348 2.78 111327360.6 3.13 23.95 2 108222130 1.54 72131173.6 2.39 33.35 4 93617846.8 2.76 51859294.4 2.50 44.61 8 80555150.4 2.38 44328790 1.58 44.97 16 77272777 1.62 37489433.2 1.16 51.48 32 76478465.2 2.75 33570506 3.48 56.10 64 75810266.6 1.15 27037652.6 1.61 64.34 128 73222748 3.86 25517629.4 3.30 65.15 256 72534970.8 2.31 25002180.4 0.94 65.53 512 71809392 5.12 24152285.4 2.41 66.37 1024 73281170.2 4.53 24183615 2.09 67.00 Unexpectedly the latency has reduced (improved) even with batch size one. I think some of compiler optimizations have affected that, like also observed with the first version of this patch series. So, please focus on the proportion between the improvement and the batch size. As expected, tlb flushes batching provides latency reduction that proportional to the batch size. The efficiency gain ranges from about 33 percent with batch size 2, and up to 67 percent with batch size 1,024. Please note that this is a very simple microbenchmark, so real efficiency gain on real workload could be very different. This patch (of 4): To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and MADV_FREE, an mmu_gather object in addition to the behavior integer need to be passed to the internal logics. Using a struct can make it easy without increasing the number of parameters of all code paths towards the internal logic. Define a struct for the purpose and use it on the code path that starts from madvise_do_behavior() and ends on madvise_dontneed_free(). Note that this changes madvise_walk_vmas() visitor type signature, too. Specifically, it changes its 'arg' type from 'unsigned long' to the new struct pointer. Link: https://lkml.kernel.org/r/20250410000022.1901-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250410000022.1901-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liam R. Howlett <howlett@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: huge_memory: add folio_mark_accessed() when zapping file THPBaolin Wang
When investigating performance issues during file folio unmap, I noticed some behavioral differences in handling non-PMD-sized folios and PMD-sized folios. For non-PMD-sized file folios, it will call folio_mark_accessed() to mark the folio as having seen activity, but this is not done for PMD-sized folios. This might not cause obvious issues, but a potential problem could be that, it might lead to reclaim of hot file folios under memory pressure, as quoted from Johannes: : Sometimes file contents are only accessed through relatively short-lived : mappings. But they can nevertheless be accessed a lot and be hot. It's : important to not lose that information on unmap, and end up kicking out a : frequently used cache page. Therefore, we should also add folio_mark_accessed() for PMD-sized file folios when unmapping. [baolin.wang@linux.alibaba.com: add comment] Link: https://lkml.kernel.org/r/23fdc11d-e983-4627-89a8-79e9ecf9a45a@linux.alibaba.com Link: https://lkml.kernel.org/r/fc117f60d7b686f87067f36a0ef7cdbc3a78109c.1744190345.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Barry Song <21cnbao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/vma: fix incorrectly disallowed anonymous VMA mergesLorenzo Stoakes
Patch series "fix incorrectly disallowed anonymous VMA merges", v2. It appears that we have been incorrectly rejecting merge cases for 15 years, apparently by mistake. Imagine a range of anonymous mapped momemory divided into two VMAs like this, with incompatible protection bits: RW RWX unfaulted faulted |-----------|-----------| | prev | vma | |-----------|-----------| mprotect(RW) Now imagine mprotect()'ing vma so it is RW. This appears as if it should merge, it does not. Neither does this case, again mprotect()'ing vma RW: RWX RW faulted unfaulted |-----------|-----------| | vma | next | |-----------|-----------| mprotect(RW) Nor: RW RWX RW unfaulted faulted unfaulted |-----------|-----------|-----------| | prev | vma | next | |-----------|-----------|-----------| mprotect(RW) What's going on here? In commit 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue"), from 2010, Rik von Riel took careful care to account for these cases - commenting that '[this is] easily overlooked: when mprotect shifts the boundary, make sure the expanding vma has anon_vma set if the shrinking vma had, to cover any anon pages imported.' However, commit 965f55dea0e3 ("mmap: avoid merging cloned VMAs") introduced a little over a year later, appears to have accidentally disallowed this. By adjusting the is_mergeable_anon_vma() function to avoid lock contention across large trees of forked anon_vma's, this commit wrongly assumed the VMA being checked (the ostensible merge 'target') should be faulted, that is, have an anon_vma, and thus an anon_vma_chain list established, but only of length 1. This appears to have been unintentional, as disallowing empty target VMAs like this across the board makes no sense. We already have logic that accounts for this case, the same logic Rik introduced in 2010, now via dup_anon_vma() (and ultimately anon_vma_clone()), so there is no problem permitting this. This series fixes this mistake and also ensures that scalability concerns remain addressed by explicitly checking that whatever VMA is being merged has not been forked. A full set of self tests which reproduce the issue are provided, as well as updating userland VMA tests to assert this behaviour. The self tests additionally assert scalability concerns are addressed. This patch (of 3): anon_vma_chain's were introduced by Rik von Riel in commit 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue"). This patch was introduced in March 2010. As part of this change, careful attention was made to the instance of mprotect() causing a VMA merge, with one faulted (i.e. having anon_vma set) and another not: /* * Easily overlooked: when mprotect shifts the boundary, * make sure the expanding vma has anon_vma set if the * shrinking vma had, to cover any anon pages imported. */ In the modern VMA code, this is handled in dup_anon_vma() (and ultimately anon_vma_clone()). This case is one of the three configurations of adjacent VMA anon_vma state that we might encounter on merge (where dst is the VMA which will be merged into and src the one being merged into dst): 1. dst->anon_vma, src->anon_vma - These must be equal, no-op. 2. dst->anon_vma, !src->anon_vma - We simply use dst->anon_vma, no-op. 3. !dst->anon_vma, src->anon_vma - The case in question here. In case 3, the instance addressed here - we duplicate the AVC connections from src and place into dst. However, in practice, we very often do NOT do this. This appears to be due to an inadvertent consequence of the change introduced by commit 965f55dea0e3 ("mmap: avoid merging cloned VMAs"), introduced in May 2011. This implies that this merge case was functional only for a little over a year, and has since been broken for ~15 years. Here, lock scalability concerns lead to us restricting anonymous merges only to those VMAs with 1 entry in their vma->anon_vma_chain, that is, a VMA that is not connected to any parent process's anon_vma. The mergeability test looks like this: static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1, struct anon_vma *anon_vma2, struct vm_area_struct *vma) { if ((!anon_vma1 || !anon_vma2) && (!vma || !vma->anon_vma || list_is_singular(&vma->anon_vma_chain))) return true; return anon_vma1 == anon_vma2; } However, we have a problem here - typically the vma passed here is the destination VMA. For instance in vma_merge_existing_range() we invoke: can_vma_merge_left() -> [ check that there is an immediately adjacent prior VMA ] -> can_vma_merge_after() -> is_mergeable_vma() for general attribute check -> is_mergeable_anon_vma([ proposed anon_vma ], prev->anon_vma, prev) So if we were considering a target unfaulted 'prev': unfaulted faulted |-----------|-----------| | prev | vma | |-----------|-----------| This would call is_mergeable_anon_vma(NULL, vma->anon_vma, prev). The list_is_singular() check for vma->anon_vma_chain, an empty list on fault, would cause this merge to _fail_ even though all else indicates a merge. Equally a simple merge into a next VMA would hit the same problem: faulted unfaulted |-----------|-----------| | vma | next | |-----------|-----------| can_vma_merge_right() -> [ check that there is an immediately adjacent succeeding VMA ] -> can_vma_merge_before() -> is_mergeable_vma() for general attribute check -> is_mergeable_anon_vma([ proposed anon_vma ], next->anon_vma, next) For a 3-way merge, we'd also hit the same problem if it was configured like this for instance: unfaulted faulted unfaulted |-----------|-----------|-----------| | prev | vma | next | |-----------|-----------|-----------| As we'd call can_vma_merge_left() for prev, and can_vma_merge_right() for next, both of which would fail. vma_merge_new_range() (and relatedly, vma_expand()) are not impacted, as the new VMA would never already be faulted (it is a proposed new range). Because we already handle each of the aforementioned merge cases, and can absolutely therefore deal with an existing VMA merge with !dst->anon_vma, src->anon_vma, there is absolutely no reason to disallow this kind of merge. It seems that the intention of this patch is to ensure that, in the instance of merging unfaulted VMAs with faulted ones, we never wish to do so with those with multiple AVCs due to the fact that anon_vma lock's are held across both parent and child anon_vma's (actually, the 'root' parent anon_vma's lock is used). In fact, the original commit alludes to this - "find_mergeable_anon_vma() already considers this case". In find_mergeable_anon_vma() however, we check the anon_vma which will be merged from, if it is set, then we check list_is_singular(vma->anon_vma_chain). So to match this logic, update is_mergeable_anon_vma() to perform this scalability check on the VMA whose anon_vma we ultimately merge into. This matches existing behaviour with forked VMAs, only we no longer wrongly disallow ALL empty target merges. So we both allow merge cases and ensure the scalability check is correctly applied. We may wish to revisit these lock scalability concerns at a later date and ensure they are still valid. Additionally, correct userland VMA tests which were mistakenly not asserting these cases correctly previously to now correctly assert this, and to ensure vmg->anon_vma state is always consistent to account for newly introduced asserts. Link: https://lkml.kernel.org/r/cover.1744104124.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/18c756fc9eaf7ad082a710c91133b8346f8cd9a8.1744104124.git.lorenzo.stoakes@oracle.com Fixes: 965f55dea0e3 ("mmap: avoid merging cloned VMAs") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11vmalloc: use for_each_vmap_node() in purge-vmap-areaUladzislau Rezki (Sony)
Update a __purge_vmap_area_lazy() to use introduced helper. This is last place in vmalloc code. Also this patch introduces an extra function which is node_to_id() that converts a vmap_node pointer to an index in array. __purge_vmap_area_lazy() requires that extra function. Link: https://lkml.kernel.org/r/20250408151549.77937-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christop Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11vmalloc: switch to for_each_vmap_node() helperUladzislau Rezki (Sony)
There are places which can be updated easily to use the helper to iterate over all vmap-nodes. This is what this patch does. The aim is to improve readability and simplify the code. [akpm@linux-foundation.org: fix build warning] Link: https://lkml.kernel.org/r/20250408151549.77937-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christop Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11vmalloc: add for_each_vmap_node() helperUladzislau Rezki (Sony)
To simplify iteration over vmap-nodes, add the for_each_vmap_node() macro that iterates over all nodes in a system. It tends to simplify the code. Link: https://lkml.kernel.org/r/20250408151549.77937-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christop Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/ptdump: split effective_prot() into level specific callbacksAnshuman Khandual
Last argument in effective_prot() is u64 assuming pxd_val() returned value (all page table levels) is 64 bit. pxd_val() is very platform specific and its type should not be assumed in generic MM. Split effective_prot() into individual page table level specific callbacks which accepts corresponding pxd_t argument instead and then the subscribing platform (only x86) just derive pxd_val() from the entries as required and proceed as earlier. Link: https://lkml.kernel.org/r/20250407053113.746295-3-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/ptdump: split note_page() into level specific callbacksAnshuman Khandual
Patch series "mm/ptdump: Drop assumption that pxd_val() is u64", v2. Last argument passed down in note_page() is u64 assuming pxd_val() returned value (all page table levels) is 64 bit - which might not be the case going ahead when D128 page tables is enabled on arm64 platform. Besides pxd_val() is very platform specific and its type should not be assumed in generic MM. A similar problem exists for effective_prot(), although it is restricted to x86 platform. This series splits note_page() and effective_prot() into individual page table level specific callbacks which accepts corresponding pxd_t page table entry as an argument instead and later on all subscribing platforms could derive pxd_val() from the table entries as required and proceed as before. Define ptdesc_t type which describes the basic page table descriptor layout on arm64 platform. Subsequently all level specific pxxval_t descriptors are derived from ptdesc_t thus establishing a common original format, which can also be appropriate for page table entries, masks and protection values etc which are used at all page table levels. This patch (of 3): Last argument passed down in note_page() is u64 assuming pxd_val() returned value (all page table levels) is 64 bit - which might not be the case going ahead when D128 page tables is enabled on arm64 platform. Besides pxd_val() is very platform specific and its type should not be assumed in generic MM. Split note_page() into individual page table level specific callbacks which accepts corresponding pxd_t argument instead and then subscribing platforms just derive pxd_val() from the entries as required and proceed as earlier. Also add a note_page_flush() callback for flushing the last page table page that was being handled earlier via level = -1. Link: https://lkml.kernel.org/r/20250407053113.746295-1-anshuman.khandual@arm.com Link: https://lkml.kernel.org/r/20250407053113.746295-2-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: page_alloc: tighten up find_suitable_fallback()Johannes Weiner
find_suitable_fallback() is not as efficient as it could be, and somewhat difficult to follow. 1. should_try_claim_block() is a loop invariant. There is no point in checking fallback areas if the caller is interested in claimable blocks but the order and the migratetype don't allow for that. 2. __rmqueue_steal() doesn't care about claimability, so it shouldn't have to run those tests. Different callers want different things from this helper: 1. __compact_finished() scans orders up until it finds a claimable block 2. __rmqueue_claim() scans orders down as long as blocks are claimable 3. __rmqueue_steal() doesn't care about claimability at all Move should_try_claim_block() out of the loop. Only test it for the two callers who care in the first place. Distinguish "no blocks" from "order + mt are not claimable" in the return value; __rmqueue_claim() can stop once order becomes unclaimable, __compact_finished() can keep advancing until order becomes claimable. Before: Performance counter stats for './run case-lru-file-mmap-read' (5 runs): 85,294.85 msec task-clock # 5.644 CPUs utilized ( +- 0.32% ) 15,968 context-switches # 187.209 /sec ( +- 3.81% ) 153 cpu-migrations # 1.794 /sec ( +- 3.29% ) 801,808 page-faults # 9.400 K/sec ( +- 0.10% ) 733,358,331,786 instructions # 1.87 insn per cycle ( +- 0.20% ) (64.94%) 392,622,904,199 cycles # 4.603 GHz ( +- 0.31% ) (64.84%) 148,563,488,531 branches # 1.742 G/sec ( +- 0.18% ) (63.86%) 152,143,228 branch-misses # 0.10% of all branches ( +- 1.19% ) (62.82%) 15.1128 +- 0.0637 seconds time elapsed ( +- 0.42% ) After: Performance counter stats for './run case-lru-file-mmap-read' (5 runs): 84,380.21 msec task-clock # 5.664 CPUs utilized ( +- 0.21% ) 16,656 context-switches # 197.392 /sec ( +- 3.27% ) 151 cpu-migrations # 1.790 /sec ( +- 3.28% ) 801,703 page-faults # 9.501 K/sec ( +- 0.09% ) 731,914,183,060 instructions # 1.88 insn per cycle ( +- 0.38% ) (64.90%) 388,673,535,116 cycles # 4.606 GHz ( +- 0.24% ) (65.06%) 148,251,482,143 branches # 1.757 G/sec ( +- 0.37% ) (63.92%) 149,766,550 branch-misses # 0.10% of all branches ( +- 1.22% ) (62.88%) 14.8968 +- 0.0486 seconds time elapsed ( +- 0.33% ) Link: https://lkml.kernel.org/r/20250407180154.63348-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Brendan Jackman <jackmanb@google.com> Tested-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Carlos Song <carlos.song@nxp.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/debug: fix parameter passed to page_mapcount_is_type()Gavin Shan
As the comments of page_mapcount_is_type() indicate, the parameter passed to the function should be one more than page->_mapcount. However, page->_mapcount is passed to the function by commit 4ffca5a96678 ("mm: support only one page_type per page") where page_type_has_type() is replaced by page_mapcount_is_type(), but the parameter isn't adjusted. Fix the parameter for page_mapcount_is_type() to be (page->__mapcount + 1). Note that the issue doesn't cause any visible impacts due to the safety gap introduced by PGTY_mapcount_underflow limit. [akpm@linux-foundation.org: simplify __dump_folio(), per David] Link: https://lkml.kernel.org/r/20250321120222.1456770-3-gshan@redhat.com Fixes: 4ffca5a96678 ("mm: support only one page_type per page") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: gehao <gehao@kylinos.cn> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11zsmalloc: cleanup headers includesSergey Senozhatsky
Remove unused headers includes from zsmalloc and move pagemap.h and migrate.h includes into zpdesc header. Link: https://lkml.kernel.org/r/20250325080427.3449359-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: add kernel-doc comment for free_pgd_range()SoumishDas
Provide kernel-doc for free_pgd_range() so it's easier to understand what the function does and how it is used. Link: https://lkml.kernel.org/r/20250325181325.5774-1-soumish.das@gmail.com Signed-off-by: SoumishDas <soumish.das@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: swap: replace cluster_swap_free_nr() with swap_entries_put_[map/cache]()Kemeng Shi
Replace cluster_swap_free_nr() with swap_entries_put_[map/cache]() to remove repeat code and leverage batch-remove for entries with last flag. After removing cluster_swap_free_nr, only functions with "_nr" suffix could free entries spanning cross clusters. Add corresponding description in comment of swap_entries_put_map_nr() as is first function with "_nr" suffix and have a non-suffix variant function swap_entries_put_map(). Link: https://lkml.kernel.org/r/20250325162528.68385-9-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: swap: factor out helper to drop cache of entries within a single clusterKemeng Shi
Factor out helper swap_entries_put_cache() from put_swap_folio() to serve as a general-purpose routine for dropping cache flag of entries within a single cluster. Link: https://lkml.kernel.org/r/20250325162528.68385-8-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: swap: free each cluster individually in swap_entries_put_map_nr()Kemeng Shi
1. Factor out general swap_entries_put_map() helper to drop entries belonging to one cluster. If entries are last map, free entries in batch, otherwise put entries with cluster lock acquired and released only once. 2. Iterate and call swap_entries_put_map() for each cluster in swap_entries_put_nr() to leverage batch-remove for last map belonging to one cluster and reduce lock acquire/release in fallback case. 3. As swap_entries_put_nr() won't handle SWAP_HSA_CACHE drop, rename it to swap_entries_put_map_nr(). 4. As we won't drop each entry invidually with swap_entry_put() now, do reclaim in free_swap_and_cache_nr() because swap_entries_put_map_nr() is general routine to drop reference and the relcaim work should only be done in free_swap_and_cache_nr(). Remove stale comment accordingly. Link: https://lkml.kernel.org/r/20250325162528.68385-7-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: swap: drop last SWAP_MAP_SHMEM flag in batch in swap_entries_put_nr()Kemeng Shi
The SWAP_MAP_SHMEM indicates last map from shmem. Therefore we can drop SWAP_MAP_SHMEM in batch in similar way to drop last ref count in batch. Link: https://lkml.kernel.org/r/20250325162528.68385-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: swap: use swap_entries_free() drop last ref count in swap_entries_put_nr()Kemeng Shi
Use swap_entries_free() to directly free swap entries when the swap entries are not cached and referenced, without needing to set swap entries to set intermediate SWAP_HAS_CACHE state. Link: https://lkml.kernel.org/r/20250325162528.68385-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: swap: use swap_entries_free() to free swap entry in swap_entry_put_locked()Kemeng Shi
In swap_entry_put_locked(), we will set slot to SWAP_HAS_CACHE before using swap_entries_free() to do actual swap entry freeing. This introduce an unnecessary intermediate state. By using swap_entries_free() in swap_entry_put_locked(), we can eliminate the need to set slot to SWAP_HAS_CACHE. This change would make the behavior of swap_entry_put_locked() more consistent with other put() operations which will do actual free work after put last reference. Link: https://lkml.kernel.org/r/20250325162528.68385-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: swap: enable swap_entry_range_free() to drop any kind of last refKemeng Shi
The original VM_BUG_ON only allows swap_entry_range_free() to drop last SWAP_HAS_CACHE ref. By allowing other kind of last ref in VM_BUG_ON, swap_entry_range_free() could be a more general-purpose function able to handle all kind of last ref. Following thi change, also rename swap_entry_range_free() to swap_entries_free() and update it's comment accordingly. This is a preparation to use swap_entries_free() to drop more kind of last ref other than SWAP_HAS_CACHE. [shikemeng@huaweicloud.com: add __maybe_unused attribute for swap_is_last_ref() and update comment] Link: https://lkml.kernel.org/r/20250410153908.612984-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250325162528.68385-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Tested-by: SeongJae Park <sj@kernel.org> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: swap: rename __swap_[entry/entries]_free[_locked] to ↵Kemeng Shi
swap_[entry/entries]_put[_locked] Patch series "Minor cleanups and improvements to swap freeing code", v4. This series contains some cleanups and improvements which are made during learning swapfile. Here is a summary of the changes: 1. Function naming improvments. - Use "put" instead of "free" to name functions which only do actual free when count drops to zero. - Use "entry" to name function only frees one swap slot. Use "entries" to name function could may free multi swap slots within one cluster. Use "_nr" suffix to name function which could free multi swap slots spanning cross multi clusters. 2. Eliminate the need to set swap slot to intermediate SWAP_HAS_CACHE value before do actual free by using swap_entry_range_free() 3. Add helpers swap_entries_put_map() and swap_entries_put_cache() as a general-purpose routine to free swap entries within a single cluster which will try batch-remove first and fallback to put eatch entry indvidually with cluster lock acquired/released only once. By using these helpers, we could remove repeated code, levarage batch-remove in more cases and aoivd to acquire/release cluster lock for each single swap entry. This patch (of 8): In __swap_entry_free[_locked] and __swap_entries_free, we decrease count first and only free swap entry if count drops to zero. This behavior is more akin to a put() operation rather than a free() operation. Therefore, rename these functions with "put" instead of "free". Additionally, add "_nr" suffix to swap_entries_put to indicate the input range may span swap clusters. Link: https://lkml.kernel.org/r/20250325162528.68385-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250325162528.68385-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: manually inline replace_stock_objcgShakeel Butt
The replace_stock_objcg() is being called by only refill_obj_stock, so manually inline it. Link: https://lkml.kernel.org/r/20250404013913.1663035-10-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: combine slab obj stock charging and accountingVlastimil Babka
When handing slab objects, we use obj_cgroup_[un]charge() for (un)charging and mod_objcg_state() to account NR_SLAB_[UN]RECLAIMABLE_B. All these operations use the percpu stock for performance. However with the calls being separate, the stock_lock is taken twice in each case. By refactoring the code, we can turn mod_objcg_state() into __account_obj_stock() which is called on a stock that's already locked and validated. On the charging side we can call this function from consume_obj_stock() when it succeeds, and refill_obj_stock() in the fallback. We just expand parameters of these functions as necessary. The uncharge side from __memcg_slab_free_hook() is just the call to refill_obj_stock(). Other callers of obj_cgroup_[un]charge() (i.e. not slab) simply pass the extra parameters as NULL/zeroes to skip the __account_obj_stock() operation. In __memcg_slab_post_alloc_hook() we now charge each object separately, but that's not a problem as we did call mod_objcg_state() for each object separately, and most allocations are non-bulk anyway. This could be improved by batching all operations until slab_pgdat(slab) changes. Some preliminary benchmarking with a kfree(kmalloc()) loop of 10M iterations with/without __GFP_ACCOUNT: Before the patch: kmalloc/kfree !memcg: 581390144 cycles kmalloc/kfree memcg: 783689984 cycles After the patch: kmalloc/kfree memcg: 658723808 cycles More than half of the overhead of __GFP_ACCOUNT relative to non-accounted case seems eliminated. Link: https://lkml.kernel.org/r/20250404013913.1663035-9-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: use __mod_memcg_state in drain_obj_stockShakeel Butt
For non-PREEMPT_RT kernels, drain_obj_stock() is always called with irq disabled, so we can use __mod_memcg_state() instead of mod_memcg_state(). For PREEMPT_RT, we need to add memcg_stats_[un]lock in __mod_memcg_state(). Link: https://lkml.kernel.org/r/20250404013913.1663035-8-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: do obj_cgroup_put inside drain_obj_stockShakeel Butt
Previously we could not call obj_cgroup_put() inside the local lock because on the put on the last reference, the release function obj_cgroup_release() may try to re-acquire the local lock. However that chain has been broken. Now simply do obj_cgroup_put() inside drain_obj_stock() instead of returning the old objcg. Link: https://lkml.kernel.org/r/20250404013913.1663035-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: no refilling stock from obj_cgroup_releaseShakeel Butt
obj_cgroup_release is called when all the references to the objcg have been released i.e. no more memory objects are pointing to it. Most probably objcg->memcg will be pointing to some ancestor memcg. In obj_cgroup_release(), the kernel calls obj_cgroup_uncharge_pages() which refills the local stock. There is no need to refill the local stock with some ancestor memcg and flush the local stock. Let's decouple obj_cgroup_release() from the local stock by uncharging instead of refilling. One additional benefit of this change is that it removes the requirement to only call obj_cgroup_put() outside of local_lock. Link: https://lkml.kernel.org/r/20250404013913.1663035-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: manually inline __refill_stockShakeel Butt
There are no more multiple callers of __refill_stock(), so simply inline it to refill_stock(). Link: https://lkml.kernel.org/r/20250404013913.1663035-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: introduce memcg_unchargeShakeel Butt
At multiple places in memcontrol.c, the memory and memsw page counters are being uncharged. This is error-prone. Let's move the functionality to a newly introduced memcg_uncharge and call it from all those places. Link: https://lkml.kernel.org/r/20250404013913.1663035-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: decouple drain_obj_stock from local stockShakeel Butt
Currently drain_obj_stock() can potentially call __refill_stock which accesses local cpu stock and thus requires memcg stock's local_lock. However if we look at the code paths leading to drain_obj_stock(), there is never a good reason to refill the memcg stock at all from it. At the moment, drain_obj_stock can be called from reclaim, hotplug cpu teardown, mod_objcg_state() and refill_obj_stock(). For reclaim and hotplug there is no need to refill. For the other two paths, most probably the newly switched objcg would be used in near future and thus no need to refill stock with the older objcg. In addition, __refill_stock() from drain_obj_stock() happens on rare cases, so performance is not really an issue. Let's just uncharge directly instead of refill which will also decouple drain_obj_stock from local cpu stock and local_lock requirements. Link: https://lkml.kernel.org/r/20250404013913.1663035-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: remove root memcg check from refill_stockShakeel Butt
refill_stock can not be called with root memcg, so there is no need to check it. Instead add a warning if root is ever passed to it. Link: https://lkml.kernel.org/r/20250404013913.1663035-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20250404013913.1663035-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg: vmalloc: simplify MEMCG_VMALLOC updatesShakeel Butt
The vmalloc region can either be charged to a single memcg or none. At the moment kernel traverses all the pages backing the vmalloc region to update the MEMCG_VMALLOC stat. However there is no need to look at all the pages as all those pages will be charged to a single memcg or none. Simplify the MEMCG_VMALLOC update by just looking at the first page of the vmalloc region. [shakeel.butt@linux.dev: add comment] Link: https://lkml.kernel.org/r/bmlkdbqgwboyqrnxyom7n52fjmo76ux77jhqw5odc6c6dfon3h@zdylwtmlywbt Link: https://lkml.kernel.org/r/20250403053326.26860-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/compaction: reduce the difference between low and high watermarksMichal Clapinski
Reduce the diff between low and high watermarks when compaction proactiveness is set to high. This allows users who set the proactiveness really high to have more stable fragmentation score over time. Link: https://lkml.kernel.org/r/20250404111103.1994507-3-mclapinski@google.com Signed-off-by: Michal Clapinski <mclapinski@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/compaction: remove low watermark cap for proactive compactionMichal Clapinski
Patch series "mm/compaction: allow more aggressive proactive compaction", v4. Our goal is to keep memory usage of a VM low on the host. For that reason, we use free page reporting which by default reports free pages of order 9 and larger to the host to be freed. The feature works well only if the memory in the guest is not fragmented below pages of order 9. Proactive compaction can be reused to achieve defragmentation after some parameter tweaking. When the fragmentation score (lower is better) gets larger than the high watermark, proactive compaction kicks in. Compaction stops when the score goes below the low watermark (or no progress is made and backoff kicks in). Let's define the difference between high and low watermarks as leeway. Before these changes, the minimum possible value for low watermark was 5 and the leeway was hardcoded to 10 (so minimum possible value for high watermark was 15). To test this, I created a VM with 19GB of memory and free page reporting enabled. The VM was ~idle. I meassured the memory usage from inside the guest (/proc/meminfo) and from the host (provided by the hypervisor). Before: https://drive.google.com/file/d/1Xw23lRry_PgEH3f6QRnSGvoHh2u9UHyI/view?usp=sharing After: https://drive.google.com/file/d/1wMhpIzepx6t44F70yCPA50n1S5V2AT-a/view?usp=sharing This patch (of 2): Previously a min cap of 5 has been set in the commit introducing proactive compaction. This was to make sure users don't hurt themselves by setting the proactiveness to 100 and making their system unresponsive. But the compaction mechanism has a backoff mechanism that will sleep for 30s if no progress is made, so I don't see a significant risk here. My system (19GB of memory) has been perfectly fine with both watermarks hardcoded to 0. Link: https://lkml.kernel.org/r/20250404111103.1994507-1-mclapinski@google.com Link: https://lkml.kernel.org/r/20250404111103.1994507-2-mclapinski@google.com Signed-off-by: Michal Clapinski <mclapinski@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/page_alloc: simplify free_page_is_bad by removing free_page_is_bad_reportYe Liu
Refactor free_page_is_bad() to call bad_page() directly, removing the intermediate free_page_is_bad_report(). This reduces unnecessary indirection, improving code clarity and maintainability without changing functionality. Link: https://lkml.kernel.org/r/20250328012031.1204993-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/show_mem: optimize si_meminfo_node by reducing redundant codeYe Liu
Refactors the si_meminfo_node() function by reducing redundant code and improving readability. Moved the calculation of managed_pages inside the existing loop that processes pgdat->node_zones, eliminating the need for a separate loop. Simplified the logic by removing unnecessary preprocessor conditionals. Ensured that both totalram, totalhigh, and other memory statistics are consistently set without duplication. This change results in cleaner and more efficient code without altering functionality. Link: https://lkml.kernel.org/r/20250325073803.852594-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/compaction: use folio in hugetlb pathwayVishal Moola (Oracle)
Use a folio in the hugetlb pathway during the compaction migrate-able pageblock scan. This removes a call to compound_head(). Link: https://lkml.kernel.org/r/20250401021025.637333-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: page_alloc: remove redundant READ_ONCESongtang Liu
In the current code, batch is a local variable, and it cannot be concurrently modified. It's unnecessary to use READ_ONCE here, so remove it. Link: https://lkml.kernel.org/r/CAA=HWd1kn01ym8YuVFuAqK2Ggq3itEGkqX8T6eCXs_C7tiv-Jw@mail.gmail.com Fixes: 51a755c56dc0 ("mm: tune PCP high automatically") Signed-off-by: Songtang Liu <liusongtang@bytedance.com> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memcg, oom: do not bypass oom killer for dying tasksMichal Hocko
7775face2079 ("memcg: killed threads should not invoke memcg OOM killer") has added a bypass of the oom killer path for dying threads because a very specific workload (described in the changelog) could hit "no killable tasks" path. This itself is not fatal condition but it could be annoying if this was a common case. On the other hand the bypass has some issues on its own. Without triggering oom killer we won't be able to trigger async oom reclaim (oom_reaper) which can operate on killed tasks as well as long as they still have their mm available. This could be the case during futex cleanup when the memory as pointed out by Johannes in [1]. The said case is still not fully understood but let's drop this bypass that was mostly driven by an artificial workload and allow dying tasks to go into oom path. This will make the code easier to reason about and also help corner cases where oom_reaper could help to release memory. Link: https://lore.kernel.org/all/20241212183012.GB1026@cmpxchg.org/T/#u [1] Link: https://lkml.kernel.org/r/20250402090117.130245-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11zsmalloc: prefer the the original page's node for compressed dataNhat Pham
Currently, zsmalloc, zswap's and zram's backend memory allocator, does not enforce any policy for the allocation of memory for the compressed data, instead just adopting the memory policy of the task entering reclaim, or the default policy (prefer local node) if no such policy is specified. This can lead to several pathological behaviors in multi-node NUMA systems: 1. Systems with CXL-based memory tiering can encounter the following inversion with zswap/zram: the coldest pages demoted to the CXL tier can return to the high tier when they are reclaimed to compressed swap, creating memory pressure on the high tier. 2. Consider a direct reclaimer scanning nodes in order of allocation preference. If it ventures into remote nodes, the memory it compresses there should stay there. Trying to shift those contents over to the reclaiming thread's preferred node further *increases* its local pressure, and provoking more spills. The remote node is also the most likely to refault this data again. This undesirable behavior was pointed out by Johannes Weiner in [1]. 3. For zswap writeback, the zswap entries are organized in node-specific LRUs, based on the node placement of the original pages, allowing for targeted zswap writeback for specific nodes. However, the compressed data of a zswap entry can be placed on a different node from the LRU it is placed on. This means that reclaim targeted at one node might not free up memory used for zswap entries in that node, but instead reclaiming memory in a different node. All of these issues will be resolved if the compressed data go to the same node as the original page. This patch encourages this behavior by having zswap and zram pass the node of the original page to zsmalloc, and have zsmalloc prefer the specified node if we need to allocate new (zs)pages for the compressed data. Note that we are not strictly binding the allocation to the preferred node. We still allow the allocation to fall back to other nodes when the preferred node is full, or if we have zspages with slots available on a different node. This is OK, and still a strict improvement over the status quo: 1. On a system with demotion enabled, we will generally prefer demotions over compressed swapping, and only swap when pages have already gone to the lowest tier. This patch should achieve the desired effect for the most part. 2. If the preferred node is out of memory, letting the compressed data going to other nodes can be better than the alternative (OOMs, keeping cold memory unreclaimed, disk swapping, etc.). 3. If the allocation go to a separate node because we have a zspage with slots available, at least we're not creating extra immediate memory pressure (since the space is already allocated). 3. While there can be mixings, we generally reclaim pages in same-node batches, which encourage zspage grouping that is more likely to go to the right node. 4. A strict binding would require partitioning zsmalloc by node, which is more complicated, and more prone to regression, since it reduces the storage density of zsmalloc. We need to evaluate the tradeoff and benchmark carefully before adopting such an involved solution. [1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/ [senozhatsky@chromium.org: coding-style fixes] Link: https://lkml.kernel.org/r/mnvexa7kseswglcqbhlot4zg3b3la2ypv2rimdl5mh5glbmhvz@wi6bgqn47hge Link: https://lkml.kernel.org/r/20250402204416.3435994-1-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram, zsmalloc] Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> [zswap/zsmalloc] Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: add folio_mk_pmd()Matthew Wilcox (Oracle)
Removes five conversions from folio to page. Also removes both callers of mk_pmd() that aren't part of mk_huge_pmd(), getting us a step closer to removing the confusion between mk_pmd(), mk_huge_pmd() and pmd_mkhuge(). Link: https://lkml.kernel.org/r/20250402181709.2386022-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Muchun Song <muchun.song@linux.dev> Cc: Richard Weinberger <richard@nod.at> Cc: <x86@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: remove mk_huge_pte()Matthew Wilcox (Oracle)
The only remaining user of mk_huge_pte() is the debug code, so remove the API and replace its use with pfn_pte() which lets us remove the conversion to a page first. We should always call arch_make_huge_pte() to turn this PTE into a huge PTE before operating on it with huge_pte_mkdirty() etc. Link: https://lkml.kernel.org/r/20250402181709.2386022-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Muchun Song <muchun.song@linux.dev> Cc: Richard Weinberger <richard@nod.at> Cc: <x86@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11hugetlb: simplify make_huge_pte()Matthew Wilcox (Oracle)
mk_huge_pte() is a bad API. Despite its name, it creates a normal PTE which is later transformed into a huge PTE by arch_make_huge_pte(). So replace the page argument with a folio argument and call folio_mk_pte() instead. Then, because we now know this is a regular PTE rather than a huge one, use pte_mkdirty() instead of huge_pte_mkdirty() (and similar functions). Link: https://lkml.kernel.org/r/20250402181709.2386022-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Richard Weinberger <richard@nod.at> Cc: <x86@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: add folio_mk_pte()Matthew Wilcox (Oracle)
Remove a cast from folio to page in four callers of mk_pte(). Link: https://lkml.kernel.org/r/20250402181709.2386022-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Muchun Song <muchun.song@linux.dev> Cc: Richard Weinberger <richard@nod.at> Cc: <x86@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: set the pte dirty if the folio is already dirtyMatthew Wilcox (Oracle)
Patch series "Add folio_mk_pte()", v2. Today if you have a folio and want to create a PTE that points to the first page in it, you have to convert from a folio to a page. That's zero-cost today but will be more expensive in the future. I didn't want to add folio_mk_pte() to each architecture, and I didn't want to lose any optimisations that architectures have from their own implementation of mk_pte(). Fortunately, most architectures have by now turned their mk_pte() into a fairly bland variant of pfn_pte(), but s390 has a special optimisation that needs to be moved into generic code in the first patch. At the end of this patch set, we have mk_pte() and folio_mk_pte() in mm.h and each architecture only has to implement pfn_pte(). We've also eliminated mk_huge_pte(), mk_huge_pmd() and mk_pmd(). This patch (of 11): If the first access to a folio is a read that is then followed by a write, we can save a page fault. s390 implemented this in their mk_pte() in commit abf09bed3cce ("s390/mm: implement software dirty bits"), but other architectures can also benefit from this. Link: https://lkml.kernel.org/r/20250402181709.2386022-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250402181709.2386022-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> # for s390 Cc: Zi Yan <ziy@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Muchun Song <muchun.song@linux.dev> Cc: Richard Weinberger <richard@nod.at> Cc: <x86@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>