2024-07-03arch/x86: do not explicitly clear Reserved flag in free_pagetableOscar Salvador
In free_pagetable() we use the non-atomic version to clear the PageReserved bit from the page. This explicit clear is redundant: free_pagetable() will either call free_reserved_page() or put_page_bootmem(), which eventually ends up calling free_reserved_page(), and in there the PageReserved flag is already cleared. Drop the explicit clear. Link: https://lkml.kernel.org/r/20240527044523.29207-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: drop leftover comment references to pxx_huge()Peter Xu
pxx_huge() has been removed in recent commit 9636f055dae1 ("mm/treewide: remove pXd_huge()"), however there are still three comments referencing the API that got overlooked. Remove them. Link: https://lkml.kernel.org/r/20240527154855.528816-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03kmsan: introduce test_unpoison_memory()Brian Johannesmeyer
Add a regression test to ensure that kmsan_unpoison_memory() works the same as an unpoisoning operation added by the instrumentation. The test has two subtests: one that checks the instrumentation, and one that checks kmsan_unpoison_memory(). Each subtest initializes the first byte of a 4-byte buffer, then checks that the other 3 bytes are uninitialized. [glider@google.com: change description, remove comment about failing test case] Link: https://lkml.kernel.org/r/20240528104807.738758-2-glider@google.com Signed-off-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com> Link: https://lore.kernel.org/lkml/20240524232804.1984355-1-bjohannesmeyer@gmail.com/T/ Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/vmalloc: use __this_cpu_try_cmpxchg() in preload_this_cpu_lock()Uros Bizjak
Use __this_cpu_try_cmpxchg() instead of __this_cpu_cmpxchg (*ptr, old, new) == old in preload_this_cpu_lock(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg. The generated code improves from: 4bb6: 48 85 f6 test %rsi,%rsi 4bb9: 0f 84 10 fa ff ff je 45cf <...> 4bbf: 4c 89 e8 mov %r13,%rax 4bc2: 65 48 0f b1 35 00 00 cmpxchg %rsi,%gs:0x0(%rip) 4bc9: 00 00 4bcb: 48 85 c0 test %rax,%rax 4bce: 0f 84 fb f9 ff ff je 45cf <...> to: 4bb6: 48 85 f6 test %rsi,%rsi 4bb9: 0f 84 10 fa ff ff je 45cf <...> 4bbf: 4c 89 e8 mov %r13,%rax 4bc2: 65 48 0f b1 35 00 00 cmpxchg %rsi,%gs:0x0(%rip) 4bc9: 00 00 4bcb: 0f 84 fe f9 ff ff je 45cf <...> No functional change intended. Link: https://lkml.kernel.org/r/20240528144345.5980-2-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
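A minimal sketch of the pattern change described above (the percpu variable name preload_node and the surrounding control flow are illustrative, not the exact vmalloc code):

    /* before: compare the returned old value by hand */
    if (__this_cpu_cmpxchg(preload_node, NULL, va) == NULL)
        return;

    /* after: the try variant returns a bool; on x86 the ZF flag set by
     * CMPXCHG is used directly and the extra test instruction goes away */
    struct vmap_area *expected = NULL;

    if (__this_cpu_try_cmpxchg(preload_node, &expected, va))
        return;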
2024-07-03percpu: add __this_cpu_try_cmpxchg()Uros Bizjak
Add __this_cpu_try_cmpxchg() version of the percpu op. Link: https://lkml.kernel.org/r/20240528144345.5980-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03memcg: rearrange fields of mem_cgroup_per_nodeShakeel Butt
Kernel test robot reported [1] performance regression for will-it-scale test suite's page_fault2 test case for the commit 70a64b7919cb ("memcg: dynamically allocate lruvec_stats"). After inspection it seems like the commit has unintentionally introduced false cache sharing. After the commit the fields of mem_cgroup_per_node which get read on the performance critical path share the cacheline with the fields which get updated often on LRU page allocations or deallocations. This has caused contention on that cacheline and the workloads which manipulates a lot of LRU pages are regressed as reported by the test report. The solution is to rearrange the fields of mem_cgroup_per_node such that the false sharing is eliminated. Let's move all the read only pointers at the start of the struct, followed by memcg-v1 only fields and at the end fields which get updated often. Experiment setup: Ran fallocate1, fallocate2, page_fault1, page_fault2 and page_fault3 from the will-it-scale test suite inside a three level memcg with /tmp mounted as tmpfs on two different machines, one a single numa node and the other one, two node machine. $ ./[testcase]_processes -t $NR_CPUS -s 50 Results for single node, 52 CPU machine: Testcase base with-patch fallocate1 1031081 1431291 (38.80 %) fallocate2 1029993 1421421 (38.00 %) page_fault1 2269440 3405788 (50.07 %) page_fault2 2375799 3572868 (50.30 %) page_fault3 28641143 28673950 ( 0.11 %) Results for dual node, 80 CPU machine: Testcase base with-patch fallocate1 2976288 3641185 (22.33 %) fallocate2 2979366 3638181 (22.11 %) page_fault1 6221790 7748245 (24.53 %) page_fault2 6482854 7847698 (21.05 %) page_fault3 28804324 28991870 ( 0.65 %) Link: https://lkml.kernel.org/r/20240528164050.2625718-1-shakeel.butt@linux.dev Fixes: 70a64b7919cb ("memcg: dynamically allocate lruvec_stats") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Feng Tang <feng.tang@intel.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
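A rough sketch of the layout idea (field names and grouping are illustrative, not the exact struct): keep the read-mostly pointers used on the fault path away from the counters that are written on every LRU add/remove, so stores to the latter do not keep invalidating the cacheline holding the former:

    struct mem_cgroup_per_node {
        /* read-mostly: dereferenced on the page fault fast path */
        struct lruvec_stats *lruvec_stats;
        struct mem_cgroup *memcg;

        /* ... memcg-v1 only fields ... */

        /* write-heavy: updated on LRU page allocations/deallocations,
         * placed at the end, on separate cachelines */
        struct lruvec lruvec;
        unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
    };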
2024-07-03mm/hugetlb: mm/memory_hotplug: use a folio in scan_movable_pages()Sidhartha Kumar
By using a folio in scan_movable_pages() we convert the last user of the page-based hugetlb information macro functions to the folio version. After this conversion, we can safely remove the page-based definitions from include/linux/hugetlb.h. Link: https://lkml.kernel.org/r/20240530171427.242018-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: swap: entirely map large folios found in swapcacheChuanhua Han
When a large folio is found in the swapcache, the current implementation requires calling do_swap_page() nr_pages times, resulting in nr_pages page faults. This patch opts to map the entire large folio at once to minimize page faults. Additionally, redundant checks and early exits for ARM64 MTE restoring are removed. Link: https://lkml.kernel.org/r/20240529082824.150954-7-21cnbao@gmail.com Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: swap: make should_try_to_free_swap() support large-folioChuanhua Han
The function should_try_to_free_swap() operates under the assumption that swap-in always occurs at the normal page granularity, i.e., folio_nr_pages() = 1. However, in reality, for large folios, add_to_swap_cache() will invoke folio_ref_add(folio, nr). To accommodate large folio swap-in, this patch eliminates this assumption. Link: https://lkml.kernel.org/r/20240529082824.150954-6-21cnbao@gmail.com Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pagesBarry Song
Should do_swap_page() have the capability to directly map a large folio, metadata restoration becomes necessary for a specified number of pages denoted as nr. It's important to highlight that metadata restoration is solely required by the SPARC platform, which, however, does not enable THP_SWAP. Consequently, in the present kernel configuration, there exists no practical scenario where users necessitate the restoration of nr metadata. Platforms implementing THP_SWAP might invoke this function with nr values exceeding 1, subsequent to do_swap_page() successfully mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr() functions remain empty. Link: https://lkml.kernel.org/r/20240529082824.150954-5-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: introduce pte_move_swp_offset() helper which can move offset bidirectionallyBarry Song
There could arise a necessity to obtain the first pte_t from a swap pte_t located in the middle. For instance, this may occur within the context of do_swap_page(), where a page fault can potentially occur in any PTE of a large folio. To address this, the following patch introduces pte_move_swp_offset(), a function capable of bidirectional movement by a specified delta argument. Consequently, pte_next_swp_offset() will directly invoke it with delta = 1. Link: https://lkml.kernel.org/r/20240529082824.150954-4-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
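A sketch of the relationship described above (simplified; the real helpers also carry the soft-dirty and uffd-wp bits of the swap pte across):

    static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
    {
        swp_entry_t entry = pte_to_swp_entry(pte);

        return swp_entry_to_pte(swp_entry(swp_type(entry),
                                          swp_offset(entry) + delta));
    }

    static inline pte_t pte_next_swp_offset(pte_t pte)
    {
        return pte_move_swp_offset(pte, 1);
    }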
2024-07-03mm: remove the implementation of swap_free() and always use swap_free_nr()Barry Song
To streamline maintenance efforts, we propose removing the implementation of swap_free(). Instead, we can simply invoke swap_free_nr() with nr set to 1. swap_free_nr() is designed with a bitmap consisting of only one long, resulting in overhead that can be ignored for cases where nr equals 1. A prime candidate for leveraging swap_free_nr() lies within kernel/power/swap.c. Implementing this change facilitates the adoption of batch processing for hibernation. Link: https://lkml.kernel.org/r/20240529082824.150954-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <len.brown@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
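Based on the description above, swap_free() effectively becomes a thin wrapper; a minimal sketch:

    static inline void swap_free(swp_entry_t entry)
    {
        /* the batched helper with nr == 1; its single-long bitmap keeps
         * the extra overhead negligible in this case */
        swap_free_nr(entry, 1);
    }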
2024-07-03mm: swap: introduce swap_free_nr() for batched swap_free()Chuanhua Han
Patch series "large folios swap-in: handle refault cases first", v5. This patchset is extracted from the large folio swapin series[1], primarily addressing the handling of scenarios involving large folios in the swap cache. Currently, it is particularly focused on addressing the refaulting of mTHP, which is still undergoing reclamation. This approach aims to streamline code review and expedite the integration of this segment into the MM tree. It relies on Ryan's swap-out series[2], leveraging the helper function swap_pte_batch() introduced by that series. Presently, do_swap_page only encounters a large folio in the swap cache before the large folio is released by vmscan. However, the code should remain equally useful once we support large folio swap-in via swapin_readahead(). This approach can effectively reduce page faults and eliminate most redundant checks and early exits for MTE restoration in recent MTE patchset[3]. The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead() will be split into separate patch sets and sent at a later time. [1] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/ [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/ [3] https://lore.kernel.org/linux-mm/20240322114136.61386-1-21cnbao@gmail.com/ This patch (of 6): While swapping in a large folio, we need to free swaps related to the whole folio. To avoid frequently acquiring and releasing swap locks, it is better to introduce an API for batched free. Furthermore, this new function, swap_free_nr(), is designed to efficiently handle various scenarios for releasing a specified number, nr, of swap entries. Link: https://lkml.kernel.org/r/20240529082824.150954-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240529082824.150954-2-21cnbao@gmail.com Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: rmap: abstract updating per-node and per-memcg statsYosry Ahmed
A lot of intricacies go into updating the stats when adding or removing mappings: which stat index to use and which function. Abstract this away into a new static helper in rmap.c, __folio_mod_stat(). This adds an unnecessary call to folio_test_anon() in __folio_add_anon_rmap() and __folio_add_file_rmap(). However, the folio struct should already be in the cache at this point, so it shouldn't cause any noticeable overhead. No functional change intended. [hughd@google.com: fix /proc/meminfo] Link: https://lkml.kernel.org/r/49914517-dfc7-e784-fde0-0e08fafbecc2@google.com Link: https://lkml.kernel.org/r/20240506211333.346605-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: zswap: make same_filled functions folio-friendlyYosry Ahmed
A variable name 'page' is used in zswap_is_folio_same_filled() and zswap_fill_page() to point at the kmapped data in a folio. Use 'data' instead to avoid confusion and stop it from showing up when searching for 'page' references in mm/zswap.c. While we are at it, move the kmap/kunmap calls into zswap_fill_page(), make it take in a folio, and rename it to zswap_fill_folio(). Link: https://lkml.kernel.org/r/20240524033819.1953587-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
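A minimal sketch of what the reworked helper might look like after this change (assuming the same-filled value is stored as a repeated unsigned long, as the surrounding description suggests):

    static void zswap_fill_folio(struct folio *folio, unsigned long value)
    {
        unsigned long *data = kmap_local_folio(folio, 0);

        memset_l(data, value, PAGE_SIZE / sizeof(unsigned long));
        kunmap_local(data);
    }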
2024-07-03mm: zswap: use kmap_local_folio() in zswap_load()Yosry Ahmed
Eliminate the last explicit 'struct page' reference in mm/zswap.c. Link: https://lkml.kernel.org/r/20240524033819.1953587-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: zswap: use sg_set_folio() in zswap_{compress/decompress}()Yosry Ahmed
Patch series "mm: zswap: trivial folio conversions". Some trivial folio conversions in zswap code. This patch (of 3): sg_set_folio() is equivalent to sg_set_page() for order-0 folios, which are the only ones supported by zswap. Now zswap_decompress() can take in a folio directly. Link: https://lkml.kernel.org/r/20240524033819.1953587-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20240524033819.1953587-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: remove MIGRATE_SYNC_NO_COPY modeKefeng Wang
Commit 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY") introduced the MIGRATE_SYNC_NO_COPY mode to allow offloading the copy to a device DMA engine. It is only used by __migrate_device_pages() to decide whether or not to copy the old page, and the mode is only set in hmm. Since the previous cleanups removed the last place that sets MIGRATE_SYNC_NO_COPY, the now-unnecessary mode itself can be removed. Link: https://lkml.kernel.org/r/20240524052843.182275-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: migrate: remove migrate_folio_extra()Kefeng Wang
migrate_folio_extra() is now only called within migrate.c. Convert it to a static function and add a new src_private argument that can be shared by migrate_folio() and filemap_migrate_folio() to simplify the code a bit. Link: https://lkml.kernel.org/r/20240524052843.182275-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: migrate_device: unify migrate folio for MIGRATE_SYNC_NO_COPYKefeng Wang
__migrate_device_pages() never copies the page, which is why MIGRATE_SYNC_NO_COPY is passed into migrate_folio()/migrate_folio_extra(). An easier way is to call folio_migrate_mapping()/folio_migrate_flags() directly, which unifies and simplifies the device page migration path and also removes the only caller passing MIGRATE_SYNC_NO_COPY. Link: https://lkml.kernel.org/r/20240524052843.182275-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: migrate_device: use a newfolio in __migrate_device_pages()Kefeng Wang
Use a newfolio instead of newpage and convert more of __migrate_device_pages() to the folio API. Link: https://lkml.kernel.org/r/20240524052843.182275-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: migrate: simplify __buffer_migrate_folio()Kefeng Wang
Patch series "mm: cleanup MIGRATE_SYNC_NO_COPY mode". Commit 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY") introduce a new MIGRATE_SYNC_NO_COPY mode to allow to offload the copy to a device DMA engine, which is only used __migrate_device_pages() to decide whether or not copy the old page, and the MIGRATE_SYNC_NO_COPY mode only used in hmm, a easy way is just to call the folio_migrate_mapping() and folio_migrate_flags(), which help to remove the MIGRATE_SYNC_NO_COPY mode. This patch (of 5): Use filemap_migrate_folio() helper to simplify __buffer_migrate_folio(). Link: https://lkml.kernel.org/r/20240524052843.182275-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240524052843.182275-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03rmap: remove DEFINE_PAGE_VMA_WALK()Kefeng Wang
There have been no users since commit 40d707f33db5 ("mm/ksm: use folio in write_protect_page"), so remove DEFINE_PAGE_VMA_WALK(). Link: https://lkml.kernel.org/r/20240524053618.208895-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: remove page_mapping()Matthew Wilcox (Oracle)
All callers are now converted, delete this compatibility wrapper. Also fix up some comments which referred to page_mapping. Link: https://lkml.kernel.org/r/20240423225552.4113447-7-willy@infradead.org Link: https://lkml.kernel.org/r/20240524181813.698813-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
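For reference, the deleted compatibility wrapper was essentially the following, and any remaining page-based call site can open-code the same conversion (sketch):

    struct address_space *page_mapping(struct page *page)
    {
        return folio_mapping(page_folio(page));
    }

    /* callers now do the folio conversion explicitly, e.g.: */
    mapping = folio_mapping(page_folio(page));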
2024-07-03mm: memcontrol: remove page_memcg()Kefeng Wang
page_memcg() is now only called by mod_memcg_page_state(), so squash it into its only caller and remove page_memcg(). Link: https://lkml.kernel.org/r/20240524014950.187805-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory-failure: use helper llist_for_each_entry()Yifei Li
Switch from llist_for_each_entry_safe() to llist_for_each_entry() and delete the now-unused next variable. Because the linked list is not modified during the walk, the _safe variant is not required. No functional changes are intended. Link: https://lkml.kernel.org/r/20240513075830.2611-1-liyifei28@huawei.com Signed-off-by: Yifei Li <liyifei28@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
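A sketch of the two iteration forms (struct, member, and list-head names are illustrative):

    struct item { struct llist_node node; int val; };
    struct item *p, *next;
    struct llist_node *first = llist_del_all(&some_llist_head);
    int sum = 0;

    /* _safe variant: keeps a lookahead cursor so entries may be freed or
     * re-linked while walking */
    llist_for_each_entry_safe(p, next, first, node)
        sum += p->val;

    /* read-only walk: the plain variant suffices and 'next' goes away */
    llist_for_each_entry(p, first, node)
        sum += p->val;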
2024-07-03selftest: mm: Test if hugepage does not get leaked during __bio_release_pages()Donet Tom
Commit 1b151e2435fc ("block: Remove special-casing of compound pages") caused a change in behaviour when releasing the pages if the buffer does not start at the beginning of the page. This was because the calculation of the number of pages to release was incorrect. This was fixed by commit 38b43539d64b ("block: Fix page refcounts for unaligned buffers in __bio_release_pages()"). We pin the user buffer during direct I/O writes. If this buffer is a hugepage, bio_release_page() will unpin it and decrement all references and pin counts at ->bi_end_io. However, if any references to the hugepage remain post-I/O, the hugepage will not be freed upon unmap, leading to a memory leak. This patch verifies that a hugepage, used as a user buffer for DIO operations, is correctly freed upon unmapping, regardless of whether the offsets are aligned or unaligned w.r.t page boundary. Test Result Fail Scenario (Without the fix) -------------------------------------------------------- []# ./hugetlb_dio TAP version 13 1..4 No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 1 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 2 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 3 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 6 not ok 4 : Huge pages not freed! Totals: pass:3 fail:1 xfail:0 xpass:0 skip:0 error:0 Test Result PASS Scenario (With the fix) --------------------------------------------------------- []#./hugetlb_dio TAP version 13 1..4 No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 1 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 2 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 3 : Huge pages freed successfully ! No. Free pages before allocation : 7 No. Free pages after munmap : 7 ok 4 : Huge pages freed successfully ! Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0 [donettom@linux.ibm.com: address review comments from Muhammad] Link: https://lkml.kernel.org/r/20240604132801.23377-1-donettom@linux.ibm.com [donettom@linux.ibm.com: add this test to run_vmtests.sh] Link: https://lkml.kernel.org/r/20240607182000.6494-1-donettom@linux.ibm.com Link: https://lkml.kernel.org/r/20240523063905.3173-1-donettom@linux.ibm.com Fixes: 38b43539d64b ("block: Fix page refcounts for unaligned buffers in __bio_release_pages()") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tony Battersby <tonyb@cybernetics.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/zsmalloc: add MODULE_DESCRIPTION()Jeff Johnson
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/zsmalloc.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-4-8c20e7d26842@quicinc.com Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/kfence: add MODULE_DESCRIPTION()Jeff Johnson
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/kfence/kfence_test.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-3-8c20e7d26842@quicinc.com Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/dmapool: add MODULE_DESCRIPTION()Jeff Johnson
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/dmapool_test.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-2-8c20e7d26842@quicinc.com Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/hwpoison: add MODULE_DESCRIPTION()Jeff Johnson
Patch series "mm: add missing MODULE_DESCRIPTION() macros". This fixes the instances of "WARNING: modpost: missing MODULE_DESCRIPTION()" that I'm seeing in mm/. This patch (of 4): Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/hwpoison-inject.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-0-8c20e7d26842@quicinc.com Link: https://lkml.kernel.org/r/20240513-mm-md-v1-1-8c20e7d26842@quicinc.com Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/mm_init: use node's number of cpus in deferred_page_init_max_threadsEric Chanudet
x86_64 is already using the node's cpu as maximum threads. Make that the default for all archs setting DEFERRED_STRUCT_PAGE_INIT. This returns to the behavior prior making the function arch-specific with commit ecd096506922 ("mm: make deferred init's max threads arch-specific"). Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms shows faster deferred_init_memmap completions: | | x13s | SA8775p-ride | Ampere R137-P31 | Ampere HR330 | | | Metal, 32GB | VM, 36GB | VM, 58GB | Metal, 128GB | | | 8cpus | 8cpus | 8cpus | 32cpus | |---------|-------------|--------------|-----------------|--------------| | threads | ms (%) | ms (%) | ms (%) | ms (%) | |---------|-------------|--------------|-----------------|--------------| | 1 | 108 (0%) | 72 (0%) | 224 (0%) | 324 (0%) | | cpus | 24 (-77%) | 36 (-50%) | 40 (-82%) | 56 (-82%) | Michael Ellerman reported: : On a machine here (1TB, 40 cores, 4KB pages) the existing code gives: : : [ 0.500124] node 2 deferred pages initialised in 210ms : [ 0.515790] node 3 deferred pages initialised in 230ms : [ 0.516061] node 0 deferred pages initialised in 230ms : [ 0.516522] node 7 deferred pages initialised in 230ms : [ 0.516672] node 4 deferred pages initialised in 230ms : [ 0.516798] node 6 deferred pages initialised in 230ms : [ 0.517051] node 5 deferred pages initialised in 230ms : [ 0.523887] node 1 deferred pages initialised in 240ms : : vs with the patch: : : [ 0.379613] node 0 deferred pages initialised in 90ms : [ 0.380388] node 1 deferred pages initialised in 90ms : [ 0.380540] node 4 deferred pages initialised in 100ms : [ 0.390239] node 6 deferred pages initialised in 100ms : [ 0.390249] node 2 deferred pages initialised in 100ms : [ 0.390786] node 3 deferred pages initialised in 110ms : [ 0.396721] node 5 deferred pages initialised in 110ms : [ 0.397095] node 7 deferred pages initialised in 110ms : : Which is a nice speedup. [echanude@redhat.com: v3] Link: https://lkml.kernel.org/r/20240528185455.643227-4-echanude@redhat.com Link: https://lkml.kernel.org/r/20240522203758.626932-4-echanude@redhat.com Signed-off-by: Eric Chanudet <echanude@redhat.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: batch unlink_file_vma calls in free_pgd_rangeMateusz Guzik
Execs of dynamically linked binaries at 20-ish cores are bottlenecked on the i_mmap_rwsem semaphore, while the biggest singular contributor is free_pgd_range inducing the lock acquire back-to-back for all consecutive mappings of a given file. Tracing the count of said acquires while building the kernel shows: [1, 2) 799579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [2, 3) 0 | | [3, 4) 3009 | | [4, 5) 3009 | | [5, 6) 326442 |@@@@@@@@@@@@@@@@@@@@@ | So in particular there were 326442 opportunities to coalesce 5 acquires into 1. Doing so increases execs per second by 4% (~50k to ~52k) when running the benchmark linked below. The lock remains the main bottleneck, I have not looked at other spots yet. Bench can be found here: http://apollo.backplane.com/DFlyMisc/doexec.c $ cc -O2 -o shared-doexec doexec.c $ ./shared-doexec $(nproc) Note this particular test makes sure binaries are separate, but the loader is shared. Stats collected on the patched kernel (+ "noinline") with: bpftrace -e 'kprobe:unlink_file_vma_batch_process { @ = lhist(((struct unlink_vma_file_batch *)arg0)->count, 0, 8, 1); }' Link: https://lkml.kernel.org/r/20240521234321.359501-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
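A rough sketch of the batching approach (the struct layout matches what the bpftrace probe above implies, but the details are approximate, not the exact kernel code):

    struct unlink_vma_file_batch {
        int count;
        struct vm_area_struct *vmas[8];
    };

    /* gather consecutive VMAs backed by the same file, then take the
     * file's i_mmap_rwsem once and unlink them all in one go */
    static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
    {
        struct address_space *mapping;
        int i;

        if (!vb->count)
            return;
        mapping = vb->vmas[0]->vm_file->f_mapping;
        i_mmap_lock_write(mapping);
        for (i = 0; i < vb->count; i++)
            vma_interval_tree_remove(vb->vmas[i], &mapping->i_mmap);
        i_mmap_unlock_write(mapping);
        vb->count = 0;
    }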
2024-07-03mm/memory-failure: send SIGBUS in the event of thp split failJane Chu
While handling hwpoison in a THP page, it is possible that try_to_split_thp_page() fails. For example, when the THP page has been RDMA pinned. At this point, the kernel cannot isolate the poisoned THP page, all it could do is to send a SIGBUS to the user process with meaningful payload to give user-level recovery a chance. Link: https://lkml.kernel.org/r/20240524215306.2705454-6-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory-failure: move hwpoison_filter() higher upJane Chu
Move hwpoison_filter() higher up, as there is no need to spend a lot of cycles only to find out later that the page is supposed to be skipped from hwpoison handling. Link: https://lkml.kernel.org/r/20240524215306.2705454-5-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory-failure: improve memory failure action_result messagesJane Chu
Add two explicit MF_MSG messages describing failures in get_hwpoison_page(). Attempt to document the definitions of the various action names, and make a few adjustments to the action_result() calls. Link: https://lkml.kernel.org/r/20240524215306.2705454-4-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/madvise: add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON)Jane Chu
The soft hwpoison injector via madvise(MADV_HWPOISON) operates in a synchronous way, in the sense that the injector is also a process under test: should it have the poisoned page mapped in its address space, it should get killed just as in a real UE situation. Doing so aligns with what the madvise(2) man page says: "This operation may result in the calling process receiving a SIGBUS and the page being unmapped." Link: https://lkml.kernel.org/r/20240524215306.2705454-3-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <oalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
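A small userspace illustration of how this injector is driven (requires CAP_SYS_ADMIN; a hypothetical snippet, not taken from the selftests):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(p, 'x', len);                /* make sure the page is populated */
        if (madvise(p, len, MADV_HWPOISON)) /* inject poison into the backing page */
            perror("madvise(MADV_HWPOISON)");
        /* with MF_ACTION_REQUIRED set by this patch, the injecting process
         * itself is now eligible for a SIGBUS, matching a real UE */
        return 0;
    }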
2024-07-03mm/memory-failure: try to send SIGBUS even if unmap failedJane Chu
Patch series "Enhance soft hwpoison handling and injection", v4. This series is aimed at the following enhancements: - Let one hwpoison injector, that is, madvise(MADV_HWPOISON) to behave more like as if a real UE occurred. Because the other two injectors such as hwpoison-inject and the 'einj' on x86 can't, and it seems to me we need a better simulation to real UE scenario. - For years, if the kernel is unable to unmap a hwpoisoned page, it send a SIGKILL instead of SIGBUS to prevent user process from potentially accessing the page again. But in doing so, the user process also lose important information: vaddr, for recovery. Fortunately, the kernel already has code to kill process re-accessing a hwpoisoned page, so remove the '!unmap_success' check. - Right now, if a thp page under GUP longterm pin is hwpoisoned, and kernel cannot split the thp page, memory-failure simply ignores the UE and returns. That's not ideal, it could deliver a SIGBUS with useful information for userspace recovery. This patch (of 5): For years when it comes down to kill a process due to hwpoison, a SIGBUS is delivered only if unmap has been successful. Otherwise, a SIGKILL is delivered. And the reason for that is to prevent the involved process from accessing the hwpoisoned page again. Since then a lot has changed, a hwpoisoned page is marked and upon being re-accessed, the memory-failure handler invokes kill_accessing_process() to kill the process immediately. So let's take out the '!unmap_success' factor and try to deliver SIGBUS if possible. Link: https://lkml.kernel.org/r/20240524215306.2705454-1-jane.chu@oracle.com Link: https://lkml.kernel.org/r/20240524215306.2705454-2-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: use update_mmu_tlb_range() to simplify codeBang Li
Let us simplify the code by update_mmu_tlb_range(). Link: https://lkml.kernel.org/r/20240522061204.117421-4-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
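The kind of simplification this enables in callers that fault in several PTEs at once (a sketch; variable names are illustrative):

    /* before: one TLB update call per PTE */
    for (i = 0; i < nr_pages; i++)
        update_mmu_tlb(vma, addr + i * PAGE_SIZE, ptep + i);

    /* after: a single batched call for the whole range */
    update_mmu_tlb_range(vma, addr, ptep, nr_pages);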
2024-07-03mm: implement update_mmu_tlb() using update_mmu_tlb_range()Bang Li
Let's make update_mmu_tlb() simply a generic wrapper around update_mmu_tlb_range(). Only the latter can now be overridden by the architecture. We can now remove __HAVE_ARCH_UPDATE_MMU_TLB as well. Link: https://lkml.kernel.org/r/20240522061204.117421-3-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
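Based on the description, the generic wrapper presumably boils down to something like:

    static inline void update_mmu_tlb(struct vm_area_struct *vma,
                                      unsigned long address, pte_t *ptep)
    {
        update_mmu_tlb_range(vma, address, ptep, 1);
    }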
2024-07-03mm: add update_mmu_tlb_range()Bang Li
Patch series "Add update_mmu_tlb_range() to simplify code", v4. This series of commits mainly adds the update_mmu_tlb_range() to batch update tlb in an address range and implement update_mmu_tlb() using update_mmu_tlb_range(). After commit 19eaf44954df ("mm: thp: support allocation of anonymous multi-size THP"), We may need to batch update tlb of a certain address range by calling update_mmu_tlb() in a loop. Using the update_mmu_tlb_range(), we can simplify the code and possibly reduce the execution of some unnecessary code in some architectures. This patch (of 3): Add update_mmu_tlb_range(), we can batch update tlb of an address range. Link: https://lkml.kernel.org/r/20240522061204.117421-1-libang.li@antgroup.com Link: https://lkml.kernel.org/r/20240522061204.117421-2-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testingDev Jain
Post FEAT_LPA2, the Aarch64 Linux kernel extends higher address support to 4K and 16K translation granules. To support testing this out, we need to do away with static initialization of page size, while still maintaining the nice array of testcases; this can be achieved by initializing and populating the array as a stack variable, and filling in the page size and hugepage size at runtime. Link: https://lkml.kernel.org/r/20240522070435.773918-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03selftests/mm: va_high_addr_switch: reduce test noiseDev Jain
Patch series "Restructure va_high_addr_switch". The va_high_addr_switch memory selftest tests out some corner cases related to allocation and page/hugepage faulting around the switch boundary. Currently, the page size and hugepage size have been statically defined. Post FEAT_LPA2, the Aarch64 Linux kernel adds support for 4k and 16k translation granules on higher addresses; we restructure the test to support the same. In addition, we avoid invocation of the binary twice, in the shell script, to reduce test noise. This patch (of 2): When invoking the binary with "--run-hugetlb" flag, the testcases involving the base page are anyways going to be run. Therefore, remove duplication by invoking the binary only once. Link: https://lkml.kernel.org/r/20240522070435.773918-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20240522070435.773918-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/rmap: sanity check that zeropages are not passed to RMAPDavid Hildenbrand
Using insert_page() we might have previously ended up passing the zeropage into rmap code. Make sure that won't happen again. Note that we won't check the huge zeropage for now, which might still end up in RMAP code. Link: https://lkml.kernel.org/r/20240522125713.775114-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() ↵David Hildenbrand
and vmf_insert_mixed() For now we only get the (small) zeropage mapped to user space in four cases (excluding VM_PFNMAP mappings, such as /proc/vmstat): (1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON): do_anonymous_page() will not refcount it and map it pte_mkspecial() (2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem (MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and map it pte_mkspecial(). (3) KSM in mergeable VMA (anonymous VMA or COW mapping). cmp_and_merge_page() will not refcount it and map it pte_mkspecial(). (4) FSDAX as an optimization for holes. vmf_insert_mixed()->__vm_insert_mixed() might end up calling insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the zeropage and not mapping it pte_mkspecial(). With CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will not refcount it and map it pte_mkspecial(). In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets VM_MIXEDMAP, we removed it for ext4 fsdax in commit e1fb4a086495 ("dax: remove VM_MIXEDMAP for fsdax and device dax") and for XFS in commit e1fb4a086495 ("dax: remove VM_MIXEDMAP for fsdax and device dax"). Without CONFIG_ARCH_HAS_PTE_SPECIAL and with VM_MIXEDMAP, vm_normal_page() would currently return the zeropage. We'll refcount the zeropage when mapping and when unmapping. Without CONFIG_ARCH_HAS_PTE_SPECIAL and without VM_MIXEDMAP, vm_normal_page() would currently refuse to return the zeropage. So we'd refcount it when mapping but not when unmapping it ... do we have fsdax without CONFIG_ARCH_HAS_PTE_SPECIAL in practice? Hard to tell. Independent of that, we should never refcount the zeropage when we might be holding that reference for a long time, because even without an accounting imbalance we might overflow the refcount. As there is interest in using the zeropage also in other VM_MIXEDMAP mappings, let's add clean support for that in the cases where it makes sense: (A) Never refcount the zeropage when mapping it: In insert_page(), special-case the zeropage, do not refcount it, and use pte_mkspecial(). Don't involve insert_pfn(), adjusting insert_page() looks cleaner than branching off to insert_pfn(). (B) Never refcount the zeropage when unmapping it: In vm_normal_page(), also don't return the zeropage in a VM_MIXEDMAP mapping without CONFIG_ARCH_HAS_PTE_SPECIAL. Add a VM_WARN_ON_ONCE() sanity check if we'd ever return the zeropage, which could happen if someone forgets to set pte_mkspecial() when mapping the zeropage. Document that. (C) Allow the zeropage only where reasonable s390x never wants the zeropage in some processes running legacy KVM guests that make use of storage keys. So disallow that. Further, using the zeropage in COW mappings is unproblematic (just what we do for other COW mappings), because FAULT_FLAG_UNSHARE can just unshare it and GUP with FOLL_LONGTERM would work as expected. Similarly, mappings that can never have writable PTEs (implying no write faults) are also not problematic, because nothing could end up mapping the PTE writable by mistake later. But in case we could have writable PTEs, we'll only allow the zeropage in FSDAX VMAs, that are incompatible with GUP and are blocked there completely. We'll always require the zeropage to be mapped with pte_special(). GUP-fast will reject the zeropage that way, but GUP-slow will allow it. (Note that GUP does not refcount the zeropage with FOLL_PIN, because there were issues with overflowing the refcount in the past). 
Add sanity checks to can_change_pte_writable() and wp_page_reuse(), to catch early during testing if we'd ever find a zeropage unexpectedly in code that wants to upgrade write permissions. Convert the BUG_ON in vm_mixed_ok() to an ordinary check and simply fail with VM_FAULT_SIGBUS, like we do for other sanity checks. Drop the stale comment regarding reserved pages from insert_page(). Note that: * we won't mess with VM_PFNMAP mappings for now. remap_pfn_range() and vmf_insert_pfn() would allow the zeropage in some cases and not refcount it. * vmf_insert_pfn*() will reject the zeropage in VM_MIXEDMAP mappings and we'll leave that alone for now. People can simply use one of the other interfaces. * we won't bother with the huge zeropage for now. It's never PTE-mapped and also GUP does not special-case it yet. Link: https://lkml.kernel.org/r/20240522125713.775114-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
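A sketch of the insert_page() special-casing described in (A) above (simplified, omitting the locking and error handling of the real function; vma, addr, pte and prot are the usual insert_page() context):

    /* never take a reference on the shared zeropage when mapping it */
    if (is_zero_pfn(page_to_pfn(page))) {
        pte_t zero_pte = pte_mkspecial(pfn_pte(page_to_pfn(page), prot));

        set_pte_at(vma->vm_mm, addr, pte, zero_pte);
        return 0;
    }
    /* otherwise fall through to the normal refcounted mapping path */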
2024-07-03mm/memory: move page_count() check into validate_page_before_insert()David Hildenbrand
Patch series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()", v2. There is interest in mapping zeropages via vm_insert_pages() [1] into MAP_SHARED mappings. For now, we only get zeropages in MAP_SHARED mappings via vmf_insert_mixed() from FSDAX code, and I think it's a bit shaky in some cases because we refcount the zeropage when mapping it but not necessarily always when unmapping it ... and we should actually never refcount it. It's all a bit tricky, especially how zeropages in MAP_SHARED mappings interact with GUP (FOLL_LONGTERM), mprotect(), write-faults and s390x forbidding the shared zeropage (rewrite [2] s now upstream). This series tries to take the careful approach of only allowing the zeropage where it is likely safe to use (which should cover the existing FSDAX use case and [1]), preventing that it could accidentally get mapped writable during a write fault, mprotect() etc, and preventing issues with FOLL_LONGTERM in the future with other users. Tested with a patch from Vincent that uses the zeropage in context of [1]. [1] https://lkml.kernel.org/r/20240430111354.637356-1-vdonnefort@google.com [2] https://lkml.kernel.org/r/20240411161441.910170-1-david@redhat.com This patch (of 3): We'll now also cover the case where insert_page() is called from __vm_insert_mixed(), which sounds like the right thing to do. Link: https://lkml.kernel.org/r/20240522125713.775114-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03selftests: mm: check return valuesMuhammad Usama Anjum
Check return values and, on failure, return an error or skip the tests. Link: https://lkml.kernel.org/r/20240520185248.1801945-1-usama.anjum@collabora.com Fixes: 46fd75d4a3c9 ("selftests: mm: add pagemap ioctl tests") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/hugetlb: remove {Set,Clear}Hpage macrosSidhartha Kumar
All users have been converted to use the folio version of these macros, we can safely remove the page based interface. Link: https://lkml.kernel.org/r/20240520224407.110062-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/swap: reduce swap cache search spaceKairui Song
Currently we use one swap_address_space for every 64M chunk to reduce lock contention, this is like having a set of smaller swap files inside one swap device. But when doing swap cache look up or insert, we are still using the offset of the whole large swap device. This is OK for correctness, as the offset (key) is unique. But Xarray is specially optimized for small indexes, it creates the radix tree levels lazily to be just enough to fit the largest key stored in one Xarray. So we are wasting tree nodes unnecessarily. For 64M chunk it should only take at most 3 levels to contain everything. But if we are using the offset from the whole swap device, the offset (key) value will be way beyond 64M, and so will the tree level. Optimize this by using a new helper swap_cache_index to get a swap entry's unique offset in its own 64M swap_address_space. I see a ~1% performance gain in benchmark and actual workload with high memory pressure. Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO is enabled, as swap out path can never skip swap cache): Before: 6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k 0inputs+0outputs (55major+33555018minor)pagefaults 0swaps After (1.8% faster): 6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k 0inputs+0outputs (54major+33555027minor)pagefaults 0swaps Similar result with MySQL and sysbench using swap: Before: 94055.61 qps After (0.8% faster): 94834.91 qps Radix tree slab usage is also very slightly lower. Link: https://lkml.kernel.org/r/20240521175854.96038-12-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
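A sketch of the new helper and the constants it relies on (values shown for 4K pages, where 2^14 pages = 64M; illustrative of the chunking described above rather than a verbatim copy):

    #define SWAP_ADDRESS_SPACE_SHIFT 14  /* 2^14 4K pages = 64M per address_space */
    #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
    #define SWAP_ADDRESS_SPACE_MASK  (SWAP_ADDRESS_SPACE_PAGES - 1)

    /* index within the entry's own 64M swap_address_space, instead of the
     * offset into the whole swap device */
    static inline pgoff_t swap_cache_index(swp_entry_t entry)
    {
        return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
    }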
2024-07-03mm: drop page_index and simplify folio_indexKairui Song
There are two helpers for retrieving the index within address space for mixed usage of swap cache and page cache: - page_index - folio_index This commit drops page_index, as we have eliminated all users, and converts folio_index's helper __page_file_index to use folio to avoid the page conversion. Link: https://lkml.kernel.org/r/20240521175854.96038-11-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>