summaryrefslogtreecommitdiff
path: root/mm/rmap.c
AgeCommit message (Collapse)Author
2023-02-02mm: remove mlock_vma_page()Matthew Wilcox (Oracle)
All callers now have a folio and can call mlock_vma_folio(). Update the documentation to refer to mlock_vma_folio(). Link: https://lkml.kernel.org/r/20230116192827.2146732-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: remove __HAVE_ARCH_PTE_SWP_EXCLUSIVEDavid Hildenbrand
__HAVE_ARCH_PTE_SWP_EXCLUSIVE is now supported by all architectures that support swp PTEs, so let's drop it. Link: https://lkml.kernel.org/r/20230113171026.582290-27-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: mlock: update the interface to use foliosLorenzo Stoakes
Update the mlock interface to accept folios rather than pages, bringing the interface in line with the internal implementation. munlock_vma_page() still requires a page_folio() conversion, however this is consistent with the existent mlock_vma_page() implementation and a product of rmap still dealing in pages rather than folios. Link: https://lkml.kernel.org/r/cba12777c5544305014bc0cbec56bb4cc71477d8.1673526881.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: convert deferred_split_huge_page() to deferred_split_folio()Matthew Wilcox (Oracle)
Now that both callers use a folio, pass the folio in and save a call to compound_head(). Link: https://lkml.kernel.org/r/20230111142915.1001531-28-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: use a folio in hugepage_add_anon_rmap() and hugepage_add_new_anon_rmap()Matthew Wilcox (Oracle)
Remove uses of compound_mapcount_ptr() Link: https://lkml.kernel.org/r/20230111142915.1001531-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: add folio_add_new_anon_rmap()Matthew Wilcox (Oracle)
In contrast to other rmap functions, page_add_new_anon_rmap() is always called with a freshly allocated page. That means it can't be called with a tail page. Turn page_add_new_anon_rmap() into folio_add_new_anon_rmap() and add a page_add_new_anon_rmap() wrapper. Callers can be converted individually. [akpm@linux-foundation.org: fix NOMMU build. page_add_new_anon_rmap() requires CONFIG_MMU] [willy@infradead.org: folio-compat.c needs rmap.h] Link: https://lkml.kernel.org/r/20230111142915.1001531-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: convert page_add_file_rmap() to use a folio internallyMatthew Wilcox (Oracle)
The API for page_add_file_rmap() needs to be page-based, because we can add mappings of individual pages. But inside the function, we want to only call compound_head() once and then use the folio APIs instead of the page APIs that each call compound_head(). Link: https://lkml.kernel.org/r/20230111142915.1001531-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: convert page_add_anon_rmap() to use a folio internallyMatthew Wilcox (Oracle)
The API for page_add_anon_rmap() needs to be page-based, because we can add mappings of individual pages. But inside the function, we want to only call compound_head() once and then use the folio APIs instead of the page APIs that each call compound_head(). Link: https://lkml.kernel.org/r/20230111142915.1001531-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: convert page_remove_rmap() to use a folio internallyMatthew Wilcox (Oracle)
The API for page_remove_rmap() needs to be page-based, because we can remove mappings of pages individually. But inside the function, we want to only call compound_head() once and then use the folio APIs instead of the page APIs that each call compound_head(). Link: https://lkml.kernel.org/r/20230111142915.1001531-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: convert total_compound_mapcount() to folio_total_mapcount()Matthew Wilcox (Oracle)
Instead of enforcing that the argument must be a head page by naming, enforce it with the compiler by making it a folio. Also rename the counter in struct folio from _compound_mapcount to _entire_mapcount. Link: https://lkml.kernel.org/r/20230111142915.1001531-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: convert head_subpages_mapcount() into folio_nr_pages_mapped()Matthew Wilcox (Oracle)
Calling this 'mapcount' is confusing since mapcount is usually the number of times something is mapped; instead this is the number of mapped pages. It's also better to enforce that this is a folio rather than a head page. Move folio_nr_pages_mapped() into mm/internal.h since this is not something we want device drivers or filesystems poking at. Get rid of folio_subpages_mapcount_ptr() and use folio->_nr_pages_mapped directly. Link: https://lkml.kernel.org/r/20230111142915.1001531-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only exportAlistair Popple
mmu_notifier_range_update_to_read_only() was originally introduced in commit c6d23413f81b ("mm/mmu_notifier: mmu_notifier_range_update_to_read_only() helper") as an optimisation for device drivers that know a range has only been mapped read-only. However there are no users of this feature so remove it. As it is the only user of the struct mmu_notifier_range.vma field remove that also. Link: https://lkml.kernel.org/r/20230110025722.600912-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18mm: add vma_has_recency()Yu Zhao
Add vma_has_recency() to indicate whether a VMA may exhibit temporal locality that the LRU algorithm relies on. This function returns false for VMAs marked by VM_SEQ_READ or VM_RAND_READ. While the former flag indicates linear access, i.e., a special case of spatial locality, both flags indicate a lack of temporal locality, i.e., the reuse of an area within a relatively small duration. "Recency" is chosen over "locality" to avoid confusion between temporal and spatial localities. Before this patch, the active/inactive LRU only ignored the accessed bit from VMAs marked by VM_SEQ_READ. After this patch, the active/inactive LRU and MGLRU share the same logic: they both ignore the accessed bit if vma_has_recency() returns false. For the active/inactive LRU, the following fio test showed a [6, 8]% increase in IOPS when randomly accessing mapped files under memory pressure. kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo) kb=$((kb - 8*1024*1024)) modprobe brd rd_nr=1 rd_size=$kb dd if=/dev/zero of=/dev/ram0 bs=1M mkfs.ext4 /dev/ram0 mount /dev/ram0 /mnt/ swapoff -a fio --name=test --directory=/mnt/ --ioengine=mmap --numjobs=8 \ --size=8G --rw=randrw --time_based --runtime=10m \ --group_reporting The discussion that led to this patch is here [1]. Additional test results are available in that thread. [1] https://lore.kernel.org/r/Y31s%2FK8T85jh05wH@google.com/ Link: https://lkml.kernel.org/r/20221230215252.2628425-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Righi <andrea.righi@canonical.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michael Larabel <Michael@MichaelLarabel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18mm: rmap: remove lock_page_memcg()Johannes Weiner
The previous patch made sure charge moving only touches pages for which page_mapped() is stable. lock_page_memcg() is no longer needed. Link: https://lkml.kernel.org/r/20221206171340.139790-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm,thp,rmap: fix races between updates of subpages_mapcountHugh Dickins
Commit 4b51634cd16a, introducing the COMPOUND_MAPPED bit, paid attention to the impossibility of subpages_mapcount ever appearing negative; but did not attend to those races in which it can momentarily appear larger than thought possible. These arise from how page_remove_rmap() first decrements page->_mapcount or compound_mapcount, then, if that transition goes negative (logical 0), decrements subpages_mapcount. The initial decrement lets a racing page_add_*_rmap() reincrement _mapcount or compound_mapcount immediately, and then in rare cases its corresponding increment of subpages_mapcount may be completed before page_remove_rmap()'s decrement. There could even (with increasing unlikelihood) be a series of increments intermixed with the decrements. In practice, checking subpages_mapcount with a temporary WARN on range, has caught values of 0x1000000 (2*COMPOUND_MAPPED, when move_pages() was using remove_migration_pmd()) and 0x800201 (do_huge_pmd_wp_page() using __split_huge_pmd()): page_add_anon_rmap() racing page_remove_rmap(), as predicted. I certainly found it harder to reason about than when bit_spin_locked, but the easy case gives a clue to how to handle the harder case. The easy case being the three !(nr & COMPOUND_MAPPED) checks, which should obviously be replaced by (nr < COMPOUND_MAPPED) checks - to count a page as compound mapped, even while the bit in that position is 0. The harder case is when trying to decide how many subpages are newly covered or uncovered, when compound map is first added or last removed: not knowing all that racily happened between first and second atomic ops. But the easy way to handle that, is again to count the page as compound mapped all the while that its subpages_mapcount indicates so - ignoring the _mapcount or compound_mapcount transition while it is on the way to being reversed. Link: https://lkml.kernel.org/r/4388158-3092-a960-ff2d-55f2b0fe4ef8@google.com Fixes: 4b51634cd16a ("mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mappedHugh Dickins
Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now? Yes. Not by atomic64_t or cmpxchg games, those get difficult on 32-bit; but if we slightly abuse subpages_mapcount by additionally demanding that one bit be set there when the compound page is PMD-mapped, then a cascade of two atomic ops is able to maintain the stats without bit_spin_lock. This is harder to reason about than when bit_spin_locked, but I believe safe; and no drift in stats detected when testing. When there are racing removes and adds, of course the sequence of operations is less well- defined; but each operation on subpages_mapcount is atomically good. What might be disastrous, is if subpages_mapcount could ever fleetingly appear negative: but the pte lock (or pmd lock) these rmap functions are called under, ensures that a last remove cannot race ahead of a first add. Continue to make an exception for hugetlb (PageHuge) pages, though that exception can be easily removed by a further commit if necessary: leave subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just carry on checking compound_mapcount too in folio_mapped(), page_mapped(). Evidence is that this way goes slightly faster than the previous implementation in all cases (pmds after ptes now taking around 103ms); and relieves us of worrying about contention on the bit_spin_lock. Link: https://lkml.kernel.org/r/3978f3ca-5473-55a7-4e14-efea5968d892@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Carpenter <error27@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm,thp,rmap: subpages_mapcount of PTE-mapped subpagesHugh Dickins
Patch series "mm,thp,rmap: rework the use of subpages_mapcount", v2. This patch (of 3): Following suggestion from Linus, instead of counting every PTE map of a compound page in subpages_mapcount, just count how many of its subpages are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED and NR_FILE_MAPPED stats, without any need for a locked scan of subpages; and requires updating the count less often. This does then revert total_mapcount() and folio_mapcount() to needing a scan of subpages; but they are inherently racy, and need no locking, so Linus is right that the scans are much better done there. Plus (unlike in 6.1 and previous) subpages_mapcount lets us avoid the scan in the common case of no PTE maps. And page_mapped() and folio_mapped() remain scanless and just as efficient with the new meaning of subpages_mapcount: those are the functions which I most wanted to remove the scan from. The updated page_dup_compound_rmap() is no longer suitable for use by anon THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be used for that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted. Evidence is that this way goes slightly faster than the previous implementation for most cases; but significantly faster in the (now scanless) pmds after ptes case, which started out at 870ms and was brought down to 495ms by the previous series, now takes around 105ms. Link: https://lkml.kernel.org/r/a5849eca-22f1-3517-bf29-95d982242742@google.com Link: https://lkml.kernel.org/r/eec17e16-4e1-7c59-f1bc-5bca90dac919@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Carpenter <error27@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm,thp,rmap: handle the normal !PageCompound case firstHugh Dickins
Commit ("mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts") propagated the "if (compound) {lock} else if (PageCompound) {lock} else {atomic}" pattern throughout; but Linus hated the way that gives primacy to the uncommon case: switch to "if (!PageCompound) {atomic} else if (compound) {lock} else {lock}" throughout. Linus has a bigger idea for how to improve it all, but here just make that rearrangement. Link: https://lkml.kernel.org/r/fca2f694-2098-b0ef-d4e-f1d8b94d318c@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm,thp,rmap: lock_compound_mapcounts() on THP mapcountsHugh Dickins
Fix the races in maintaining compound_mapcount, subpages_mapcount and subpage _mapcount by using PG_locked in the first tail of any compound page for a bit_spin_lock() on such modifications; skipping the usual atomic operations on those fields in this case. Bring page_remove_file_rmap() and page_remove_anon_compound_rmap() back into page_remove_rmap() itself. Rearrange page_add_anon_rmap() and page_add_file_rmap() and page_remove_rmap() to follow the same "if (compound) {lock} else if (PageCompound) {lock} else {atomic}" pattern (with a PageTransHuge in the compound test, like before, to avoid BUG_ONs and optimize away that block when THP is not configured). Move all the stats updates outside, after the bit_spin_locked section, so that it is sure to be a leaf lock. Add page_dup_compound_rmap() to manage compound locking versus atomics in sync with the rest. In particular, hugetlb pages are still using the atomics: to avoid unnecessary interference there, and because they never have subpage mappings; but this exception can easily be changed. Conveniently, page_dup_compound_rmap() turns out to suit an anon THP's __split_huge_pmd_locked() too. bit_spin_lock() is not popular with PREEMPT_RT folks: but PREEMPT_RT sensibly excludes TRANSPARENT_HUGEPAGE already, so its only exposure is to the non-hugetlb non-THP pte-mapped compound pages (with large folios being currently dependent on TRANSPARENT_HUGEPAGE). There is never any scan of subpages in this case; but we have chosen to use PageCompound tests rather than PageTransCompound tests to gate the use of lock_compound_mapcounts(), so that page_mapped() is correct on all compound pages, whether or not TRANSPARENT_HUGEPAGE is enabled: could that be a problem for PREEMPT_RT, when there is contention on the lock - under heavy concurrent forking for example? If so, then it can be turned into a sleeping lock (like folio_lock()) when PREEMPT_RT. A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took 18 seconds on small pages, and used to take 1 second on huge pages, but now takes 115 milliseconds on huge pages. Mapping by pmds a second time used to take 860ms and now takes 86ms; mapping by pmds after mapping by ptes (when the scan is needed) used to take 870ms and now takes 495ms. Mapping huge pages by ptes is largely unaffected but variable: between 5% faster and 5% slower in what I've recorded. Contention on the lock is likely to behave worse than contention on the atomics behaved. Link: https://lkml.kernel.org/r/1b42bd1a-8223-e827-602f-d466c2db7d3c@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm,thp,rmap: simplify compound page mapcount handlingHugh Dickins
Compound page (folio) mapcount calculations have been different for anon and file (or shmem) THPs, and involved the obscure PageDoubleMap flag. And each huge mapping and unmapping of a file (or shmem) THP involved atomically incrementing and decrementing the mapcount of every subpage of that huge page, dirtying many struct page cachelines. Add subpages_mapcount field to the struct folio and first tail page, so that the total of subpage mapcounts is available in one place near the head: then page_mapcount() and total_mapcount() and page_mapped(), and their folio equivalents, are so quick that anon and file and hugetlb don't need to be optimized differently. Delete the unloved PageDoubleMap. page_add and page_remove rmap functions must now maintain the subpages_mapcount as well as the subpage _mapcount, when dealing with pte mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and NR_FILE_MAPPED statistics still needs reading through the subpages, using nr_subpages_unmapped() - but only when first or last pmd mapping finds subpages_mapcount raised (double-map case, not the common case). But are those counts (used to decide when to split an anon THP, and in vmscan's pagecache_reclaimable heuristic) correctly maintained? Not quite: since page_remove_rmap() (and also split_huge_pmd()) is often called without page lock, there can be races when a subpage pte mapcount 0<->1 while compound pmd mapcount 0<->1 is scanning - races which the previous implementation had prevented. The statistics might become inaccurate, and even drift down until they underflow through 0. That is not good enough, but is better dealt with in a followup patch. Update a few comments on first and second tail page overlaid fields. hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but subpages_mapcount and compound_pincount are already correctly at 0, so delete its reinitialization of compound_pincount. A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took 18 seconds on small pages, and used to take 1 second on huge pages, but now takes 119 milliseconds on huge pages. Mapping by pmds a second time used to take 860ms and now takes 92ms; mapping by pmds after mapping by ptes (when the scan is needed) used to take 870ms and now takes 495ms. But there might be some benchmarks which would show a slowdown, because tail struct pages now fall out of cache until final freeing checks them. Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm/hugetlb: unify clearing of RestoreReserve for private pagesPeter Xu
A trivial cleanup to move clearing of RestoreReserve into adding anon rmap of private hugetlb mappings. It matches with the shared mappings where we only clear the bit when adding into page cache, rather than spreading it around the code paths. Link: https://lkml.kernel.org/r/20221020193832.776173-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm: rmap: rename page_not_mapped() to folio_not_mapped()Kefeng Wang
Since commit 2f031c6f042c ("mm/rmap: Convert rmap_walk() to take a folio"), page_not_mapped() takes folio as parameter, rename it to be consistent. Link: https://lkml.kernel.org/r/20220927063826.159590-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm/rmap: fix comment in anon_vma_clone()Ma Wupeng
Commit 2555283eb40d ("mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse") use num_children and num_active_vmas to replace the origin degree to fix anon_vma UAF problem. Update the comment in anon_vma_clone to fit this change. Link: https://lkml.kernel.org/r/20221014013931.1565969-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-10Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-10-03hugetlb: use new vma_lock for pmd sharing synchronizationMike Kravetz
The new hugetlb vma lock is used to address this race: Faulting thread Unsharing thread ... ... ptep = huge_pte_offset() or ptep = huge_pte_alloc() ... i_mmap_lock_write lock page table ptep invalid <------------------------ huge_pmd_unshare() Could be in a previously unlock_page_table sharing process or worse i_mmap_unlock_write ... The vma_lock is used as follows: - During fault processing. The lock is acquired in read mode before doing a page table lock and allocation (huge_pte_alloc). The lock is held until code is finished with the page table entry (ptep). - The lock must be held in write mode whenever huge_pmd_unshare is called. Lock ordering issues come into play when unmapping a page from all vmas mapping the page. The i_mmap_rwsem must be held to search for the vmas, and the vma lock must be held before calling unmap which will call huge_pmd_unshare. This is done today in: - try_to_migrate_one and try_to_unmap_ for page migration and memory error handling. In these routines we 'try' to obtain the vma lock and fail to unmap if unsuccessful. Calling routines already deal with the failure of unmapping. - hugetlb_vmdelete_list for truncation and hole punch. This routine also tries to acquire the vma lock. If it fails, it skips the unmapping. However, we can not have file truncation or hole punch fail because of contention. After hugetlb_vmdelete_list, truncation and hole punch call remove_inode_hugepages. remove_inode_hugepages checks for mapped pages and call hugetlb_unmap_file_page to unmap them. hugetlb_unmap_file_page is designed to drop locks and reacquire in the correct order to guarantee unmap success. Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03hugetlb: add vma based lock for pmd sharingMike Kravetz
Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for synchronization use by vmas that could be involved in pmd sharing. This data structure contains a rw semaphore that is the primary tool used for synchronization. This new structure is ref counted, so that it can exist when NOT attached to a vma. This is only helpful in resolving lock ordering issues where code may need to obtain the vma_lock while there are no guarantees the vma may go away. By obtaining a ref on the structure, it can be guaranteed that at least the rw semaphore will not go away. Only add infrastructure for the new lock here. Actual use will be added in subsequent patches. [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release] Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronizationMike Kravetz
Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization") added code to take i_mmap_rwsem in read mode for the duration of fault processing. However, this has been shown to cause performance/scaling issues. Revert the code and go back to only taking the semaphore in huge_pmd_share during the fault path. Keep the code that takes i_mmap_rwsem in write mode before calling try_to_unmap as this is required if huge_pmd_unshare is called. NOTE: Reverting this code does expose the following race condition. Faulting thread Unsharing thread ... ... ptep = huge_pte_offset() or ptep = huge_pte_alloc() ... i_mmap_lock_write lock page table ptep invalid <------------------------ huge_pmd_unshare() Could be in a previously unlock_page_table sharing process or worse i_mmap_unlock_write ... ptl = huge_pte_lock(ptep) get/update pte set_pte_at(pte, ptep) It is unknown if the above race was ever experienced by a user. It was discovered via code inspection when initially addressed. In subsequent patches, a new synchronization mechanism will be added to coordinate pmd sharing and eliminate this race. Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03rmap: remove page_unlock_anon_vma_read()Matthew Wilcox (Oracle)
This was simply an alias for anon_vma_unlock_read() since 2011. Link: https://lkml.kernel.org/r/20220902194653.1739778-56-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm: convert page_get_anon_vma() to folio_get_anon_vma()Matthew Wilcox (Oracle)
With all callers now passing in a folio, rename the function and convert all callers. Removes a couple of calls to compound_head() and a reference to page->mapping. Link: https://lkml.kernel.org/r/20220902194653.1739778-55-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03rmap: convert page_move_anon_rmap() to use a folioMatthew Wilcox (Oracle)
Removes one call to compound_head() and a reference to page->mapping. Link: https://lkml.kernel.org/r/20220902194653.1739778-50-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm: multi-gen LRU: exploit locality in rmapYu Zhao
Searching the rmap for PTEs mapping each page on an LRU list (to test and clear the accessed bit) can be expensive because pages from different VMAs (PA space) are not cache friendly to the rmap (VA space). For workloads mostly using mapped pages, searching the rmap can incur the highest CPU cost in the reclaim path. This patch exploits spatial locality to reduce the trips into the rmap. When shrink_page_list() walks the rmap and finds a young PTE, a new function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent PTEs. On finding another young PTE, it clears the accessed bit and updates the gen counter of the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[3, 5]% Ops/sec KB/sec patch1-6: 1106168.46 43025.04 patch1-7: 1147696.57 44640.29 Configurations: no change Client benchmark results: kswapd profiles: patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next patch1-7 48.16% lzo1x_1_do_compress (real work) 8.20% page_vma_mapped_walk (overhead) 7.06% _raw_spin_unlock_irq 2.92% ptep_clear_flush 2.53% __zram_bvec_write 2.11% do_raw_spin_lock 2.02% memmove 1.93% lru_gen_look_around 1.56% free_unref_page_list 1.40% memset Configurations: no change Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Barry Song <baohua@kernel.org> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm: remember young/dirty bit for page migrationsPeter Xu
When page migration happens, we always ignore the young/dirty bit settings in the old pgtable, and marking the page as old in the new page table using either pte_mkold() or pmd_mkold(), and keeping the pte clean. That's fine from functional-wise, but that's not friendly to page reclaim because the moving page can be actively accessed within the procedure. Not to mention hardware setting the young bit can bring quite some overhead on some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit. The same slowdown problem to dirty bits when the memory is first written after page migration happened. Actually we can easily remember the A/D bit configuration and recover the information after the page is migrated. To achieve it, define a new set of bits in the migration swap offset field to cache the A/D bits for old pte. Then when removing/recovering the migration entry, we can recover the A/D bits even if the page changed. One thing to mention is that here we used max_swapfile_size() to detect how many swp offset bits we have, and we'll only enable this feature if we know the swp offset is big enough to store both the PFN value and the A/D bits. Otherwise the A/D bits are dropped like before. Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fastDavid Hildenbrand
commit 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") made sure that when PageAnonExclusive() has to be cleared during temporary unmapping of a page, that the PTE is cleared/invalidated and that the TLB is flushed. What we want to achieve in all cases is that we cannot end up with a pin on an anonymous page that may be shared, because such pins would be unreliable and could result in memory corruptions when the mapped page and the pin go out of sync due to a write fault. That TLB flush handling was inspired by an outdated comment in mm/ksm.c:write_protect_page(), which similarly required the TLB flush in the past to synchronize with GUP-fast. However, ever since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer sufficient to handle concurrent GUP-fast in all cases -- it only handles traditional IPI-based GUP-fast correctly. Peter Xu (thankfully) questioned whether that TLB flush is really required. On architectures that send an IPI broadcast on TLB flush, it works as expected. To synchronize with RCU GUP-fast properly, we're conceptually fine, however, we have to enforce a certain memory order and are missing memory barriers. Let's document that, avoid the TLB flush where possible and use proper explicit memory barriers where required. We shouldn't really care about the additional memory barriers here, as we're not on extremely hot paths -- and we're getting rid of some TLB flushes. We use a smp_mb() pair for handling concurrent pinning and a smp_rmb()/smp_wmb() pair for handling the corner case of only temporary PTE changes but permanent PageAnonExclusive changes. One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to convert an exclusive anonymous page to a KSM page, and that page is already mapped write-protected (-> no PTE change) would be: Thread 0 (KSM) Thread 1 (GUP-fast) (B1) Read the PTE # (B2) skipped without FOLL_WRITE (A1) Clear PTE smp_mb() (A2) Check pinned (B3) Pin the mapped page smp_mb() (A3) Clear PageAnonExclusive smp_wmb() (A4) Restore PTE (B4) Check if the PTE changed smp_rmb() (B5) Check PageAnonExclusive Thread 1 will properly detect that PageAnonExclusive was cleared and back off. Note that we don't need a memory barrier between checking if the page is pinned and clearing PageAnonExclusive, because stores are not speculated. The possible issues due to reordering are of theoretical nature so far and attempts to reproduce the race failed. Especially the "no PTE change" case isn't the common case, because we'd need an exclusive anonymous page that's mapped R/O and the PTE is clean in KSM code -- and using KSM with page pinning isn't extremely common. Further, the clear+TLB flush we used for now implies a memory barrier. So the problematic missing part should be the missing memory barrier after pinning but before checking if the PTE changed. Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Nadav Amit <namit@vmware.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Christoph von Recklinghausen <crecklin@redhat.com> Cc: Don Dutile <ddutile@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepageZach O'Keefe
When scanning an anon pmd to see if it's eligible for collapse, return SCAN_PMD_MAPPED if the pmd already maps a hugepage. Note that SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the file-collapse path, since the latter might identify pte-mapped compound pages. This is required by MADV_COLLAPSE which necessarily needs to know what hugepage-aligned/sized regions are already pmd-mapped. In order to determine if a pmd already maps a hugepage, refactor mm_find_pmd(): Return mm_find_pmd() to it's pre-commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy race with THP fault") behavior. ksm was the only caller that explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic there (pmd_present() and pmd_trans_huge() checks). Undo revert change in commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy race with THP fault") that open-coded split_huge_pmd_address() pmd lookup and use mm_find_pmd() instead. Link: https://lkml.kernel.org/r/20220706235936.2197195-9-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-31mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuseJann Horn
anon_vma->degree tracks the combined number of child anon_vmas and VMAs that use the anon_vma as their ->anon_vma. anon_vma_clone() then assumes that for any anon_vma attached to src->anon_vma_chain other than src->anon_vma, it is impossible for it to be a leaf node of the VMA tree, meaning that for such VMAs ->degree is elevated by 1 because of a child anon_vma, meaning that if ->degree equals 1 there are no VMAs that use the anon_vma as their ->anon_vma. This assumption is wrong because the ->degree optimization leads to leaf nodes being abandoned on anon_vma_clone() - an existing anon_vma is reused and no new parent-child relationship is created. So it is possible to reuse an anon_vma for one VMA while it is still tied to another VMA. This is an issue because is_mergeable_anon_vma() and its callers assume that if two VMAs have the same ->anon_vma, the list of anon_vmas attached to the VMAs is guaranteed to be the same. When this assumption is violated, vma_merge() can merge pages into a VMA that is not attached to the corresponding anon_vma, leading to dangling page->mapping pointers that will be dereferenced during rmap walks. Fix it by separately tracking the number of child anon_vmas and the number of VMAs using the anon_vma as their ->anon_vma. Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Cc: stable@kernel.org Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-05Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
2022-07-17hugetlb: do not update address in huge_pmd_unshareMike Kravetz
As an optimization for loops sequentially processing hugetlb address ranges, huge_pmd_unshare would update a passed address if it unshared a pmd. Updating a loop control variable outside the loop like this is generally a bad idea. These loops are now using hugetlb_mask_last_page to optimize scanning when non-present ptes are discovered. The same can be done when huge_pmd_unshare returns 1 indicating a pmd was unshared. Remove address update from huge_pmd_unshare. Change the passed argument type and update all callers. In loops sequentially processing addresses use hugetlb_mask_last_page to update address if pmd is unshared. [sfr@canb.auug.org.au: fix an unused variable warning/error] Link: https://lkml.kernel.org/r/20220622171117.70850960@canb.auug.org.au Link: https://lkml.kernel.org/r/20220621235620.291305-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Will Deacon <will@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: add zone device coherent type memory supportAlex Sierra
Device memory that is cache coherent from device and CPU point of view. This is used on platforms that have an advanced system bus (like CAPI or CXL). Any page of a process can be migrated to such memory. However, no one should be allowed to pin such memory so that it can always be evicted. [hch@lst.de: rebased ontop of the refcount changes, remove is_dev_private_or_coherent_page] Link: https://lkml.kernel.org/r/20220715150521.18165-4-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm: hugetlb: kill set_huge_swap_pte_at()Qi Zheng
Commit e5251fd43007 ("mm/hugetlb: introduce set_huge_swap_pte_at() helper") add set_huge_swap_pte_at() to handle swap entries on architectures that support hugepages consisting of contiguous ptes. And currently the set_huge_swap_pte_at() is only overridden by arm64. set_huge_swap_pte_at() provide a sz parameter to help determine the number of entries to be updated. But in fact, all hugetlb swap entries contain pfn information, so we can find the corresponding folio through the pfn recorded in the swap entry, then the folio_size() is the number of entries that need to be updated. And considering that users will easily cause bugs by ignoring the difference between set_huge_swap_pte_at() and set_huge_pte_at(). Let's handle swap entries in set_huge_pte_at() and remove the set_huge_swap_pte_at(), then we can call set_huge_pte_at() anywhere, which simplifies our coding. Link: https://lkml.kernel.org/r/20220626145717.53572-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm: rmap: simplify the hugetlb handling when unmapping or migrationBaolin Wang
According to previous discussion [1], there are so many levels of indenting to handle the hugetlb case when unmapping or migration. We can combine folio_test_anon() and huge_pmd_unshare() to save one level of indenting, by adding a local variable and moving the VM_BUG_ON() a little forward. No intended functional changes in this patch. [1] https://lore.kernel.org/all/0b986dc4-5843-3e2d-c2df-5a2e9f13e6ab@oracle.com/ Link: https://lkml.kernel.org/r/28414b1b96f095e838c1e548074f8e0fc70d78cf.1655724713.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm/rmap: fix dereferencing invalid subpage pointer in try_to_migrate_one()David Hildenbrand
The subpage we calculate is an invalid pointer for device private pages, because device private pages are mapped via non-present device private entries, not ordinary present PTEs. Let's just not compute broken pointers and fixup later. Move the proper assignment of the correct subpage to the beginning of the function and assert that we really only have a single page in our folio. This currently results in a BUG when tying to compute anon_exclusive, because: [ 528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0 [ 528.739585] #PF: supervisor read access in kernel mode [ 528.745324] #PF: error_code(0x0000) - not-present page [ 528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0 [ 528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257 [ 528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021 [ 528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000 [ 528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29 c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74 0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02 [ 528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202 [ 528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0 [ 528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8 [ 528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000 [ 528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540 [ 528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff [ 528.850865] FS: 00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000 [ 528.859891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0 [ 528.874275] PKRU: 55555554 [ 528.877286] Call Trace: [ 528.880016] <TASK> [ 528.882356] ? lock_is_held_type+0xdf/0x130 [ 528.887033] rmap_walk_anon+0x167/0x410 [ 528.891316] try_to_migrate+0x90/0xd0 [ 528.895405] ? try_to_unmap_one+0xe10/0xe10 [ 528.900074] ? anon_vma_ctor+0x50/0x50 [ 528.904260] ? put_anon_vma+0x10/0x10 [ 528.908347] ? invalid_mkclean_vma+0x20/0x20 [ 528.913114] migrate_vma_setup+0x5f4/0x750 [ 528.917691] dmirror_devmem_fault+0x8c/0x250 [test_hmm] [ 528.923532] do_swap_page+0xac0/0xe50 [ 528.927623] ? __lock_acquire+0x4b2/0x1ac0 [ 528.932199] __handle_mm_fault+0x949/0x1440 [ 528.936876] handle_mm_fault+0x13f/0x3e0 [ 528.941256] do_user_addr_fault+0x215/0x740 [ 528.945928] exc_page_fault+0x75/0x280 [ 528.950115] asm_exc_page_fault+0x27/0x30 [ 528.954593] RIP: 0033:0x40366b ... Link: https://lkml.kernel.org/r/20220623205332.319257-1-david@redhat.com Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-27docs: rename Documentation/vm to Documentation/mmMike Rapoport
so it will be consistent with code mm directory and with Documentation/admin-guide/mm and won't be confused with virtual machines. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Jonathan Corbet <corbet@lwn.net> Acked-by: Wu XiangCheng <bobwxc@email.cn>
2022-05-19mm: don't be stuck to rmap lock on reclaim pathMinchan Kim
The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended under memory pressure if processes keep working on their vmas(e.g., fork, mmap, munmap). It makes reclaim path stuck. In our real workload traces, we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it makes other processes entering direct reclaim, which were also stuck on the lock. This patch makes lru aging path try_lock mode like shink_page_list so the reclaim context will keep working with next lru pages without being stuck. if it found the rmap lock contended, it rotates the page back to head of lru in both active/inactive lrus to make them consistent behavior, which is basic starting point rather than adding more heristic. Since this patch introduces a new "contended" field as out-param along with try_lock in-param in rmap_walk_control, it's not immutable any longer if the try_lock is set so remove const keywords on rmap related functions. Since rmap walking is already expensive operation, I doubt the const would help sizable benefit( And we didn't have it until 5.17). In a heavy app workload in Android, trace shows following statistics. It almost removes rmap lock contention from reclaim path. Martin Liu reported: Before: max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function 1632 0 1631 151.542173 31672 209 page_lock_anon_vma_read 601 0 601 145.544681 28817 198 rmap_walk_file After: max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function NaN NaN NaN NaN NaN 0.0 NaN 0 0 0 0.127645 1 12 rmap_walk_file [minchan@kernel.org: add comment, per Matthew] Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: John Dias <joaodias@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: rmap: fix CONT-PTE/PMD size hugetlb issue when unmappingBaolin Wang
On some architectures (like ARM64), it can support CONT-PTE/PMD size hugetlb, which means it can support not only PMD/PUD size hugetlb: 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page size specified. When unmapping a hugetlb page, we will get the relevant page table entry by huge_pte_offset() only once to nuke it. This is correct for PMD or PUD size hugetlb, since they always contain only one pmd entry or pud entry in the page table. However this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since they can contain several continuous pte or pmd entry with same page table attributes, so we will nuke only one pte or pmd entry for this CONT-PTE/PMD size hugetlb page. And now try_to_unmap() is only passed a hugetlb page in the case where the hugetlb page is poisoned. Which means now we will unmap only one pte entry for a CONT-PTE or CONT-PMD size poisoned hugetlb page, and we can still access other subpages of a CONT-PTE or CONT-PMD size poisoned hugetlb page, which will cause serious issues possibly. So we should change to use huge_ptep_clear_flush() to nuke the hugetlb page table to fix this issue, which already considered CONT-PTE and CONT-PMD size hugetlb. We've already used set_huge_swap_pte_at() to set a poisoned swap entry for a poisoned hugetlb page. Meanwhile adding a VM_BUG_ON() to make sure the passed hugetlb page is poisoned in try_to_unmap(). Link: https://lkml.kernel.org/r/0a2e547238cad5bc153a85c3e9658cb9d55f9cac.1652270205.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/730ea4b6d292f32fb10b7a4e87dad49b0eb30474.1652147571.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Rich Felker <dalias@libc.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: rmap: fix CONT-PTE/PMD size hugetlb issue when migrationBaolin Wang
On some architectures (like ARM64), it can support CONT-PTE/PMD size hugetlb, which means it can support not only PMD/PUD size hugetlb: 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page size specified. When migrating a hugetlb page, we will get the relevant page table entry by huge_pte_offset() only once to nuke it and remap it with a migration pte entry. This is correct for PMD or PUD size hugetlb, since they always contain only one pmd entry or pud entry in the page table. However this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since they can contain several continuous pte or pmd entry with same page table attributes. So we will nuke or remap only one pte or pmd entry for this CONT-PTE/PMD size hugetlb page, which is not expected for hugetlb migration. The problem is we can still continue to modify the subpages' data of a hugetlb page during migrating a hugetlb page, which can cause a serious data consistent issue, since we did not nuke the page table entry and set a migration pte for the subpages of a hugetlb page. To fix this issue, we should change to use huge_ptep_clear_flush() to nuke a hugetlb page table, and remap it with set_huge_pte_at() and set_huge_swap_pte_at() when migrating a hugetlb page, which already considered the CONT-PTE or CONT-PMD size hugetlb. [akpm@linux-foundation.org: fix nommu build] [baolin.wang@linux.alibaba.com: fix build errors for !CONFIG_MMU] Link: https://lkml.kernel.org/r/a4baca670aca637e7198d9ae4543b8873cb224dc.1652270205.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/ea5abf529f0997b5430961012bfda6166c1efc8c.1652147571.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Rich Felker <dalias@libc.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/shmem: persist uffd-wp bit across zapping for file-backedPeter Xu
File-backed memory is prone to being unmapped at any time. It means all information in the pte will be dropped, including the uffd-wp flag. To persist the uffd-wp flag, we'll use the pte markers. This patch teaches the zap code to understand uffd-wp and know when to keep or drop the uffd-wp bit. Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we don't want to persist such an information, for example, when destroying the whole vma, or punching a hole in a shmem file. For the rest cases we should never drop the uffd-wp bit, or the wr-protect information will get lost. The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than memory.c because it'll be further referenced in hugetlb files later. Link: https://lkml.kernel.org/r/20220405014847.14295-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/rmap: Fix typos in commentsAdrian Huang
Fix spelling/grammar mistakes in comments. Link: https://lkml.kernel.org/r/20220428061522.666-1-adrianhuang0701@gmail.com Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: rmap: use flush_cache_range() to flush cache for hugetlb pagesBaolin Wang
Now we will use flush_cache_page() to flush cache for anonymous hugetlb pages when unmapping or migrating a hugetlb page mapping, but the flush_cache_page() only handles a PAGE_SIZE range on some architectures (like arm32, arc and so on), which will cause potential cache issues. Thus change to use flush_cache_range() to cover the whole size of a hugetlb page. Link: https://lkml.kernel.org/r/dc903b378d1e2d26bbbe85409ab9d009631f175c.1651056365.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: rmap: move the cache flushing to the correct place for hugetlb PMD sharingBaolin Wang
The cache level flush will always be first when changing an existing virtual–>physical mapping to a new value, since this allows us to properly handle systems whose caches are strict and require a virtual–>physical translation to exist for a virtual address. So we should move the cache flushing before huge_pmd_unshare(). As Muchun pointed out[1], now the architectures whose supporting hugetlb PMD sharing have no cache flush issues in practice. But I think we should still follow the cache/TLB flushing rules when changing a valid virtual address mapping in case of potential issues in future. [1] https://lore.kernel.org/all/YmT%2F%2FhuUbFX+KHcy@FVFYT0MHHV2J.usts.net/ Link: https://lkml.kernel.org/r/4f7ae6dfdc838ab71e1655188b657c032ff1f28f.1651056365.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09mm/swap: remember PG_anon_exclusive via a swp pte bitDavid Hildenbrand
Patch series "mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages", v2. This series fixes memory corruptions when a GUP R/W reference (FOLL_WRITE | FOLL_GET) was taken on an anonymous page and COW logic fails to detect exclusivity of the page to then replacing the anonymous page by a copy in the page table: The GUP reference lost synchronicity with the pages mapped into the page tables. This series focuses on x86, arm64, s390x and ppc64/book3s -- other architectures are fairly easy to support by implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE. This primarily fixes the O_DIRECT memory corruptions that can happen on concurrent swapout, whereby we lose DMA reads to a page (modifying the user page by writing to it). O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM) DMA from/to a user page. In the long run, we want to convert it to properly use FOLL_PIN, and John is working on it, but that might take a while and might not be easy to backport. In the meantime, let's restore what used to work before we started modifying our COW logic: make R/W FOLL_GET references reliable as long as there is no fork() after GUP involved. This is just the natural follow-up of part 2, that will also further reduce "wrong COW" on the swapin path, for example, when we cannot remove a page from the swapcache due to concurrent writeback, or if we have two threads faulting on the same swapped-out page. Fixing O_DIRECT is just a nice side-product This issue, including other related COW issues, has been summarized in [3] under 2): " 2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET) It was discovered that we can create a memory corruption by reading a file via O_DIRECT to a part (e.g., first 512 bytes) of a page, concurrently writing to an unrelated part (e.g., last byte) of the same page, and concurrently write-protecting the page via clear_refs SOFTDIRTY tracking [6]. For the reproducer, the issue is that O_DIRECT grabs a reference of the target page (via FOLL_GET) and clear_refs write-protects the relevant page table entry. On successive write access to the page from the process itself, we wrongly COW the page when resolving the write fault, resulting in a loss of synchronicity and consequently a memory corruption. While some people might think that using clear_refs in this combination is a corner cases, it turns out to be a more generic problem unfortunately. For example, it was just recently discovered that we can similarly create a memory corruption without clear_refs, simply by concurrently swapping out the buffer pages [7]. Note that we nowadays even use the swap infrastructure in Linux without an actual swap disk/partition: the prime example is zram which is enabled as default under Fedora [10]. The root issue is that a write-fault on a page that has additional references results in a COW and thereby a loss of synchronicity and consequently a memory corruption if two parties believe they are referencing the same page. " We don't particularly care about R/O FOLL_GET references: they were never reliable and O_DIRECT doesn't expect to observe modifications from a page after DMA was started. Note that: * this only fixes the issue on x86, arm64, s390x and ppc64/book3s ("enterprise architectures"). Other architectures have to implement __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same. * this does *not * consider any kind of fork() after taking the reference: fork() after GUP never worked reliably with FOLL_GET. * Not losing PG_anon_exclusive during swapout was the last remaining piece. KSM already makes sure that there are no other references on a page before considering it for sharing. Page migration maintains PG_anon_exclusive and simply fails when there are additional references (freezing the refcount fails). Only swapout code dropped the PG_anon_exclusive flag because it requires more work to remember + restore it. With this series in place, most COW issues of [3] are fixed on said architectures. Other architectures can implement __HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily. [1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com This patch (of 8): Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about it. We do this, to keep fork() logic on swap entries easy and efficient: for example, if we wouldn't clear it when unmapping, we'd have to lookup the page in the swapcache for each and every swap entry during fork() and clear PG_anon_exclusive if set. Instead, we want to store that information directly in the swap pte, protected by the page table lock, similarly to how we handle SWP_MIGRATION_READ_EXCLUSIVE for migration entries. However, for actual swap entries, we don't want to mess with the swap type (e.g., still one bit) because it overcomplicates swap code. In try_to_unmap(), we already reject to unmap in case the page might be pinned, because we must not lose PG_anon_exclusive on pinned pages ever. Checking if there are other unexpected references reliably *before* completely unmapping a page is unfortunately not really possible: THP heavily overcomplicate the situation. Once fully unmapped it's easier -- we, for example, make sure that there are no unexpected references *after* unmapping a page before starting writeback on that page. So, we currently might end up unmapping a page and clearing PG_anon_exclusive if that page has additional references, for example, due to a FOLL_GET. do_swap_page() has to re-determine if a page is exclusive, which will easily fail if there are other references on a page, most prominently GUP references via FOLL_GET. This can currently result in memory corruptions when taking a FOLL_GET | FOLL_WRITE reference on a page even when fork() is never involved: try_to_unmap() will succeed, and when refaulting the page, it cannot be marked exclusive and will get replaced by a copy in the page tables on the next write access, resulting in writes via the GUP reference to the page being lost. In an ideal world, everybody that uses GUP and wants to modify page content, such as O_DIRECT, would properly use FOLL_PIN. However, that conversion will take a while. It's easier to fix what used to work in the past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive. In addition, by remembering PG_anon_exclusive we can further reduce unnecessary COW in some cases, so it's the natural thing to do. So let's transfer the PG_anon_exclusive information to the swap pte and store it via an architecture-dependant pte bit; use that information when restoring the swap pte in do_swap_page() and unuse_pte(). During fork(), we simply have to clear the pte bit and are done. Of course, there is one corner case to handle: swap backends that don't support concurrent page modifications while the page is under writeback. Special case these, and drop the exclusive marker. Add a comment why that is just fine (also, reuse_swap_page() would have done the same in the past). In the future, we'll hopefully have all architectures support __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty stubs and the define completely. Then, we can also convert SWP_MIGRATION_READ_EXCLUSIVE. For architectures it's fairly easy to support: either simply use a yet unused pte bit that can be used for swap entries, steal one from the arch type bits if they exceed 5, or steal one from the offset bits. Note: R/O FOLL_GET references were never really reliable, especially when taking one on a shared page and then writing to the page (e.g., GUP after fork()). FOLL_GET, including R/W references, were never really reliable once fork was involved (e.g., GUP before fork(), GUP during fork()). KSM steps back in case it stumbles over unexpected references and is, therefore, fine. [david@redhat.com: fix SWP_STABLE_WRITES test] Link: https://lkml.kernel.org/r/ac725bcb-313a-4fff-250a-68ba9a8f85fb@redhat.comLink: https://lkml.kernel.org/r/20220329164329.208407-1-david@redhat.com Link: https://lkml.kernel.org/r/20220329164329.208407-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>