summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2023-06-19mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()Hugh Dickins
vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()), but it's intended for userspace page tables: prefer pte_offset_kernel(). Link: https://lkml.kernel.org/r/696386a-84f8-b33c-82e5-f865ed6eb39@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/vmwgfx: simplify pmd & pud mapping dirty helpersHugh Dickins
wp_clean_pmd_entry() need not check pmd_trans_unstable() or pmd_none(), wp_clean_pud_entry() need not check pud_trans_unstable() or pud_none(): it's just the ACTION_CONTINUE when trans_huge or devmap that's needed to prevent splitting, and we're hoping to remove pmd_trans_unstable(). Is that PUD #ifdef necessary? Maybe some configs are missing a stub. Link: https://lkml.kernel.org/r/d3379c7-65db-26d3-1764-8e866490925f@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/pagewalk: walk_pte_range() allow for pte_offset_map()Hugh Dickins
walk_pte_range() has a no_vma option to serve walk_page_range_novma(). I don't know of any problem, but it looks safer to check for init_mm, and use pte_offset_kernel() rather than pte_offset_map() in that case: pte_offset_map()'s pmdval validation is intended for userspace. Allow for its pte_offset_map() or pte_offset_map_lock() to fail, and retry with ACTION_AGAIN if so. Add a second check for ACTION_AGAIN in walk_pmd_range(), to catch it after return from walk_pte_range(). Remove the pmd_trans_unstable() check after split_huge_pmd() in walk_pmd_range(): walk_pte_range() now handles those cases safely (and they must fail powerpc's is_hugepd() check). Link: https://lkml.kernel.org/r/3eba6f0-2b-fb66-6bb6-2ee8533e221@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() failsHugh Dickins
Simple walk_page_range() users should set ACTION_AGAIN to retry when pte_offset_map_lock() fails. No need to check pmd_trans_unstable(): that was precisely to avoid the possiblity of calling pte_offset_map() on a racily removed or inserted THP entry, but such cases are now safely handled inside it. Likewise there is no need to check pmd_none() or pmd_bad() before calling it. Link: https://lkml.kernel.org/r/c77d9d10-3aad-e3ce-4896-99e91c7947f3@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: SeongJae Park <sj@kernel.org> for mm/damon part Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()Hugh Dickins
map_pte() use pte_offset_map_nolock(), to make sure of the ptl belonging to pte, even if pmd entry is then changed racily: page_vma_mapped_walk() use that instead of getting pte_lockptr() later, or restart if map_pte() found no page table. Link: https://lkml.kernel.org/r/cba186e0-5ed7-e81b-6cd-dade4c33c248@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/page_vma_mapped: reformat map_pte() with less indentationHugh Dickins
No functional change here, but adjust the format of map_pte() so that the following commit will be easier to read: separate out the PVMW_SYNC case first, and remove two levels of indentation from the ZONE_DEVICE case. Link: https://lkml.kernel.org/r/bf723f59-e3fc-6839-1cc3-c0631ee248bc@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/page_vma_mapped: delete bogosity in page_vma_mapped_walk()Hugh Dickins
Revert commit a7a69d8ba88d ("mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()"): I was proud of that "Aha!" commit at the time, but in revisiting page_vma_mapped_walk() for pte_offset_map() failure, that block raised a doubt: and it now seems utterly bogus. The prior map_pte() has taken ptl unconditionally when PVMW_SYNC: I must have forgotten that when making the change. It did no harm, but could not have fixed a BUG or WARN, and is hard to reconcile with coming changes. Link: https://lkml.kernel.org/r/87475a22-e59e-2d8b-d78a-df376d314bd@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/filemap: allow pte_offset_map_lock() to failHugh Dickins
filemap_map_pages() allow pte_offset_map_lock() to fail; and remove the pmd_devmap_trans_unstable() check from filemap_map_pmd(), which can safely return to filemap_map_pages() and let pte_offset_map_lock() discover that. Link: https://lkml.kernel.org/r/54607cf4-ddb6-7ef3-043-1d2de1a9a71@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/pgtable: allow pte_offset_map[_lock]() to failHugh Dickins
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for __pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c. __pte_offset_map() do pmdval validation (including pmd_clear_bad() when pmd_bad()), returning NULL if pmdval is not for a page table. __pte_offset_map_lock() verify pmdval unchanged after getting the lock, trying again if it changed. No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to cover the imminent case, but we expect to generalize it later, and it makes a mess of where to do the pmd_bad() clearing. Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(), without actually taking the lock. This will be preferred to open uses of pte_lockptr(), because (when split ptlock is in page table's struct page) it points to the right lock for the returned pte pointer, even if *pmd gets changed racily afterwards. Update corresponding Documentation. Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet: they have to wait until all architectures are balancing pte_offset_map()s with pte_unmap()s (as in the arch series posted earlier). But comment where they will go, so that it's easy to add them for experiments. And only when those are in place can transient racy failure cases be enabled. Add more safety for the PAE mismatched pmd_low pmd_high case at that time. Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/migrate: remove cruft from migration_entry_wait()sHugh Dickins
migration_entry_wait_on_locked() does not need to take a mapped pte pointer, its callers can do the unmap first. Annotate it with __releases(ptl) to reduce sparse warnings. Fold __migration_entry_wait_huge() into migration_entry_wait_huge(). Fold __migration_entry_wait() into migration_entry_wait(), preferring the tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr(). Link: https://lkml.kernel.org/r/b0e2a532-cdf2-561b-e999-f3b13b8d6d3@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm: use pmdp_get_lockless() without surplus barrier()Hugh Dickins
Patch series "mm: allow pte_offset_map[_lock]() to fail", v2. What is it all about? Some mmap_lock avoidance i.e. latency reduction. Initially just for the case of collapsing shmem or file pages to THPs; but likely to be relied upon later in other contexts e.g. freeing of empty page tables (but that's not work I'm doing). mmap_write_lock avoidance when collapsing to anon THPs? Perhaps, but again that's not work I've done: a quick attempt was not as easy as the shmem/file case. I would much prefer not to have to make these small but wide-ranging changes for such a niche case; but failed to find another way, and have heard that shmem MADV_COLLAPSE's usefulness is being limited by that mmap_write_lock it currently requires. These changes (though of course not these exact patches) have been in Google's data centre kernel for three years now: we do rely upon them. What is this preparatory series about? The current mmap locking will not be enough to guard against that tricky transition between pmd entry pointing to page table, and empty pmd entry, and pmd entry pointing to huge page: pte_offset_map() will have to validate the pmd entry for itself, returning NULL if no page table is there. What to do about that varies: sometimes nearby error handling indicates just to skip it; but in many cases an ACTION_AGAIN or "goto again" is appropriate (and if that risks an infinite loop, then there must have been an oops, or pfn 0 mistaken for page table, before). Given the likely extension to freeing empty page tables, I have not limited this set of changes to a THP config; and it has been easier, and sets a better example, if each site is given appropriate handling: even where deeper study might prove that failure could only happen if the pmd table were corrupted. Several of the patches are, or include, cleanup on the way; and by the end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and pte_offset_map_lock() then handle those original races and more. Most uses of pte_lockptr() are deprecated, with pte_offset_map_nolock() taking its place. This patch (of 32): Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more reliable result with PAE (or READ_ONCE as before without PAE); and remove the unnecessary extra barrier()s which got left behind in its callers. HOWEVER: Note the small print in linux/pgtable.h, where it was designed specifically for fast GUP, and depends on interrupts being disabled for its full guarantee: most callers which have been added (here and before) do NOT have interrupts disabled, so there is still some need for caution. Link: https://lkml.kernel.org/r/f35279a9-9ac0-de22-d245-591afbfb4dc@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/vmalloc: replace the ternary conditional operator with min()Lu Hongfei
It would be better to replace the traditional ternary conditional operator with min() in zero_iter Link: https://lkml.kernel.org/r/20230609093057.27777-1-luhongfei@vivo.com Signed-off-by: Lu Hongfei <luhongfei@vivo.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/folio: avoid special handling for order value 0 in folio_set_orderTarun Sahu
folio_set_order(folio, 0) is used in kernel at two places __destroy_compound_gigantic_folio and __prep_compound_gigantic_folio. Currently, It is called to clear out the folio->_folio_nr_pages and folio->_folio_order. For __destroy_compound_gigantic_folio: In past, folio_set_order(folio, 0) was needed because page->mapping used to overlap with _folio_nr_pages and _folio_order. So if these fields were left uncleared during freeing gigantic hugepages, they were causing "BUG: bad page state" due to non-zero page->mapping. Now, After Commit a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to CMA") page->mapping has explicitly been cleared out for tail pages. Also, _folio_order and _folio_nr_pages no longer overlaps with page->mapping. So, folio_set_order(folio, 0) can be removed from freeing gigantic folio path (__destroy_compound_gigantic_folio). Another place, folio_set_order(folio, 0) is called inside __prep_compound_gigantic_folio during error path. Here, folio_set_order(folio, 0) can also be removed if we move folio_set_order(folio, order) after for loop. The patch also moves _folio_set_head call in __prep_compound_gigantic_folio() such that we avoid clearing them in the error path. Also, as Mike pointed out: "It would actually be better to move the calls _folio_set_head and folio_set_order in __prep_compound_gigantic_folio() as suggested here. Why? In the current code, the ref count on the 'head page' is still 1 (or more) while those calls are made. So, someone could take a speculative ref on the page BEFORE the tail pages are set up." This way, folio_set_order(folio, 0) is no more needed. And it will also helps removing the confusion of folio order being set to 0 (as _folio_order field is part of first tail page). Testing: I have run LTP tests, which all passes. and also I have written the test in LTP which tests the bug caused by compound_nr and page->mapping overlapping. https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/hugetlb/hugemmap/hugemmap32.c Running on older kernel ( < 5.10-rc7) with the above bug this fails while on newer kernel and, also with this patch it passes. Link: https://lkml.kernel.org/r/20230609162907.111756-1-tsahu@linux.ibm.com Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19vmstat: skip periodic vmstat update for isolated CPUsMarcelo Tosatti
Problem: The interruption caused by vmstat_update is undesirable for certain applications. With workloads that are running on isolated cpus with nohz full mode to shield off any kernel interruption. For example, a VM running a time sensitive application with a 50us maximum acceptable interruption (use case: soft PLC). oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... The example above shows an additional 7us for the oslat -> kworker -> oslat switches. In the case of a virtualized CPU, and the vmstat_update interruption in the host (of a qemu-kvm vcpu), the latency penalty observed in the guest is higher than 50us, violating the acceptable latency threshold. The isolated vCPU can perform operations that modify per-CPU page counters, for example to complete I/O operations: CPU 11/KVM-9540 [001] dNh1. 2314.248584: mod_zone_page_state <-__folio_end_writeback CPU 11/KVM-9540 [001] dNh1. 2314.248585: <stack trace> => 0xffffffffc042b083 => mod_zone_page_state => __folio_end_writeback => folio_end_writeback => iomap_finish_ioend => blk_mq_end_request_batch => nvme_irq => __handle_irq_event_percpu => handle_irq_event => handle_edge_irq => __common_interrupt => common_interrupt => asm_common_interrupt => vmx_do_interrupt_nmi_irqoff => vmx_handle_exit_irqoff => vcpu_enter_guest => vcpu_run => kvm_arch_vcpu_ioctl_run => kvm_vcpu_ioctl => __x64_sys_ioctl => do_syscall_64 => entry_SYSCALL_64_after_hwframe In kernel users of vmstat counters either require the precise value and they are using zone_page_state_snapshot interface or they can live with an imprecision as the regular flushing can happen at arbitrary time and cumulative error can grow (see calculate_normal_threshold). From that POV the regular flushing can be postponed for CPUs that have been isolated from the kernel interference without critical infrastructure ever noticing. Skip regular flushing from vmstat_shepherd for all isolated CPUs to avoid interference with the isolated workload. Suggested by Michal Hocko. Link: https://lkml.kernel.org/r/ZIDoV/zxFKVmQl7W@tpad Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm: zswap: support exclusive loadsYosry Ahmed
Commit 71024cb4a0bf ("frontswap: remove frontswap_tmem_exclusive_gets") removed support for exclusive loads from frontswap as it was not used. Bring back exclusive loads support to frontswap by adding an "exclusive" output parameter to frontswap_ops->load. On the zswap side, add a module parameter to enable/disable exclusive loads, and a config option to control the boot default value. Refactor zswap entry invalidation in zswap_frontswap_invalidate_page() into zswap_invalidate_entry() to reuse it in zswap_frontswap_load() if exclusive loads are enabled. With exclusive loads, we avoid having two copies of the same page in memory (compressed & uncompressed) after faulting it in from zswap. On the other hand, if the page is to be reclaimed again without being dirtied, it will be re-compressed. Compression is not usually slow, and a page that was just faulted in is less likely to be reclaimed again soon. Link: https://lkml.kernel.org/r/20230607195143.1473802-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Yu Zhao <yuzhao@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/mm_init.c: remove reset_node_present_pages()Haifeng Xu
reset_node_present_pages() only get called in hotadd_init_pgdat(), move the action that clear present pages to free_area_init_core_hotplug(), so the helper can be removed. Link: https://lkml.kernel.org/r/20230607025056.1348-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/memory_hotplug: remove reset_node_managed_pages() in hotadd_init_pgdat()Haifeng Xu
managed pages has already been set to 0 in free_area_init_core_hotplug(), via zone_init_internals() on each zone. It's pointless to reset again. Furthermore, reset_node_managed_pages() no longer needs to be exposed outside of mm/memblock.c. Remove declaration in include/linux/memblock.h and define it as static. In addtion to this, the only caller of reset_node_managed_pages() is reset_all_zones_managed_pages(), which is annotated with __init, so it should be safe to also mark reset_node_managed_pages() as __init. Link: https://lkml.kernel.org/r/20230607024548.1240-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19shmem: use ramfs_kill_sb() for kill_sb method of ramfs-based tmpfsRoberto Sassu
As the ramfs-based tmpfs uses ramfs_init_fs_context() for the init_fs_context method, which allocates fc->s_fs_info, use ramfs_kill_sb() to free it and avoid a memory leak. Link: https://lkml.kernel.org/r/20230607161523.2876433-1-roberto.sassu@huaweicloud.com Fixes: c3b1b1cbf002 ("ramfs: add support for "mode=" mount option") Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/mm_init.c: drop 'nid' parameter from check_for_memory()Haifeng Xu
The node_id in pgdat has already been set in free_area_init_node(), so use it internally instead of passing a redundant parameter. Link: https://lkml.kernel.org/r/20230607032402.4679-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/sparse: remove unused parameters in sparse_remove_section()Yajun Deng
These parameters ms and map_offset are not used in sparse_remove_section(), so remove them. The __remove_section() is only called by __remove_pages(), remove it. And put the WARN_ON_ONCE() in sparse_remove_section(). Link: https://lkml.kernel.org/r/20230607023952.2247489-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/hugetlb: use a folio in hugetlb_fault()ZhangPeng
We can replace seven implicit calls to compound_head() with one by using folio. [akpm@linux-foundation.org: update comment, per Sidhartha] Link: https://lkml.kernel.org/r/20230606062013.2947002-4-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/hugetlb: use a folio in hugetlb_wp()ZhangPeng
We can replace nine implict calls to compound_head() with one by using old_folio. The page we get back is always a head page, so we just convert old_page to old_folio. Link: https://lkml.kernel.org/r/20230606062013.2947002-3-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/hugetlb: use a folio in copy_hugetlb_page_range()ZhangPeng
Patch series "Convert several functions in hugetlb.c to use a folio", v2. This patch series converts three functions in hugetlb.c to use a folio, which can remove several implicit calls to compound_head(). This patch (of 3): We can replace five implict calls to compound_head() with one by using pte_folio. The page we get back is always a head page, so we just convert ptepage to pte_folio. Link: https://lkml.kernel.org/r/20230606062013.2947002-1-zhangpeng362@huawei.com Link: https://lkml.kernel.org/r/20230606062013.2947002-2-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19memory tier: remove unneeded disable_all_demotion_targets() when ↵Miaohe Lin
!CONFIG_MIGRATION There's no caller of disable_all_demotion_targets() when CONFIG_MIGRATION is disabled. Remove it. Link: https://lkml.kernel.org/r/20230606120724.208552-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm: vmscan: mark kswapd_run() and kswapd_stop() __meminitMiaohe Lin
Add __meminit to kswapd_run() and kswapd_stop() to ensure they're default to __init when memory hotplug is not enabled. Link: https://lkml.kernel.org/r/20230606121813.242163-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Yu Zhao <yuzhao@google.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm: skip CMA pages when they are not availableZhaoyang Huang
This patch fixes unproductive reclaiming of CMA pages by skipping them when they are not available for current context. It arises from the below OOM issue, which was caused by a large proportion of MIGRATE_CMA pages among free pages. [ 36.172486] [03-19 10:05:52.172] ActivityManager: page allocation failure: order:0, mode:0xc00(GFP_NOIO), nodemask=(null),cpuset=foreground,mems_allowed=0 [ 36.189447] [03-19 10:05:52.189] DMA32: 0*4kB 447*8kB (C) 217*16kB (C) 124*32kB (C) 136*64kB (C) 70*128kB (C) 22*256kB (C) 3*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 35848kB [ 36.193125] [03-19 10:05:52.193] Normal: 231*4kB (UMEH) 49*8kB (MEH) 14*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 3236kB ... [ 36.234447] [03-19 10:05:52.234] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC) [ 36.234455] [03-19 10:05:52.234] cache: ext4_io_end, object size: 64, buffer size: 64, default order: 0, min order: 0 [ 36.234459] [03-19 10:05:52.234] node 0: slabs: 53,objs: 3392, free: 0 This change further decreases the chance for wrong OOMs in the presence of a lot of CMA memory. [david@redhat.com: changelog addition] Link: https://lkml.kernel.org/r/1685501461-19290-1-git-send-email-zhaoyang.huang@unisoc.com Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: ke.wang <ke.wang@unisoc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm: page_isolation: write proper kerneldocJohannes Weiner
And remove the incorrect header comments. [akpm@linux-foundation.org: s/lower/first/, s/upper/last/, per Mike] Link: https://lkml.kernel.org/r/20230519111652.40658-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19Revert "mm: vmscan: make global slab shrink lockless"Qi Zheng
This reverts commit f95bdb700bc6bb74e1199b1f5f90c613e152cfa7. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. After discussion, we will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_srcu related changes first. [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ [2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ Link: https://lkml.kernel.org/r/20230609081518.3039120-8-qi.zheng@linux.dev Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19Revert "mm: vmscan: make memcg slab shrink lockless"Qi Zheng
This reverts commit caa05325c9126c77ebf114edce51536a0d0a9a08. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. After discussion, we will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_srcu related changes first. [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ [2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ Link: https://lkml.kernel.org/r/20230609081518.3039120-7-qi.zheng@linux.dev Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19Revert "mm: vmscan: add shrinker_srcu_generation"Qi Zheng
This reverts commit 475733dda5aedba9e086379aafe6b5ffd53e8f5e. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. We will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_srcu related changes first. [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ [2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ Link: https://lkml.kernel.org/r/20230609081518.3039120-6-qi.zheng@linux.dev Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless"Qi Zheng
This reverts commit 20cd1892fcc3efc10a7ac327cc3790494bec46b5. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. We will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_srcu related changes first. [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ [2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ Link: https://lkml.kernel.org/r/20230609081518.3039120-5-qi.zheng@linux.dev Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19Revert "mm: vmscan: hold write lock to reparent shrinker nr_deferred"Qi Zheng
This reverts commit b3cabea3c9153fd42fe5cb851ac58b51ea2b32b8. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. We will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. Because there will be other readers after reverting the shrinker_srcu related changes, so it is better to restore to hold read lock to reparent shrinker nr_deferred. [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ [2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ Link: https://lkml.kernel.org/r/20230609081518.3039120-4-qi.zheng@linux.dev Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19Revert "mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()"Qi Zheng
This reverts commit 1643db98d9b314e0a592d152603094fbf7ab906e. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. We will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So we still need shrinker_rwsem in synchronize_shrinkers() after reverting the shrinker_srcu related changes. [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ [2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ Link: https://lkml.kernel.org/r/20230609081518.3039120-3-qi.zheng@linux.dev Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19Revert "mm: shrinkers: convert shrinker_rwsem to mutex"Qi Zheng
Patch series "revert shrinker_srcu related changes". This patch (of 7): This reverts commit cf2e309ebca7bb0916771839f9b580b06c778530. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. After discussion, we will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_mutex back to shrinker_rwsem first. [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ [2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ Link: https://lkml.kernel.org/r/20230609081518.3039120-1-qi.zheng@linux.dev Link: https://lkml.kernel.org/r/20230609081518.3039120-2-qi.zheng@linux.dev Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yujie Liu <yujie.liu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/khugepaged: fix iteration in collapse_fileDavid Stevens
Remove an unnecessary call to xas_set(index) when iterating over the target range in collapse_file. The extra call to xas_set reset the xas cursor to the top of the tree, causing the xas_next call on the next iteration to walk the tree to index instead of advancing to index+1. This returned the same page again, which would cause collapse_file to fail because the page is already locked. This bug was hidden when CONFIG_DEBUG_VM was set. When that config was used, the xas_load in a subsequent VM_BUG_ON assert would walk xas from the top of the tree to index, causing the xas_next call on the next loop iteration to advance the cursor as expected. Link: https://lkml.kernel.org/r/20230607053135.2087354-1-stevensd@google.com Fixes: a2e17cc2efc7 ("mm/khugepaged: maintain page cache uptodate flag") Signed-off-by: David Stevens <stevensd@chromium.org> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19memfd: check for non-NULL file_seals in memfd_create() syscallRoberto Sassu
Ensure that file_seals is non-NULL before using it in the memfd_create() syscall. One situation in which memfd_file_seals_ptr() could return a NULL pointer when CONFIG_SHMEM=n, oopsing the kernel. Link: https://lkml.kernel.org/r/20230607132427.2867435-1-roberto.sassu@huaweicloud.com Fixes: 47b9012ecdc7 ("shmem: add sealing support to hugetlb-backed memfd") Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/vmalloc: do not output a spurious warning when huge vmalloc() failsLorenzo Stoakes
In __vmalloc_area_node() we always warn_alloc() when an allocation performed by vm_area_alloc_pages() fails unless it was due to a pending fatal signal. However, huge page allocations instigated either by vmalloc_huge() or __vmalloc_node_range() (or a caller that invokes this like kvmalloc() or kvmalloc_node()) always falls back to order-0 allocations if the huge page allocation fails. This renders the warning useless and noisy, especially as all callers appear to be aware that this may fallback. This has already resulted in at least one bug report from a user who was confused by this (see link). Therefore, simply update the code to only output this warning for order-0 pages when no fatal signal is pending. Link: https://bugzilla.suse.com/show_bug.cgi?id=1211410 Link: https://lkml.kernel.org/r/20230605201107.83298-1-lstoakes@gmail.com Fixes: 80b1d8fdfad1 ("mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node()") Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/mprotect: fix do_mprotect_pkey() limit checkLiam R. Howlett
The return of do_mprotect_pkey() can still be incorrectly returned as success if there is a gap that spans to or beyond the end address passed in. Update the check to ensure that the end address has indeed been seen. Link: https://lore.kernel.org/all/CABi2SkXjN+5iFoBhxk71t3cmunTk-s=rB4T7qo0UQRh17s49PQ@mail.gmail.com/ Link: https://lkml.kernel.org/r/20230606182912.586576-1-Liam.Howlett@oracle.com Fixes: 82f951340f25 ("mm/mprotect: fix do_mprotect_pkey() return on error") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-18mm/mmap: Fix error path in do_vmi_align_munmap()Liam R. Howlett
The error unrolling was leaving the VMAs detached in many cases and leaving the locked_vm statistic altered, and skipping the unrolling entirely in the case of the vma tree write failing. Fix the error path by re-attaching the detached VMAs and adding the necessary goto for the failed vma tree write, and fix the locked_vm statistic by only updating after the vma tree write succeeds. Fixes: 763ecb035029 ("mm: remove the vma linked list") Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-18tcp: Use per-vma locking for receive zerocopyArjun Roy
Per-VMA locking allows us to lock a struct vm_area_struct without taking the process-wide mmap lock in read mode. Consider a process workload where the mmap lock is taken constantly in write mode. In this scenario, all zerocopy receives are periodically blocked during that period of time - though in principle, the memory ranges being used by TCP are not touched by the operations that need the mmap write lock. This results in performance degradation. Now consider another workload where the mmap lock is never taken in write mode, but there are many TCP connections using receive zerocopy that are concurrently receiving. These connections all take the mmap lock in read mode, but this does induce a lot of contention and atomic ops for this process-wide lock. This results in additional CPU overhead caused by contending on the cache line for this lock. However, with per-vma locking, both of these problems can be avoided. As a test, I ran an RPC-style request/response workload with 4KB payloads and receive zerocopy enabled, with 100 simultaneous TCP connections. I measured perf cycles within the find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and without per-vma locking enabled. When using process-wide mmap semaphore read locking, about 1% of measured perf cycles were within this path. With per-VMA locking, this value dropped to about 0.45%. Signed-off-by: Arjun Roy <arjunroy@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-16Merge branches 'slab/for-6.5/prandom', 'slab/for-6.5/slab_no_merge' and ↵Vlastimil Babka
'slab/for-6.5/slab-deprecate' into slab/for-next Merge the feature branches scheduled for 6.5: - replace the usage of weak PRNGs, by David Keisar Schmidt - introduce the SLAB_NO_MERGE kmem_cache flag, by Jesper Dangaard Brouer - deprecate CONFIG_SLAB, with a planned removal, by myself
2023-06-16mm/slab_common: use SLAB_NO_MERGE instead of negative refcountVlastimil Babka
When CONFIG_MEMCG_KMEM is enabled, we disable cache merging for KMALLOC_NORMAL caches so they don't end up merged with a cache that uses ad-hoc __GFP_ACCOUNT when allocating. This was implemented by setting the refcount to -1, but now we have a proper SLAB_NO_MERGE flag, so use that instead. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com>
2023-06-12Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "19 hotfixes. 14 are cc:stable and the remainder address issues which were introduced during this development cycle or which were considered inappropriate for a backport" * tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: zswap: do not shrink if cgroup may not zswap page cache: fix page_cache_next/prev_miss off by one ocfs2: check new file size on fallocate call mailmap: add entry for John Keeping mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp() epoll: ep_autoremove_wake_function should use list_del_init_careful mm/gup_test: fix ioctl fail for compat task nilfs2: reject devices with insufficient block count ocfs2: fix use-after-free when unmounting read-only filesystem lib/test_vmalloc.c: avoid garbage in page array nilfs2: fix possible out-of-bounds segment allocation in resize ioctl riscv/purgatory: remove PGO flags powerpc/purgatory: remove PGO flags x86/purgatory: remove PGO flags kexec: support purgatories with .text.hot sections mm/uffd: allow vma to merge as much as possible mm/uffd: fix vma operation where start addr cuts part of vma radix-tree: move declarations to header nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
2023-06-12zswap: do not shrink if cgroup may not zswapNhat Pham
Before storing a page, zswap first checks if the number of stored pages exceeds the limit specified by memory.zswap.max, for each cgroup in the hierarchy. If this limit is reached or exceeded, then zswap shrinking is triggered and short-circuits the store attempt. However, since the zswap's LRU is not memcg-aware, this can create the following pathological behavior: the cgroup whose zswap limit is 0 will evict pages from other cgroups continually, without lowering its own zswap usage. This means the shrinking will continue until the need for swap ceases or the pool becomes empty. As a result of this, we observe a disproportionate amount of zswap writeback and a perpetually small zswap pool in our experiments, even though the pool limit is never hit. More generally, a cgroup might unnecessarily evict pages from other cgroups before we drive the memcg back below its limit. This patch fixes the issue by rejecting zswap store attempt without shrinking the pool when obj_cgroup_may_zswap() returns false. [akpm@linux-foundation.org: fix return of unintialized value] [akpm@linux-foundation.org: s/ENOSPC/ENOMEM/] Link: https://lkml.kernel.org/r/20230530222440.2777700-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230530232435.3097106-1-nphamcs@gmail.com Fixes: f4840ccfca25 ("zswap: memcg accounting") Signed-off-by: Nhat Pham <nphamcs@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12page cache: fix page_cache_next/prev_miss off by oneMike Kravetz
Ackerley Tng reported an issue with hugetlbfs fallocate here[1]. The issue showed up after the conversion of hugetlb page cache lookup code to use page_cache_next_miss. Code in hugetlb fallocate, userfaultfd and GUP is now using page_cache_next_miss to determine if a page is present the page cache. The following statement is used. present = page_cache_next_miss(mapping, index, 1) != index; There are two issues with page_cache_next_miss when used in this way. 1) If the passed value for index is equal to the 'wrap-around' value, the same index will always be returned. This wrap-around value is 0, so 0 will be returned even if page is present at index 0. 2) If there is no gap in the range passed, the last index in the range will be returned. When passed a range of 1 as above, the passed index value will be returned even if the page is present. The end result is the statement above will NEVER indicate a page is present in the cache, even if it is. As noted by Ackerley in [1], users can see this by hugetlb fallocate incorrectly returning EEXIST if pages are already present in the file. In addition, hugetlb pages will not be included in core dumps if they need to be brought in via GUP. userfaultfd UFFDIO_COPY also uses this code and will not notice pages already present in the cache. It may try to allocate a new page and potentially return ENOMEM as opposed to EEXIST. Both page_cache_next_miss and page_cache_prev_miss have similar issues. Fix by: - Check for index equal to 'wrap-around' value and do not exit early. - If no gap is found in range, return index outside range. - Update function description to say 'wrap-around' value could be returned if passed as index. [1] https://lore.kernel.org/linux-mm/cover.1683069252.git.ackerleytng@google.com/ Link: https://lkml.kernel.org/r/20230602225747.103865-2-mike.kravetz@oracle.com Fixes: d0ce0e47b323 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Cc: Erdem Aktas <erdemaktas@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vishal Annapurve <vannapurve@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()Kefeng Wang
If 'aggr_interval' is smaller than 'sample_interval', max_nr_accesses in damon_nr_accesses_to_accesses_bp() becomes zero which leads to divide error, let's validate the values of them in damon_set_attrs() to fix it, which similar to others attrs check. Link: https://lkml.kernel.org/r/20230527032101.167788-1-wangkefeng.wang@huawei.com Fixes: 2f5bef5a590b ("mm/damon/core: update monitoring results for new monitoring attributes") Reported-by: syzbot+841a46899768ec7bec67@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=841a46899768ec7bec67 Link: https://lore.kernel.org/damon/00000000000055fc4e05fc975bc2@google.com/ Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12mm/gup_test: fix ioctl fail for compat taskHaibo Li
When tools/testing/selftests/mm/gup_test.c is compiled as 32bit, then run on arm64 kernel, it reports "ioctl: Inappropriate ioctl for device". Fix it by filling compat_ioctl in gup_test_fops Link: https://lkml.kernel.org/r/20230526022125.175728-1-haibo.li@mediatek.com Signed-off-by: Haibo Li <haibo.li@mediatek.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12block: replace fmode_t with a block-specific type for block open flagsChristoph Hellwig
The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: use the holder as indication for exclusive opensChristoph Hellwig
The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-09mm: sparse: mark populate_section_memmap() staticArnd Bergmann
There are two definitions of this function, but the second one lacks the 'static' annotation: mm/sparse.c:704:25: error: no previous prototype for 'populate_section_memmap' [-Werror=missing-prototypes] Link: https://lkml.kernel.org/r/20230517131102.934196-4-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Paris <eparis@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>