summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-10-12mm/memory.c: fix race when faulting a device private pageAlistair Popple
Patch series "Fix several device private page reference counting issues", v2 This series aims to fix a number of page reference counting issues in drivers dealing with device private ZONE_DEVICE pages. These result in use-after-free type bugs, either from accessing a struct page which no longer exists because it has been removed or accessing fields within the struct page which are no longer valid because the page has been freed. During normal usage it is unlikely these will cause any problems. However without these fixes it is possible to crash the kernel from userspace. These crashes can be triggered either by unloading the kernel module or unbinding the device from the driver prior to a userspace task exiting. In modules such as Nouveau it is also possible to trigger some of these issues by explicitly closing the device file-descriptor prior to the task exiting and then accessing device private memory. This involves some minor changes to both PowerPC and AMD GPU code. Unfortunately I lack hardware to test either of those so any help there would be appreciated. The changes mimic what is done in for both Nouveau and hmm-tests though so I doubt they will cause problems. This patch (of 8): When the CPU tries to access a device private page the migrate_to_ram() callback associated with the pgmap for the page is called. However no reference is taken on the faulting page. Therefore a concurrent migration of the device private page can free the page and possibly the underlying pgmap. This results in a race which can crash the kernel due to the migrate_to_ram() function pointer becoming invalid. It also means drivers can't reliably read the zone_device_data field because the page may have been freed with memunmap_pages(). Close the race by getting a reference on the page while holding the ptl to ensure it has not been freed. Unfortunately the elevated reference count will cause the migration required to handle the fault to fail. To avoid this failure pass the faulting page into the migrate_vma functions so that if an elevated reference count is found it can be checked to see if it's expected or not. [mpe@ellerman.id.au: fix build] Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Lyude Paul <lyude@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/damon: use damon_sz_region() in appropriate placeXin Hao
In many places we can use damon_sz_region() to instead of "r->ar.end - r->ar.start". Link: https://lkml.kernel.org/r/20220927001946.85375-2-xhao@linux.alibaba.com Signed-off-by: Xin Hao <xhao@linux.alibaba.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/damon: move sz_damon_region to damon_sz_regionXin Hao
Rename sz_damon_region() to damon_sz_region(), and move it to "include/linux/damon.h", because in many places, we can to use this func. Link: https://lkml.kernel.org/r/20220927001946.85375-1-xhao@linux.alibaba.com Signed-off-by: Xin Hao <xhao@linux.alibaba.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12lib/test_meminit: add checks for the allocation functionsXiaoke Wang
alloc_pages(), kmalloc() and vmalloc() are all memory allocation functions which can return NULL when some internal memory failures happen. So it is better to check the return of them to catch the failure in time for better test them. Link: https://lkml.kernel.org/r/tencent_D44A49FFB420EDCCBFB9221C8D14DFE12908@qq.com Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12kmsan: unpoison @tlb in arch_tlb_gather_mmu()Alexander Potapenko
This is an optimization to reduce stackdepot pressure. struct mmu_gather contains 7 1-bit fields packed into a 32-bit unsigned int value. The remaining 25 bits remain uninitialized and are never used, but KMSAN updates the origin for them in zap_pXX_range() in mm/memory.c, thus creating very long origin chains. This is technically correct, but consumes too much memory. Unpoisoning the whole structure will prevent creating such chains. Link: https://lkml.kernel.org/r/20220905122452.2258262-20-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12ext4,f2fs: fix readahead of verity dataMatthew Wilcox (Oracle)
The recent change of page_cache_ra_unbounded() arguments was buggy in the two callers, causing us to readahead the wrong pages. Move the definition of ractl down to after the index is set correctly. This affected performance on configurations that use fs-verity. Link: https://lkml.kernel.org/r/20221012193419.1453558-1-willy@infradead.org Fixes: 73bb49da50cd ("mm/readahead: make page_cache_ra_unbounded take a readahead_control") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Jintao Yin <nicememory@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/mmap: undo ->mmap() when arch_validate_flags() failsCarlos Llamas
Commit c462ac288f2c ("mm: Introduce arch_validate_flags()") added a late check in mmap_region() to let architectures validate vm_flags. The check needs to happen after calling ->mmap() as the flags can potentially be modified during this callback. If arch_validate_flags() check fails we unmap and free the vma. However, the error path fails to undo the ->mmap() call that previously succeeded and depending on the specific ->mmap() implementation this translates to reference increments, memory allocations and other operations what will not be cleaned up. There are several places (mainly device drivers) where this is an issue. However, one specific example is bpf_map_mmap() which keeps count of the mappings in map->writecnt. The count is incremented on ->mmap() and then decremented on vm_ops->close(). When arch_validate_flags() fails this count is off since bpf_map_mmap_close() is never called. One can reproduce this issue in arm64 devices with MTE support. Here the vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been set previously. From userspace then is enough to pass the PROT_MTE flag to mmap() syscall to trigger the arch_validate_flags() failure. The following program reproduces this issue: #include <stdio.h> #include <unistd.h> #include <linux/unistd.h> #include <linux/bpf.h> #include <sys/mman.h> int main(void) { union bpf_attr attr = { .map_type = BPF_MAP_TYPE_ARRAY, .key_size = sizeof(int), .value_size = sizeof(long long), .max_entries = 256, .map_flags = BPF_F_MMAPABLE, }; int fd; fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)); mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0); return 0; } By manually adding some log statements to the vm_ops callbacks we can confirm that when passing PROT_MTE to mmap() the map->writecnt is off upon ->release(): With PROT_MTE flag: root@debian:~# ./bpf-test [ 111.263874] bpf_map_write_active_inc: map=9 writecnt=1 [ 111.288763] bpf_map_release: map=9 writecnt=1 Without PROT_MTE flag: root@debian:~# ./bpf-test [ 157.816912] bpf_map_write_active_inc: map=10 writecnt=1 [ 157.830442] bpf_map_write_active_dec: map=10 writecnt=0 [ 157.832396] bpf_map_release: map=10 writecnt=0 This patch fixes the above issue by calling vm_ops->close() when the arch_validate_flags() check fails, after this we can proceed to unmap and free the vma on the error path. Link: https://lkml.kernel.org/r/20220930003844.1210987-1-cmllamas@google.com Fixes: c462ac288f2c ("mm: Introduce arch_validate_flags()") Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Liam Howlett <liam.howlett@oracle.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> [5.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled inPeter Xu
When PTE_MARKER_UFFD_WP not configured, it's still possible to reach pte marker code and trigger an warning. Add a few CONFIG_PTE_MARKER_UFFD_WP ifdefs to make sure the code won't be reached when not compiled in. Link: https://lkml.kernel.org/r/YzeR+R6b4bwBlBHh@x1n Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: <syzbot+2b9b4f0895be09a6dec3@syzkaller.appspotmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Edward Liaw <edliaw@google.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/mmap: preallocate maple nodes for brk vma expansionLiam Howlett
If the brk VMA is the last vma in a maple node and meets the rare criteria that it can be expanded, then preallocation is necessary to avoid a potential fs_reclaim circular lock issue on low resources. At the same time use the actual vma start address (unaligned) when calling vma_adjust_trans_huge(). Link: https://lkml.kernel.org/r/20221011160624.1253454-1-Liam.Howlett@oracle.com Fixes: 2e7ce7d354f2 (mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()) Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm: more vma cache removalAlexey Dobriyan
Link: https://lkml.kernel.org/r/Y0WuE3Riv4iy5Jx8@localhost.localdomain Fixes: 7964cf8caa4d ("mm: remove vmacache") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mmap: fix copy_vma() failure pathLiam Howlett
The anon vma was not unlinked and the file was not closed in the failure path when the machine runs out of memory during the maple tree modification. This caused a memory leak of the anon vma chain and vma since neither would be freed. Link: https://lkml.kernel.org/r/20221011203621.1446507-1-Liam.Howlett@oracle.com Fixes: 524e00b36e8c ("mm: remove rb tree") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Tested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/compaction: fix set skip in fast_find_migrateblockChuyi Zhou
When we successfully find a pageblock in fast_find_migrateblock(), the block will be set skip-flag through set_pageblock_skip(). However, when entering isolate_migratepages_block(), the whole pageblock will be skipped due to the branch 'if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages))'. Eventually we will goto isolate_abort and isolate nothing. That makes fast_find_migrateblock useless. In this patch, when we find a suitable pageblock in fast_find_migrateblock, we do noting but let isolate_migratepages_block to set skip flag to the pageblock after scan it. Normally, we would isolate some pages from the fast-find block. I use mmtest/thpscale-madvhugepage test it. Here is the result: baseline patch Amean fault-both-1 1331.66 ( 0.00%) 1261.04 * 5.30%* Amean fault-both-3 1383.95 ( 0.00%) 1191.69 * 13.89%* Amean fault-both-5 1568.13 ( 0.00%) 1445.20 * 7.84%* Amean fault-both-7 1819.62 ( 0.00%) 1555.13 * 14.54%* Amean fault-both-12 1106.96 ( 0.00%) 1149.43 * -3.84%* Amean fault-both-18 2196.93 ( 0.00%) 1875.77 * 14.62%* Amean fault-both-24 2642.69 ( 0.00%) 2671.21 * -1.08%* Amean fault-both-30 2901.89 ( 0.00%) 2857.32 * 1.54%* Amean fault-both-32 3747.00 ( 0.00%) 3479.23 * 7.15%* Link: https://lkml.kernel.org/r/20220713062009.597255-1-zhouchuyi@bytedance.com Fixes: 70b44595eafe9 ("mm, compaction: use free lists to quickly locate a migration source") Signed-off-by: zhouchuyi <zhouchuyi@bytedance.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/hugetlb.c: make __hugetlb_vma_unlock_write_put() staticAndrew Morton
Reported-by: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-07hugetlb: allocate vma lock for all sharable vmasMike Kravetz
The hugetlb vma lock was originally designed to synchronize pmd sharing. As such, it was only necessary to allocate the lock for vmas that were capable of pmd sharing. Later in the development cycle, it was discovered that it could also be used to simplify fault/truncation races as described in [1]. However, a subsequent change to allocate the lock for all vmas that use the page cache was never made. A fault/truncation race could leave pages in a file past i_size until the file is removed. Remove the previous restriction and allocate lock for all VM_MAYSHARE vmas. Warn in the unlikely event of allocation failure. [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com Fixes: "hugetlb: clean up code checking for fault/truncation races" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-07hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointerMike Kravetz
hugetlb file truncation/hole punch code may need to back out and take locks in order in the routine hugetlb_unmap_file_folio(). This code could race with vma freeing as pointed out in [1] and result in accessing a stale vma pointer. To address this, take the vma_lock when clearing the vma_lock->vma pointer. [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/ [mike.kravetz@oracle.com: address build issues] Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-07hugetlb: fix vma lock handling during split vma and range unmappingMike Kravetz
Patch series "hugetlb: fixes for new vma lock series". In review of the series "hugetlb: Use new vma lock for huge pmd sharing synchronization", Miaohe Lin pointed out two key issues: 1) There is a race in the routine hugetlb_unmap_file_folio when locks are dropped and reacquired in the correct order [1]. 2) With the switch to using vma lock for fault/truncate synchronization, we need to make sure lock exists for all VM_MAYSHARE vmas, not just vmas capable of pmd sharing. These two issues are addressed here. In addition, having a vma lock present in all VM_MAYSHARE vmas, uncovered some issues around vma splitting. Those are also addressed. [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/ This patch (of 3): The hugetlb vma lock hangs off the vm_private_data field and is specific to the vma. When vm_area_dup() is called as part of vma splitting, the vma lock pointer is copied to the new vma. This will result in issues such as double freeing of the structure. Update the hugetlb open vm_ops to allocate a new vma lock for the new vma. The routine __unmap_hugepage_range_final unconditionally unset VM_MAYSHARE to prevent subsequent pmd sharing. hugetlb_vma_lock_free attempted to anticipate this by checking both VM_MAYSHARE and VM_SHARED. However, if only VM_MAYSHARE was set we would miss the free. With the introduction of the vma lock, a vma can not participate in pmd sharing if vm_private_data is NULL. Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final, free the vma lock to prevent sharing. Also, update the sharing code to make sure vma lock is indeed a condition for pmd sharing. hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas. Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com Fixes: "hugetlb: add vma based lock for pmd sharing" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-07mglru: mm/vmscan.c: fix imprecise commentsYu Zhao
Link: https://lkml.kernel.org/r/YzSWfFI+MOeb1ils@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-07mm/mglru: don't sync disk for each aging cycleYu Zhao
wakeup_flusher_threads() was added under the assumption that if a system runs out of clean cold pages, it might want to write back dirty pages more aggressively so that they can become clean and be dropped. However, doing so can breach the rate limit a system wants to impose on writeback, resulting in early SSD wearout. Link: https://lkml.kernel.org/r/YzSiWq9UEER5LKup@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbolJohannes Weiner
Since 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control"), CONFIG_MEMCG_SWAP hasn't been a user-visible config option anymore, it just means CONFIG_MEMCG && CONFIG_SWAP. Update the sites accordingly and drop the symbol. [ While touching the docs, remove two references to CONFIG_MEMCG_KMEM, which hasn't been a user-visible symbol for over half a decade. ] Link: https://lkml.kernel.org/r/20220926135704.400818-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm: memcontrol: use do_memsw_account() in a few more placesJohannes Weiner
It's slightly more descriptive and consistent with other places that distinguish cgroup1's combined memory+swap accounting scheme from cgroup2's dedicated swap accounting. Link: https://lkml.kernel.org/r/20220926135704.400818-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm: memcontrol: deprecate swapaccounting=0 modeJohannes Weiner
The swapaccounting= commandline option already does very little today. To close a trivial containment failure case, the swap ownership tracking part of the swap controller has recently become mandatory (see commit 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control") for details), which makes up the majority of the work during swapout, swapin, and the swap slot map. The only thing left under this flag is the page_counter operations and the visibility of the swap control files in the first place, which are rather meager savings. There also aren't many scenarios, if any, where controlling the memory of a cgroup while allowing it unlimited access to a global swap space is a workable resource isolation strategy. On the other hand, there have been several bugs and confusion around the many possible swap controller states (cgroup1 vs cgroup2 behavior, memory accounting without swap accounting, memcg runtime disabled). This puts the maintenance overhead of retaining the toggle above its practical benefits. Deprecate it. Link: https://lkml.kernel.org/r/20220926135704.400818-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabledJohannes Weiner
Patch series "memcg swap fix & cleanups". This patch (of 4): Since commit 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control"), the cgroup swap arrays are used to track memory ownership at the time of swap readahead and swapoff, even if swap space *accounting* has been turned off by the user via swapaccount=0 (which sets cgroup_memory_noswap). However, the patch was overzealous: by simply dropping the cgroup_memory_noswap conditionals in the swapon, swapoff and uncharge path, it caused the cgroup arrays being allocated even when the memory controller as a whole is disabled. This is a waste of that memory. Restore mem_cgroup_disabled() checks, implied previously by cgroup_memory_noswap, in the swapon, swapoff, and swap_entry_free callbacks. Link: https://lkml.kernel.org/r/20220926135704.400818-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20220926135704.400818-2-hannes@cmpxchg.org Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Hugh Dickins <hughd@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/secretmem: remove reduntant return valueXiu Jianfeng
The return value @ret is always 0, so remove it and return 0 directly. Link: https://lkml.kernel.org/r/20220920012205.246217-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/hugetlb: add available_huge_pages() funcXin Hao
In hugetlb.c there are several places which compare the values of 'h->free_huge_pages' and 'h->resv_huge_pages', it looks a bit messy, so add a new available_huge_pages() function to do these. Link: https://lkml.kernel.org/r/20220922021929.98961-1-xhao@linux.alibaba.com Signed-off-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm: remove unused inline functions from include/linux/mm_inline.hGaosheng Cui
Remove the following unused inline functions from mm_inline.h: 1. All uses of add_page_to_lru_list_tail() have been removed since commit 7a3dbfe8a52b ("mm/swap: convert lru_deactivate_file to a folio_batch"), and it can be replaced by lruvec_add_folio_tail(). 2. All uses of __clear_page_lru_flags() have been removed since commit 188e8caee968 ("mm/swap: convert __page_cache_release() to use a folio"), and it can be replaced by __folio_clear_lru_flags(). They are useless, so remove them. Link: https://lkml.kernel.org/r/20220922110935.1495099-1-cuigaosheng1@huawei.com Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memoryZach O'Keefe
Add :collapse mod to userfaultfd selftest. Currently this mod is only valid for "shmem" test type, but could be used for other test types. When provided, memory allocated by ->allocate_area() will be hugepage-aligned enforced to be hugepage-sized. userfaultf_minor_test, after the UFFD-registered mapping has been populated by UUFD minor fault handler, attempt to MADV_COLLAPSE the UFFD-registered mapping to collapse the memory into a pmd-mapped THP. This test is meant to be a functional test of what occurs during UFFD-driven live migration of VMs backed by huge tmpfs where, after a hugepage-sized region has been successfully migrated (in native page-sized chunks, to avoid latency of fetched a hugepage over the network), we want to reclaim previous VM performance by remapping it at the PMD level. Link: https://lkml.kernel.org/r/20220907144521.3115321-11-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-11-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmdZach O'Keefe
This test tests that MADV_COLLAPSE acting on file/shmem memory for which (1) the file extent mapping by the memory is already a huge page in the page cache, and (2) the pmd mapping this memory in the target process is none. In practice, (1)+(2) is the state left over after khugepaged has successfully collapsed file/shmem memory for a target VMA, but the memory has not yet been refaulted. So, this test in-effect tests MADV_COLLAPSE racing with khugepaged to collapse the memory first. Link: https://lkml.kernel.org/r/20220907144521.3115321-10-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-10-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03selftests/vm: add thp collapse shmem testingZach O'Keefe
Add memory operations for shmem (memfd) memory, and reuse existing tests with the new memory operations. Shmem tests can be called with "shmem" mem_type, and shmem tests are ran with "all" mem_type as well. Link: https://lkml.kernel.org/r/20220907144521.3115321-9-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-9-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03selftests/vm: add thp collapse file and tmpfs testingZach O'Keefe
Add memory operations for file-backed and tmpfs memory. Call existing tests with these new memory operations to test collapse functionality of khugepaged and MADV_COLLAPSE on file-backed and tmpfs memory. Not all tests are reusable; for example, collapse_swapin_single_pte() which checks swap usage. Refactor test arguments. Usage is now: Usage: ./khugepaged <test type> [dir] <test type> : <context>:<mem_type> <context> : [all|khugepaged|madvise] <mem_type> : [all|anon|file] "file,all" mem_type requires [dir] argument "file,all" mem_type requires kernel built with CONFIG_READ_ONLY_THP_FOR_FS=y if [dir] is a (sub)directory of a tmpfs mount, tmpfs must be mounted with huge=madvise option for khugepaged tests to work Refactor calling tests to make it clear what collapse context / memory operations they support, but only invoke tests requested by user. Also log what test is being ran, and with what context / memory, to make test logs more human readable. A new test file is created and deleted for every test to ensure no pages remain in the page cache between tests (tests also may attempt to collapse different amount of memory). For file-backed memory where the file is stored on a block device, disable /sys/block/<device>/queue/read_ahead_kb so that pages don't find their way into the page cache without the tests faulting them in. Add file and shmem wrappers to vm_utils check for file and shmem hugepages in smaps. [zokeefe@google.com: fix "add thp collapse file and tmpfs testing" for tmpfs] Link: https://lkml.kernel.org/r/20220913212517.3163701-1-zokeefe@google.com Link: https://lkml.kernel.org/r/20220907144521.3115321-8-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-8-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03selftests/vm: modularize thp collapse memory operationsZach O'Keefe
Modularize operations to setup, cleanup, fault, and check for huge pages, for a given memory type. This allows reusing existing tests with additional memory types by defining new memory operations. Following patches will add file and shmem memory types. Link: https://lkml.kernel.org/r/20220907144521.3115321-7-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-7-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03selftests/vm: dedup THP helpersZach O'Keefe
These files: tools/testing/selftests/vm/vm_util.c tools/testing/selftests/vm/khugepaged.c Both contain logic to: 1) Determine hugepage size on current system 2) Read /proc/self/smaps to determine number of THPs at an address Refactor selftests/vm/khugepaged.c to use the vm_util common helpers and add it as a build dependency. Since selftests/vm/khugepaged.c is the largest user of check_huge(), change the signature of check_huge() to match selftests/vm/khugepaged.c's useage: take an expected number of hugepages, and return a bool indicating if the correct number of hugepages were found. Add a wrapper, check_huge_anon(), in anticipation of checking smaps for file and shmem hugepages. Update existing callsites to use the new pattern / function. Likewise, check_for_pattern() was duplicated, and it's a general enough helper to include in vm_util helpers as well. Link: https://lkml.kernel.org/r/20220907144521.3115321-6-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-6-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/khugepaged: add tracepoint to hpage_collapse_scan_file()Zach O'Keefe
Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd(). While this change is targeted at debugging MADV_COLLAPSE pathway, the "mm_khugepaged" prefix is retained for symmetry with huge_memory:trace_mm_khugepaged_scan_pmd, which retains it's legacy name to prevent changing kernel ABI as much as possible. Link: https://lkml.kernel.org/r/20220907144521.3115321-5-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-5-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/madvise: add file and shmem support to MADV_COLLAPSEZach O'Keefe
Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y). On success, the backing memory will be a hugepage. For the memory range and process provided, the page tables will synchronously have a huge pmd installed, mapping the THP. Other mappings of the file extent mapped by the memory range may be added to a set of entries that khugepaged will later process and attempt update their page tables to map the THP by a pmd. This functionality unlocks two important uses: (1) Immediately back executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevents page sharing and demand paging, both of which increase steady state memory footprint. Now, we can have the best of both worlds: Peak upfront performance and lower RAM footprints. (2) userfaultfd-based live migration of virtual machines satisfy UFFD faults by fetching native-sized pages over the network (to avoid latency of transferring an entire hugepage). However, after guest memory has been fully copied to the new host, MADV_COLLAPSE can be used to immediately increase guest performance. Since khugepaged is single threaded, this change now introduces possibility of collapse contexts racing in file collapse path. There a important few places to consider: (1) hpage_collapse_scan_file(), when we xas_pause() and drop RCU. We could have the memory collapsed out from under us, but the next xas_for_each() iteration will correctly pick up the hugepage. The hugepage might not be up to date (insofar as copying of small page contents might not have completed - the page still may be locked), but regardless what small page index we were iterating over, we'll find the hugepage and identify it as a suitably aligned compound page of order HPAGE_PMD_ORDER. In khugepaged path, we locklessly check the value of the pmd, and only add it to deferred collapse array if we find pmd mapping pte table. This is fine, since other values that could have raced in right afterwards denote failure, or that the memory was successfully collapsed, so we don't need further processing. In madvise path, we'll take mmap_lock() in write to serialize against page table updates and will know what to do based on the true value of the pmd: recheck all ptes if we point to a pte table, directly install the pmd, if the pmd has been cleared, but memory not yet faulted, or nothing at all if we find a huge pmd. It's worth putting emphasis here on how we treat the none pmd here. If khugepaged has processed this mm's page tables already, it will have left the pmd cleared (ready for refault by the process). Depending on the VMA flags and sysfs settings, amount of RAM on the machine, and the current load, could be a relatively common occurrence - and as such is one we'd like to handle successfully in MADV_COLLAPSE. When we see the none pmd in collapse_pte_mapped_thp(), we've locked mmap_lock in write and checked (a) huepaged_vma_check() to see if the backing memory is appropriate still, along with VMA sizing and appropriate hugepage alignment within the file, and (b) we've found a hugepage head of order HPAGE_PMD_ORDER at the offset in the file mapped by our hugepage-aligned virtual address. Even though the common-case is likely race with khugepaged, given these checks (regardless how we got here - we could be operating on a completely different file than originally checked in hpage_collapse_scan_file() for all we know) it should be safe to directly make the pmd a huge pmd pointing to this hugepage. (2) collapse_file() is mostly serialized on the same file extent by lock sequence: | lock hupepage | lock mapping->i_pages | lock 1st page | unlock mapping->i_pages | <page checks> | lock mapping->i_pages | page_ref_freeze(3) | xas_store(hugepage) | unlock mapping->i_pages | page_ref_unfreeze(1) | unlock 1st page V unlock hugepage Once a context (who already has their fresh hugepage locked) locks mapping->i_pages exclusively, it will hold said lock until it locks the first page, and it will hold that lock until the after the hugepage has been added to the page cache (and will unlock the hugepage after page table update, though that isn't important here). A racing context that loses the race for mapping->i_pages will then lose the race to locking the first page. Here - depending on how far the other racing context has gotten - we might find the new hugepage (in which case we'll exit cleanly when we check PageTransCompound()), or we'll find the "old" 1st small page (in which we'll exit cleanly when we discover unexpected refcount of 2 after isolate_lru_page()). This is assuming we are able to successfully lock the page we find - in shmem path, we could just fail the trylock and exit cleanly anyways. Failure path in collapse_file() is similar: once we hold lock on 1st small page, we are serialized against other collapse contexts. Before the 1st small page is unlocked, we add it back to the pagecache and unfreeze the refcount appropriately. Contexts who lost the race to the 1st small page will then find the same 1st small page with the correct refcount and will be able to proceed. [zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()] Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com [shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove check for multi-add in khugepaged_add_pte_mapped_thp()] Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmdsZach O'Keefe
The main benefit of THPs are that they can be mapped at the pmd level, increasing the likelihood of TLB hit and spending less cycles in page table walks. pte-mapped hugepages - that is - hugepage-aligned compound pages of order HPAGE_PMD_ORDER mapped by ptes - although being contiguous in physical memory, don't have this advantage. In fact, one could argue they are detrimental to system performance overall since they occupy a precious hugepage-aligned/sized region of physical memory that could otherwise be used more effectively. Additionally, pte-mapped hugepages can be the cheapest memory to collapse for khugepaged since no new hugepage allocation or copying of memory contents is necessary - we only need to update the mapping page tables. In the anonymous collapse path, we are able to collapse pte-mapped hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no effort when compound pages (of any order) are encountered. Identify pte-mapped hugepages in the file/shmem collapse path. The final step of which makes a racy check of the value of the pmd to ensure it maps a pte table. This should be fine, since races that result in false-positive (i.e. attempt collapse even though we shouldn't) will fail later in collapse_pte_mapped_thp() once we actually lock mmap_lock and reinspect the pmd value. Races that result in false-negatives (i.e. where we decide to not attempt collapse, but should have) shouldn't be an issue, since in the worst case, we do nothing - which is what we've done up to this point. We make a similar check in retract_page_tables(). If we do think we've found a pte-mapped hugepgae in khugepaged context, attempt to update page tables mapping this hugepage. Note that these collapses still count towards the /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter, and if the pte-mapped hugepage was also mapped into multiple process' address spaces, could be incremented for each page table update. Since we increment the counter when a pte-mapped hugepage is successfully added to the list of to-collapse pte-mapped THPs, it's possible that we never actually update the page table either. This is different from how file/shmem pages_collapsed accounting works today where only a successful page cache update is counted (it's also possible here that no page tables are actually changed). Though it incurs some slop, this is preferred to either not accounting for the event at all, or plumbing through data in struct mm_slot on whether to account for the collapse or not. Also note that work still needs to be done to support arbitrary compound pages, and that this should all be converted to using folios. [shy828301@gmail.com: Spelling mistake, update comment, and add Documentation] Link: https://lore.kernel.org/linux-mm/CAHbLzkpHwZxFzjfX9nxVoRhzup8WMjMfyL6Xiq8mZ9M-N3ombw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220907144521.3115321-3-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-3-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()Zach O'Keefe
Patch series "mm: add file/shmem support to MADV_COLLAPSE", v4. This series builds on top of the previous "mm: userspace hugepage collapse" series which introduced the MADV_COLLAPSE madvise mode and added support for private, anonymous mappings[2], by adding support for file and shmem backed memory to CONFIG_READ_ONLY_THP_FOR_FS=y kernels. File and shmem support have been added with effort to align with existing MADV_COLLAPSE semantics and policy decisions[3]. Collapse of shmem-backed memory ignores kernel-guiding directives and heuristics including all sysfs settings (transparent_hugepage/shmem_enabled), and tmpfs huge= mount options (shmem always supports large folios). Like anonymous mappings, on successful return of MADV_COLLAPSE on file/shmem memory, the contents of memory mapped by the addresses provided will be synchronously pmd-mapped THPs. This functionality unlocks two important uses: (1) Immediately back executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevents page sharing and demand paging, both of which increase steady state memory footprint. Now, we can have the best of both worlds: Peak upfront performance and lower RAM footprints. (2) userfaultfd-based live migration of virtual machines satisfy UFFD faults by fetching native-sized pages over the network (to avoid latency of transferring an entire hugepage). However, after guest memory has been fully copied to the new host, MADV_COLLAPSE can be used to immediately increase guest performance. khugepaged has received a small improvement by association and can now detect and collapse pte-mapped THPs. However, there is still work to be done along the file collapse path. Compound pages of arbitrary order still needs to be supported and THP collapse needs to be converted to using folios in general. Eventually, we'd like to move away from the read-only and executable-mapped constraints currently imposed on eligible files and support any inode claiming huge folio support. That said, I think the series as-is covers enough to claim that MADV_COLLAPSE supports file/shmem memory. Patches 1-3 Implement the guts of the series. Patch 4 Is a tracepoint for debugging. Patches 5-9 Refactor existing khugepaged selftests to work with new memory types + new collapse tests. Patch 10 Adds a userfaultfd selftest mode to mimic a functional test of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration. (v4 note: "userfaultfd shmem" selftest is failing as of Sep 22 mm-unstable) [1] https://lore.kernel.org/linux-mm/YyiK8YvVcrtZo0z3@google.com/ [2] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/ [3] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@google.com/ [4] https://lore.kernel.org/linux-mm/20220922222731.1124481-1-zokeefe@google.com/ [5] https://lore.kernel.org/linux-mm/20220922184651.1016461-1-zokeefe@google.com/ This patch (of 10): Extend 'mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()' to shmem, allowing callers to ignore /sys/kernel/transparent_hugepage/shmem_enabled and tmpfs huge= mount. This is intended to be used by MADV_COLLAPSE, and the rationale is analogous to the anon/file case: MADV_COLLAPSE is not coupled to directives that advise the kernel's decisions on when THPs should be considered eligible. shmem/tmpfs always claims large folio support, regardless of sysfs or mount options. [shy828301@gmail.com: test shmem_huge_force explicitly] Link: https://lore.kernel.org/linux-mm/CAHbLzko3A5-TpS0BgBeKkx5cuOkWgLvWXQH=TdgW-baO4rPtdg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220922224046.1143204-1-zokeefe@google.com Link: https://lkml.kernel.org/r/20220907144521.3115321-2-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-2-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03selftests/vm: retry on EAGAIN for MADV_COLLAPSE selftestZach O'Keefe
MADV_COLLAPSE is a best-effort request that will set errno to an actionable value if the request cannot be performed. For example, if pages are not found on the LRU, or if they are currently locked by something else, MADV_COLLAPSE will fail and set errno to EAGAIN to inform callers that they may try again. Since the khugepaged selftest is the first public use of MADV_COLLAPSE, set a best practice of checking errno and retrying on EAGAIN. Link: https://lkml.kernel.org/r/20220922184651.1016461-2-zokeefe@google.com Fixes: 9330694de59f ("selftests/vm: add MADV_COLLAPSE collapse context to selftests") Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/madvise: MADV_COLLAPSE return EAGAIN when page cannot be isolatedZach O'Keefe
MADV_COLLAPSE is a best-effort request that attempts to set an actionable errno value if the request cannot be fulfilled at the time. EAGAIN should be used to communicate that a resource was temporarily unavailable, but that the user may try again immediately. SCAN_DEL_PAGE_LRU is an internal result code used when a page cannot be isolated from it's LRU list. Since this, like SCAN_PAGE_LRU, is likely a transitory state, make MADV_COLLAPSE return EAGAIN so that users know they may reattempt the operation. Another important scenario to consider is race with khugepaged. khugepaged might isolate a page while MADV_COLLAPSE is interested in it. Even though racing with khugepaged might mean that the memory has already been collapsed, signalling an errno that is non-intrinsic to that memory or arguments provided to madvise(2) lets the user know that future attempts might (and in this case likely would) succeed, and avoids false-negative assumptions by the user. Link: https://lkml.kernel.org/r/20220922184651.1016461-1-zokeefe@google.com Fixes: 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse") Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()Zach O'Keefe
By the time we lock a page in collapse_pte_mapped_thp(), the page mapped by the address pushed onto the slot's .pte_mapped_thp[] array might have changed arbitrarily since we last looked at it. We revalidate that the page is still the head of a compound page, but we don't revalidate if the compound page is of order HPAGE_PMD_ORDER before applying rmap and page table updates. Since the kernel now supports large folios of arbitrary order, and since replacing page's pte mappings by a pmd mapping only makes sense for compound pages of order HPAGE_PMD_ORDER, revalidate that the compound order is indeed of order HPAGE_PMD_ORDER before proceeding. Link: https://lore.kernel.org/linux-mm/CAHbLzkon+2ky8v9ywGcsTUgXM_B35jt5NThYqQKXW2YV_GUacw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220922222731.1124481-1-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Suggested-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm: hugetlb: fix UAF in hugetlb_handle_userfaultLiu Shixin
The vma_lock and hugetlb_fault_mutex are dropped before handling userfault and reacquire them again after handle_userfault(), but reacquire the vma_lock could lead to UAF[1,2] due to the following race, hugetlb_fault hugetlb_no_page /*unlock vma_lock */ hugetlb_handle_userfault handle_userfault /* unlock mm->mmap_lock*/ vm_mmap_pgoff do_mmap mmap_region munmap_vma_range /* clean old vma */ /* lock vma_lock again <--- UAF */ /* unlock vma_lock */ Since the vma_lock will unlock immediately after hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in hugetlb_handle_userfault() to fix the issue. [1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/ [2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/ Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com Reported-by: Liu Zixian <liuzixian4@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm: memcontrol: make cgroup_memory_noswap a static keyKairui Song
cgroup_memory_noswap is used in many hot path, so make it a static key to lower the kernel overhead. Using 8G of ZRAM as SWAP, benchmark using `perf stat -d -d -d --repeat 100` with the following code snip in a non-root cgroup: #include <stdio.h> #include <string.h> #include <linux/mman.h> #include <sys/mman.h> #define MB 1024UL * 1024UL int main(int argc, char **argv){ void *p = mmap(NULL, 8000 * MB, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); memset(p, 0xff, 8000 * MB); madvise(p, 8000 * MB, MADV_PAGEOUT); memset(p, 0xff, 8000 * MB); return 0; } Before: 7,021.43 msec task-clock # 0.967 CPUs utilized ( +- 0.03% ) 4,010 context-switches # 573.853 /sec ( +- 0.01% ) 0 cpu-migrations # 0.000 /sec 2,052,057 page-faults # 293.661 K/sec ( +- 0.00% ) 12,616,546,027 cycles # 1.805 GHz ( +- 0.06% ) (39.92%) 156,823,666 stalled-cycles-frontend # 1.25% frontend cycles idle ( +- 0.10% ) (40.25%) 310,130,812 stalled-cycles-backend # 2.47% backend cycles idle ( +- 4.39% ) (40.73%) 18,692,516,591 instructions # 1.49 insn per cycle # 0.01 stalled cycles per insn ( +- 0.04% ) (40.75%) 4,907,447,976 branches # 702.283 M/sec ( +- 0.05% ) (40.30%) 13,002,578 branch-misses # 0.26% of all branches ( +- 0.08% ) (40.48%) 7,069,786,296 L1-dcache-loads # 1.012 G/sec ( +- 0.03% ) (40.32%) 649,385,847 L1-dcache-load-misses # 9.13% of all L1-dcache accesses ( +- 0.07% ) (40.10%) 1,485,448,688 L1-icache-loads # 212.576 M/sec ( +- 0.15% ) (39.49%) 31,628,457 L1-icache-load-misses # 2.13% of all L1-icache accesses ( +- 0.40% ) (39.57%) 6,667,311 dTLB-loads # 954.129 K/sec ( +- 0.21% ) (39.50%) 5,668,555 dTLB-load-misses # 86.40% of all dTLB cache accesses ( +- 0.12% ) (39.03%) 765 iTLB-loads # 109.476 /sec ( +- 21.81% ) (39.44%) 4,370,351 iTLB-load-misses # 214320.09% of all iTLB cache accesses ( +- 1.44% ) (39.86%) 149,207,254 L1-dcache-prefetches # 21.352 M/sec ( +- 0.13% ) (40.27%) 7.25869 +- 0.00203 seconds time elapsed ( +- 0.03% ) After: 6,576.16 msec task-clock # 0.953 CPUs utilized ( +- 0.10% ) 4,020 context-switches # 605.595 /sec ( +- 0.01% ) 0 cpu-migrations # 0.000 /sec 2,052,056 page-faults # 309.133 K/sec ( +- 0.00% ) 11,967,619,180 cycles # 1.803 GHz ( +- 0.36% ) (38.76%) 161,259,240 stalled-cycles-frontend # 1.38% frontend cycles idle ( +- 0.27% ) (36.58%) 253,605,302 stalled-cycles-backend # 2.16% backend cycles idle ( +- 4.45% ) (34.78%) 19,328,171,892 instructions # 1.65 insn per cycle # 0.01 stalled cycles per insn ( +- 0.10% ) (31.46%) 5,213,967,902 branches # 785.461 M/sec ( +- 0.18% ) (30.68%) 12,385,170 branch-misses # 0.24% of all branches ( +- 0.26% ) (34.13%) 7,271,687,822 L1-dcache-loads # 1.095 G/sec ( +- 0.12% ) (35.29%) 649,873,045 L1-dcache-load-misses # 8.93% of all L1-dcache accesses ( +- 0.11% ) (41.41%) 1,950,037,608 L1-icache-loads # 293.764 M/sec ( +- 0.33% ) (43.11%) 31,365,566 L1-icache-load-misses # 1.62% of all L1-icache accesses ( +- 0.39% ) (45.89%) 6,767,809 dTLB-loads # 1.020 M/sec ( +- 0.47% ) (48.42%) 6,339,590 dTLB-load-misses # 95.43% of all dTLB cache accesses ( +- 0.50% ) (46.60%) 736 iTLB-loads # 110.875 /sec ( +- 1.79% ) (48.60%) 4,314,836 iTLB-load-misses # 518653.73% of all iTLB cache accesses ( +- 0.63% ) (42.91%) 144,950,156 L1-dcache-prefetches # 21.836 M/sec ( +- 0.37% ) (41.39%) 6.89935 +- 0.00703 seconds time elapsed ( +- 0.10% ) The performance is clearly better. There is no significant hotspot improvement according to perf report, as there are quite a few callers of memcg_swap_enabled and do_memsw_account (which calls memcg_swap_enabled). Many pieces of minor optimizations resulted in lower overhead for the branch predictor, and bettter performance. Link: https://lkml.kernel.org/r/20220919180634.45958-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm: memcontrol: use memcg_kmem_enabled in count_objcg_eventKairui Song
Patch series "mm: memcontrol: cleanup and optimize for two accounting params", v2. This patch (of 2): There are currently two helpers for checking if cgroup kmem accounting is enabled: - mem_cgroup_kmem_disabled - memcg_kmem_enabled mem_cgroup_kmem_disabled is a simple helper that returns true if cgroup.memory=nokmem is specified, otherwise returns false. memcg_kmem_enabled is a bit different, it returns true if cgroup.memory=nokmem is not specified and there was at least one non-root memory control enabled cgroup ever created. This help improve performance when kmem accounting was not actually activated. And it's optimized with static branch. The usage of mem_cgroup_kmem_disabled is for sub-systems that need to preallocate data for kmem accounting since they could be initialized before kmem accounting is activated. But count_objcg_event doesn't need that, so using memcg_kmem_enabled is better here. Link: https://lkml.kernel.org/r/20220919180634.45958-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20220919180634.45958-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/damon: deduplicate damon_{reclaim,lru_sort}_apply_parameters()Kaixu Xia
The bodies of damon_{reclaim,lru_sort}_apply_parameters() contain duplicates. This commit adds a common function damon_set_region_biggest_system_ram_default() to remove the duplicates. Link: https://lkml.kernel.org/r/6329f00d.a70a0220.9bb29.3678SMTPIN_ADDED_BROKEN@mx.google.com Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/damon/sysfs: return 'err' value when call kstrtoul() failedXin Hao
We had better return the 'err' value when calling kstrtoul() failed, so the user will know why it really fails, there do little change, let it return the 'err' value when failed. Link: https://lkml.kernel.org/r/6329ebe0.050a0220.ec4bd.297cSMTPIN_ADDED_BROKEN@mx.google.com Suggested-by: SeongJae Park <sj@kernel.org> Signed-off-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/page_alloc: update comments for rmqueue()Ran Xiaokai
Since commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists"), the per-cpu page allocators (PCP) is not only for order-0 pages. Update the comments. Link: https://lkml.kernel.org/r/20220918025640.208586-1-ran.xiaokai@zte.com.cn Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/damon: rename damon_pageout_score() to damon_cold_score()Kaixu Xia
In the beginning there is only one damos_action 'DAMOS_PAGEOUT' that need to get the coldness score of a region for a scheme, which using damon_pageout_score() to do that. But now there are also other damos_action actions need the coldness score, so rename it to damon_cold_score() to make more sense. Link: https://lkml.kernel.org/r/1663423014-28907-1-git-send-email-kaixuxia@tencent.com Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03hugetlb: freeze allocated pages before creating hugetlb pagesMike Kravetz
When creating hugetlb pages, the hugetlb code must first allocate contiguous pages from a low level allocator such as buddy, cma or memblock. The pages returned from these low level allocators are ref counted. This creates potential issues with other code taking speculative references on these pages before they can be transformed to a hugetlb page. This issue has been addressed with methods and code such as that provided in [1]. Recent discussions about vmemmap freeing [2] have indicated that it would be beneficial to freeze all sub pages, including the head page of pages returned from low level allocators before converting to a hugetlb page. This helps avoid races if we want to replace the page containing vmemmap for the head page. There have been proposals to change at least the buddy allocator to return frozen pages as described at [3]. If such a change is made, it can be employed by the hugetlb code. However, as mentioned above hugetlb uses several low level allocators so each would need to be modified to return frozen pages. For now, we can manually freeze the returned pages. This is done in two places: 1) alloc_buddy_huge_page, only the returned head page is ref counted. We freeze the head page, retrying once in the VERY rare case where there may be an inflated ref count. 2) prep_compound_gigantic_page, for gigantic pages the current code freezes all pages except the head page. New code will simply freeze the head page as well. In a few other places, code checks for inflated ref counts on newly allocated hugetlb pages. With the modifications to freeze after allocating, this code can be removed. After hugetlb pages are freshly allocated, they are often added to the hugetlb free lists. Since these pages were previously ref counted, this was done via put_page() which would end up calling the hugetlb destructor: free_huge_page. With changes to freeze pages, we simply call free_huge_page directly to add the pages to the free list. In a few other places, freshly allocated hugetlb pages were immediately put into use, and the expectation was they were already ref counted. In these cases, we must manually ref count the page. [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/ [3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/ [mike.kravetz@oracle.com: fix NULL pointer dereference] Link: https://lkml.kernel.org/r/20220921202702.106069-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220916214638.155744-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/page_alloc: fix obsolete comment in deferred_pfn_valid()Miaohe Lin
There are no architectures that can have holes in the memory map within a pageblock since commit 859a85ddf90e ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE"). Update the corresponding comment. Link: https://lkml.kernel.org/r/20220916072257.9639-17-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/page_alloc: remove obsolete gfpflags_normal_context()Miaohe Lin
Since commit dacb5d8875cc ("tcp: fix page frag corruption on page fault"), there's no caller of gfpflags_normal_context(). Remove it as this helper is strictly tied to the sk page frag usage and there won't be other user in the future. [linmiaohe@huawei.com: fix htmldocs] Link: https://lkml.kernel.org/r/1bc55727-9b66-0e9e-c306-f10c4716ea89@huawei.com Link: https://lkml.kernel.org/r/20220916072257.9639-16-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/page_alloc: use costly_order in WARN_ON_ONCE_GFP()Miaohe Lin
There's no need to check whether order > PAGE_ALLOC_COSTLY_ORDER again. Minor readability improvement. Link: https://lkml.kernel.org/r/20220916072257.9639-15-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/page_alloc: init local variable buddy_pfnMiaohe Lin
The local variable buddy_pfn could be passed to buddy_merge_likely() without initialization if the passed in order is MAX_ORDER - 1. This looks buggy but buddy_pfn won't be used in this case as there's a order >= MAX_ORDER - 2 check. Init buddy_pfn to 0 anyway to avoid possible future misuse. Link: https://lkml.kernel.org/r/20220916072257.9639-14-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>