From bd74fdaea146029e4fa12c6de89adbe0779348a9 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Sun, 18 Sep 2022 02:00:05 -0600 Subject: mm: multi-gen LRU: support page table walks MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only. NB: this behavior has nothing similar with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim. This patch uses the following optimizations when walking page tables: 1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration. 2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages. 3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. 4. It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[8, 10]% Ops/sec KB/sec patch1-7: 1147696.57 44640.29 patch1-8: 1245274.91 48435.66 Configurations: no change Client benchmark results: kswapd profiles: patch1-7 48.16% lzo1x_1_do_compress (real work) 8.20% page_vma_mapped_walk (overhead) 7.06% _raw_spin_unlock_irq 2.92% ptep_clear_flush 2.53% __zram_bvec_write 2.11% do_raw_spin_lock 2.02% memmove 1.93% lru_gen_look_around 1.56% free_unref_page_list 1.40% memset patch1-8 49.44% lzo1x_1_do_compress (real work) 6.19% page_vma_mapped_walk (overhead) 5.97% _raw_spin_unlock_irq 3.13% get_pfn_folio 2.85% ptep_clear_flush 2.42% __zram_bvec_write 2.08% do_raw_spin_lock 1.92% memmove 1.44% alloc_zspage 1.36% memset Configurations: no change Thanks to the following developers for their efforts [3]. kernel test robot [1] https://lwn.net/Articles/23732/ [2] https://llvm.org/docs/ScudoHardenedAllocator.html [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/ Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Cc: Andi Kleen Cc: Aneesh Kumar K.V Cc: Barry Song Cc: Catalin Marinas Cc: Dave Hansen Cc: Hillf Danton Cc: Jens Axboe Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Linus Torvalds Cc: Matthew Wilcox Cc: Mel Gorman Cc: Miaohe Lin Cc: Michael Larabel Cc: Michal Hocko Cc: Mike Rapoport Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Qi Zheng Cc: Tejun Heo Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- fs/exec.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'fs/exec.c') diff --git a/fs/exec.c b/fs/exec.c index 9a5ca7b82bfc..507a317d54db 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm) active_mm = tsk->active_mm; tsk->active_mm = mm; tsk->mm = mm; + lru_gen_add_mm(mm); /* * This prevents preemption while active_mm is being loaded and * it and mm are being updated, which could cause problems for @@ -1029,6 +1030,7 @@ static int exec_mmap(struct mm_struct *mm) tsk->mm->vmacache_seqnum = 0; vmacache_flush(tsk); task_unlock(tsk); + lru_gen_use_mm(mm); if (vfork) timens_on_fork(tsk->nsproxy, tsk); -- cgit From 7964cf8caa4dfa42c4149f3833d3878713cda3dc Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Tue, 6 Sep 2022 19:48:51 +0000 Subject: mm: remove vmacache By using the maple tree and the maple tree state, the vmacache is no longer beneficial and is complicating the VMA code. Remove the vmacache to reduce the work in keeping it up to date and code complexity. Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: Davidlohr Bueso Cc: "Matthew Wilcox (Oracle)" Cc: SeongJae Park Cc: Sven Schnelle Cc: Will Deacon Signed-off-by: Andrew Morton --- fs/exec.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'fs/exec.c') diff --git a/fs/exec.c b/fs/exec.c index 507a317d54db..2b919b30dc97 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -28,7 +28,6 @@ #include #include #include -#include #include #include #include @@ -1027,8 +1026,6 @@ static int exec_mmap(struct mm_struct *mm) activate_mm(active_mm, mm); if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) local_irq_enable(); - tsk->mm->vmacache_seqnum = 0; - vmacache_flush(tsk); task_unlock(tsk); lru_gen_use_mm(mm); -- cgit From 19066e58682ec156aac8d6cf94b79ab2f122a556 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 6 Sep 2022 19:48:56 +0000 Subject: exec: use VMA iterator instead of linked list Remove a use of the vm_next list by doing the initial lookup with the VMA iterator and then using it to find the next entry. Link: https://lkml.kernel.org/r/20220906194824.2110408-42-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: Davidlohr Bueso Cc: SeongJae Park Cc: Sven Schnelle Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- fs/exec.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'fs/exec.c') diff --git a/fs/exec.c b/fs/exec.c index 2b919b30dc97..afe55d0c3bcf 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -683,6 +683,8 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift) unsigned long length = old_end - old_start; unsigned long new_start = old_start - shift; unsigned long new_end = old_end - shift; + VMA_ITERATOR(vmi, mm, new_start); + struct vm_area_struct *next; struct mmu_gather tlb; BUG_ON(new_start > new_end); @@ -691,7 +693,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift) * ensure there are no vmas between where we want to go * and where we are */ - if (vma != find_vma(mm, new_start)) + if (vma != vma_next(&vmi)) return -EFAULT; /* @@ -710,12 +712,13 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift) lru_add_drain(); tlb_gather_mmu(&tlb, mm); + next = vma_next(&vmi); if (new_end > old_start) { /* * when the old and new regions overlap clear from new_end. */ free_pgd_range(&tlb, new_end, old_end, new_end, - vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING); + next ? next->vm_start : USER_PGTABLES_CEILING); } else { /* * otherwise, clean from old_start; this is done to not touch @@ -724,7 +727,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift) * for the others its just a little faster. */ free_pgd_range(&tlb, old_start, old_end, new_end, - vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING); + next ? next->vm_start : USER_PGTABLES_CEILING); } tlb_finish_mmu(&tlb); -- cgit