summaryrefslogtreecommitdiff
path: root/mm/vmscan.c
AgeCommit message (Collapse)Author
2020-10-16mm: use self-explanatory macros rather than "2"Yu Zhao
Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alex Shi <alex.shi@linux.alibaba.com> Link: http://lkml.kernel.org/r/20200831175042.3527153-2-yuzhao@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/vmscan: allow arbitrary sized pages to be paged outMatthew Wilcox (Oracle)
Remove the assumption that a compound page has HPAGE_PMD_NR pins from the page cache. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-12-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm/vmscan: fix comments for isolate_lru_page()Hui Su
fix comments for isolate_lru_page(): s/fundamentnal/fundamental Signed-off-by: Hui Su <sh_def@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20200927173923.GA8058@rlk Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm/vmscan: fix infinite loop in drop_slab_nodeChunxin Zang
We have observed that drop_caches can take a considerable amount of time (<put data here>). Especially when there are many memcgs involved because they are adding an additional overhead. It is quite unfortunate that the operation cannot be interrupted by a signal currently. Add a check for fatal signals into the main loop so that userspace can control early bailout. There are two reasons: 1. We have too many memcgs, even though one object freed in one memcg, the sum of object is bigger than 10. 2. We spend a lot of time in traverse memcg once. So, the memcg who traversed at the first have been freed many objects. Traverse memcg next time, the freed count bigger than 10 again. We can get the following info through 'ps': root:~# ps -aux | grep drop root 357956 ... R Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches root 1771385 ... R Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches root 1986319 ... R 18:56 117:27 echo 3 > /proc/sys/vm/drop_caches root 2002148 ... R Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches root 2564666 ... R 18:59 113:58 echo 3 > /proc/sys/vm/drop_caches root 2639347 ... R Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches root 3904747 ... R 03:35 993:31 echo 3 > /proc/sys/vm/drop_caches root 4016780 ... R Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches Use bpftrace follow 'freed' value in drop_slab_node: root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }' Attaching 1 probe... ^B^C @ret: [64, 128) 1 | | [128, 256) 28 | | [256, 512) 107 |@ | [512, 1K) 298 |@@@ | [1K, 2K) 613 |@@@@@@@ | [2K, 4K) 4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [4K, 8K) 442 |@@@@@ | [8K, 16K) 299 |@@@ | [16K, 32K) 100 |@ | [32K, 64K) 139 |@ | [64K, 128K) 56 | | [128K, 256K) 26 | | [256K, 512K) 2 | | In the while loop, we can check whether the TASK_KILLABLE signal is set, if so, we should break the loop. Signed-off-by: Chunxin Zang <zangchunxin@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Link: https://lkml.kernel.org/r/20200909152047.27905-1-zangchunxin@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-19mm: fix check_move_unevictable_pages() on THPHugh Dickins
check_move_unevictable_pages() is used in making unevictable shmem pages evictable: by shmem_unlock_mapping(), drm_gem_check_release_pagevec() and i915/gem check_release_pagevec(). Those may pass down subpages of a huge page, when /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force". That does not crash or warn at present, but the accounting of vmstats unevictable_pgs_scanned and unevictable_pgs_rescued is inconsistent: scanned being incremented on each subpage, rescued only on the head (since tails already appear evictable once the head has been updated). 5.8 commit 5d91f31faf8e ("mm: swap: fix vmstats for huge page") has established that vm_events in general (and unevictable_pgs_rescued in particular) should count every subpage: so follow that precedent here. Do this in such a way that if mem_cgroup_page_lruvec() is made stricter (to check page->mem_cgroup is always set), no problem: skip the tails before calling it, and add thp_nr_pages() to vmstats on the head. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Yang Shi <shy828301@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301405000.5954@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm: memcg: fix memcg reclaim soft lockupXunlei Pang
We've met softlockup with "CONFIG_PREEMPT_NONE=y", when the target memcg doesn't have any reclaimable memory. It can be easily reproduced as below: watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204] CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12 Call Trace: shrink_lruvec+0x49f/0x640 shrink_node+0x2a6/0x6f0 do_try_to_free_pages+0xe9/0x3e0 try_to_free_mem_cgroup_pages+0xef/0x1f0 try_charge+0x2c1/0x750 mem_cgroup_charge+0xd7/0x240 __add_to_page_cache_locked+0x2fd/0x370 add_to_page_cache_lru+0x4a/0xc0 pagecache_get_page+0x10b/0x2f0 filemap_fault+0x661/0xad0 ext4_filemap_fault+0x2c/0x40 __do_fault+0x4d/0xf9 handle_mm_fault+0x1080/0x1790 It only happens on our 1-vcpu instances, because there's no chance for oom reaper to run to reclaim the to-be-killed process. Add a cond_resched() at the upper shrink_node_memcgs() to solve this issue, this will mean that we will get a scheduling point for each memcg in the reclaimed hierarchy without any dependency on the reclaimable memory in that memcg thus making it more predictable. Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlpang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14mm: replace hpage_nr_pages with thp_nr_pagesMatthew Wilcox (Oracle)
The thp prefix is more frequently used than hpage and we should be consistent between the various functions. [akpm@linux-foundation.org: fix mm/migrate.c] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/vmscan.c: delete or fix duplicated wordsRandy Dunlap
Drop the repeated word "marked". Change "time time" to "same time". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200801173822.14973-14-rdunlap@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/vmscan: restore active/inactive ratio for anonymous LRUJoonsoo Kim
Now that workingset detection is implemented for anonymous LRU, we don't need large inactive list to allow detecting frequently accessed pages before they are reclaimed, anymore. This effectively reverts the temporary measure put in by commit "mm/vmscan: make active/inactive ratio as 1:1 for anon lru". Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-7-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/swap: implement workingset detection for anonymous LRUJoonsoo Kim
This patch implements workingset detection for anonymous LRU. All the infrastructure is implemented by the previous patches so this patch just activates the workingset detection by installing/retrieving the shadow entry and adding refault calculation. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/swapcache: support to handle the shadow entriesJoonsoo Kim
Workingset detection for anonymous page will be implemented in the following patch and it requires to store the shadow entries into the swapcache. This patch implements an infrastructure to store the shadow entry in the swapcache. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/workingset: prepare the workingset detection infrastructure for anon LRUJoonsoo Kim
To prepare the workingset detection for anon LRU, this patch splits workingset event counters for refault, activate and restore into anon and file variants, as well as the refaults counter in struct lruvec. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/vmscan: protect the workingset on anonymous LRUJoonsoo Kim
In current implementation, newly created or swap-in anonymous page is started on active list. Growing active list results in rebalancing active/inactive list so old pages on active list are demoted to inactive list. Hence, the page on active list isn't protected at all. Following is an example of this situation. Assume that 50 hot pages on active list. Numbers denote the number of pages on active/inactive list (active | inactive). 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(uo) | 50(h) 3. workload: another 50 newly created (used-once) pages 50(uo) | 50(uo), swap-out 50(h) This patch tries to fix this issue. Like as file LRU, newly created or swap-in anonymous pages will be inserted to the inactive list. They are promoted to active list if enough reference happens. This simple modification changes the above example as following. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(h) | 50(uo) 3. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(uo) As you can see, hot pages on active list would be protected. Note that, this implementation has a drawback that the page cannot be promoted and will be swapped-out if re-access interval is greater than the size of inactive list but less than the size of total(active+inactive). To solve this potential issue, following patch will apply workingset detection similar to the one that's already applied to file LRU. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/vmscan: make active/inactive ratio as 1:1 for anon lruJoonsoo Kim
Patch series "workingset protection/detection on the anonymous LRU list", v7. * PROBLEM In current implementation, newly created or swap-in anonymous page is started on the active list. Growing the active list results in rebalancing active/inactive list so old pages on the active list are demoted to the inactive list. Hence, hot page on the active list isn't protected at all. Following is an example of this situation. Assume that 50 hot pages on active list and system can contain total 100 pages. Numbers denote the number of pages on active/inactive list (active | inactive). (h) stands for hot pages and (uo) stands for used-once pages. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(uo) | 50(h) 3. workload: another 50 newly created (used-once) pages 50(uo) | 50(uo), swap-out 50(h) As we can see, hot pages are swapped-out and it would cause swap-in later. * SOLUTION Since this is what we want to avoid, this patchset implements workingset protection. Like as the file LRU list, newly created or swap-in anonymous page is started on the inactive list. Also, like as the file LRU list, if enough reference happens, the page will be promoted. This simple modification changes the above example as following. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(h) | 50(uo) 3. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(uo) hot pages remains in the active list. :) * EXPERIMENT I tested this scenario on my test bed and confirmed that this problem happens on current implementation. I also checked that it is fixed by this patchset. * SUBJECT workingset detection * PROBLEM Later part of the patchset implements the workingset detection for the anonymous LRU list. There is a corner case that workingset protection could cause thrashing. If we can avoid thrashing by workingset detection, we can get the better performance. Following is an example of thrashing due to the workingset protection. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (will be hot) pages 50(h) | 50(wh) 3. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(wh) 4. workload: 50 (will be hot) pages 50(h) | 50(wh), swap-in 50(wh) 5. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(wh) 6. repeat 4, 5 Without workingset detection, this kind of workload cannot be promoted and thrashing happens forever. * SOLUTION Therefore, this patchset implements workingset detection. All the infrastructure for workingset detecion is already implemented, so there is not much work to do. First, extend workingset detection code to deal with the anonymous LRU list. Then, make swap cache handles the exceptional value for the shadow entry. Lastly, install/retrieve the shadow value into/from the swap cache and check the refault distance. * EXPERIMENT I made a test program to imitates above scenario and confirmed that problem exists. Then, I checked that this patchset fixes it. My test setup is a virtual machine with 8 cpus and 6100MB memory. But, the amount of the memory that the test program can use is about 280 MB. This is because the system uses large ram-backed swap and large ramdisk to capture the trace. Test scenario is like as below. 1. allocate cold memory (512MB) 2. allocate hot-1 memory (96MB) 3. activate hot-1 memory (96MB) 4. allocate another hot-2 memory (96MB) 5. access cold memory (128MB) 6. access hot-2 memory (96MB) 7. repeat 5, 6 Since hot-1 memory (96MB) is on the active list, the inactive list can contains roughly 190MB pages. hot-2 memory's re-access interval (96+128 MB) is more 190MB, so it cannot be promoted without workingset detection and swap-in/out happens repeatedly. With this patchset, workingset detection works and promotion happens. Therefore, swap-in/out occurs less. Here is the result. (average of 5 runs) type swap-in swap-out base 863240 989945 patch 681565 809273 As we can see, patched kernel do less swap-in/out. * OVERALL TEST (ebizzy using modified random function) ebizzy is the test program that main thread allocates lots of memory and child threads access them randomly during the given times. Swap-in will happen if allocated memory is larger than the system memory. The random function that represents the zipf distribution is used to make hot/cold memory. Hot/cold ratio is controlled by the parameter. If the parameter is high, hot memory is accessed much larger than cold one. If the parameter is low, the number of access on each memory would be similar. I uses various parameters in order to show the effect of patchset on various hot/cold ratio workload. My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB ram swap. Result format is as following. param: 1-1024-0.1 - 1 (number of thread) - 1024 (allocated memory size, MB) - 0.1 (zipf distribution alpha, 0.1 works like as roughly uniform random, 1.3 works like as small portion of memory is hot and the others are cold) pswpin: smaller is better std: standard deviation improvement: negative is better * single thread param pswpin std improvement base 1-1024.0-0.1 14101983.40 79441.19 prot 1-1024.0-0.1 14065875.80 136413.01 ( -0.26 ) detect 1-1024.0-0.1 13910435.60 100804.82 ( -1.36 ) base 1-1024.0-0.7 7998368.80 43469.32 prot 1-1024.0-0.7 7622245.80 88318.74 ( -4.70 ) detect 1-1024.0-0.7 7618515.20 59742.07 ( -4.75 ) base 1-1024.0-1.3 1017400.80 38756.30 prot 1-1024.0-1.3 940464.60 29310.69 ( -7.56 ) detect 1-1024.0-1.3 945511.40 24579.52 ( -7.07 ) base 1-1280.0-0.1 22895541.40 50016.08 prot 1-1280.0-0.1 22860305.40 51952.37 ( -0.15 ) detect 1-1280.0-0.1 22705565.20 93380.35 ( -0.83 ) base 1-1280.0-0.7 13717645.60 46250.65 prot 1-1280.0-0.7 12935355.80 64754.43 ( -5.70 ) detect 1-1280.0-0.7 13040232.00 63304.00 ( -4.94 ) base 1-1280.0-1.3 1654251.40 4159.68 prot 1-1280.0-1.3 1522680.60 33673.50 ( -7.95 ) detect 1-1280.0-1.3 1599207.00 70327.89 ( -3.33 ) base 1-1536.0-0.1 31621775.40 31156.28 prot 1-1536.0-0.1 31540355.20 62241.36 ( -0.26 ) detect 1-1536.0-0.1 31420056.00 123831.27 ( -0.64 ) base 1-1536.0-0.7 19620760.60 60937.60 prot 1-1536.0-0.7 18337839.60 56102.58 ( -6.54 ) detect 1-1536.0-0.7 18599128.00 75289.48 ( -5.21 ) base 1-1536.0-1.3 2378142.40 20994.43 prot 1-1536.0-1.3 2166260.60 48455.46 ( -8.91 ) detect 1-1536.0-1.3 2183762.20 16883.24 ( -8.17 ) base 1-1792.0-0.1 40259714.80 90750.70 prot 1-1792.0-0.1 40053917.20 64509.47 ( -0.51 ) detect 1-1792.0-0.1 39949736.40 104989.64 ( -0.77 ) base 1-1792.0-0.7 25704884.40 69429.68 prot 1-1792.0-0.7 23937389.00 79945.60 ( -6.88 ) detect 1-1792.0-0.7 24271902.00 35044.30 ( -5.57 ) base 1-1792.0-1.3 3129497.00 32731.86 prot 1-1792.0-1.3 2796994.40 19017.26 ( -10.62 ) detect 1-1792.0-1.3 2886840.40 33938.82 ( -7.75 ) base 1-2048.0-0.1 48746924.40 50863.88 prot 1-2048.0-0.1 48631954.40 24537.30 ( -0.24 ) detect 1-2048.0-0.1 48509419.80 27085.34 ( -0.49 ) base 1-2048.0-0.7 32046424.40 78624.22 prot 1-2048.0-0.7 29764182.20 86002.26 ( -7.12 ) detect 1-2048.0-0.7 30250315.80 101282.14 ( -5.60 ) base 1-2048.0-1.3 3916723.60 24048.55 prot 1-2048.0-1.3 3490781.60 33292.61 ( -10.87 ) detect 1-2048.0-1.3 3585002.20 44942.04 ( -8.47 ) * multi thread param pswpin std improvement base 8-1024.0-0.1 16219822.60 329474.01 prot 8-1024.0-0.1 15959494.00 654597.45 ( -1.61 ) detect 8-1024.0-0.1 15773790.80 502275.25 ( -2.75 ) base 8-1024.0-0.7 9174107.80 537619.33 prot 8-1024.0-0.7 8571915.00 385230.08 ( -6.56 ) detect 8-1024.0-0.7 8489484.20 364683.00 ( -7.46 ) base 8-1024.0-1.3 1108495.60 83555.98 prot 8-1024.0-1.3 1038906.20 63465.20 ( -6.28 ) detect 8-1024.0-1.3 941817.80 32648.80 ( -15.04 ) base 8-1280.0-0.1 25776114.20 450480.45 prot 8-1280.0-0.1 25430847.00 465627.07 ( -1.34 ) detect 8-1280.0-0.1 25282555.00 465666.55 ( -1.91 ) base 8-1280.0-0.7 15218968.00 702007.69 prot 8-1280.0-0.7 13957947.80 492643.86 ( -8.29 ) detect 8-1280.0-0.7 14158331.20 238656.02 ( -6.97 ) base 8-1280.0-1.3 1792482.80 30512.90 prot 8-1280.0-1.3 1577686.40 34002.62 ( -11.98 ) detect 8-1280.0-1.3 1556133.00 22944.79 ( -13.19 ) base 8-1536.0-0.1 33923761.40 575455.85 prot 8-1536.0-0.1 32715766.20 300633.51 ( -3.56 ) detect 8-1536.0-0.1 33158477.40 117764.51 ( -2.26 ) base 8-1536.0-0.7 20628907.80 303851.34 prot 8-1536.0-0.7 19329511.20 341719.31 ( -6.30 ) detect 8-1536.0-0.7 20013934.00 385358.66 ( -2.98 ) base 8-1536.0-1.3 2588106.40 130769.20 prot 8-1536.0-1.3 2275222.40 89637.06 ( -12.09 ) detect 8-1536.0-1.3 2365008.40 124412.55 ( -8.62 ) base 8-1792.0-0.1 43328279.20 946469.12 prot 8-1792.0-0.1 41481980.80 525690.89 ( -4.26 ) detect 8-1792.0-0.1 41713944.60 406798.93 ( -3.73 ) base 8-1792.0-0.7 27155647.40 536253.57 prot 8-1792.0-0.7 24989406.80 502734.52 ( -7.98 ) detect 8-1792.0-0.7 25524806.40 263237.87 ( -6.01 ) base 8-1792.0-1.3 3260372.80 137907.92 prot 8-1792.0-1.3 2879187.80 63597.26 ( -11.69 ) detect 8-1792.0-1.3 2892962.20 33229.13 ( -11.27 ) base 8-2048.0-0.1 50583989.80 710121.48 prot 8-2048.0-0.1 49599984.40 228782.42 ( -1.95 ) detect 8-2048.0-0.1 50578596.00 660971.66 ( -0.01 ) base 8-2048.0-0.7 33765479.60 812659.55 prot 8-2048.0-0.7 30767021.20 462907.24 ( -8.88 ) detect 8-2048.0-0.7 32213068.80 211884.24 ( -4.60 ) base 8-2048.0-1.3 3941675.80 28436.45 prot 8-2048.0-1.3 3538742.40 76856.08 ( -10.22 ) detect 8-2048.0-1.3 3579397.80 58630.95 ( -9.19 ) As we can see, all the cases show improvement. Especially, test case with zipf distribution 1.3 show more improvements. It means that if there is a hot/cold tendency in anon pages, this patchset works better. This patch (of 6): Current implementation of LRU management for anonymous page has some problems. Most important one is that it doesn't protect the workingset, that is, pages on the active LRU list. Although, this problem will be fixed in the following patchset, the preparation is required and this patch does it. What following patch does is to implement workingset protection. After the following patchset, newly created or swap-in pages will start their lifetime on the inactive list. If inactive list is too small, there is not enough chance to be referenced and the page cannot become the workingset. In order to provide the newly anonymous or swap-in pages enough chance to be referenced again, this patch makes active/inactive LRU ratio as 1:1. This is just a temporary measure. Later patch in the series introduces workingset detection for anonymous LRU that will be used to better decide if pages should start on the active and inactive list. Afterwards this patch is effectively reverted. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/1595490560-15117-1-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1595490560-15117-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: vmscan: consistent update to pgrefillShakeel Butt
The vmstat pgrefill is useful together with pgscan and pgsteal stats to measure the reclaim efficiency. However vmstat's pgrefill is not updated consistently at system level. It gets updated for both global and memcg reclaim however pgscan and pgsteal are updated for only global reclaim. So, update pgrefill only for global reclaim. If someone is interested in the stats representing both system level as well as memcg level reclaim, then consult the root memcg's memory.stat instead of /proc/vmstat. Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200711011459.1159929-1-shakeelb@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/vmscan.c: fix typodylan-meiners
Change "optizimation" to "optimization". Signed-off-by: dylan-meiners <spacct.spacct@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200609185144.10049-1-spacct.spacct@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: remove vm_total_pagesDavid Hildenbrand
The global variable "vm_total_pages" is a relic from older days. There is only a single user that reads the variable - build_all_zonelists() - and the first thing it does is update it. Use a local variable in build_all_zonelists() instead and remove the global variable. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcontrol: don't count limit-setting reclaim as memory pressureJohannes Weiner
When an outside process lowers one of the memory limits of a cgroup (or uses the force_empty knob in cgroup1), direct reclaim is performed in the context of the write(), in order to directly enforce the new limit and have it being met by the time the write() returns. Currently, this reclaim activity is accounted as memory pressure in the cgroup that the writer(!) belongs to. This is unexpected. It specifically causes problems for senpai (https://github.com/facebookincubator/senpai), which is an agent that routinely adjusts the memory limits and performs associated reclaim work in tens or even hundreds of cgroups running on the host. The cgroup that senpai is running in itself will report elevated levels of memory pressure, even though it itself is under no memory shortage or any sort of distress. Move the psi annotation from the central cgroup reclaim function to callsites in the allocation context, and thereby no longer count any limit-setting reclaim as memory pressure. If the newly set limit causes the workload inside the cgroup into direct reclaim, that of course will continue to count as memory pressure. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200728135210.379885-2-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm, memcg: decouple e{low,min} state mutations from protection checksChris Down
mem_cgroup_protected currently is both used to set effective low and min and return a mem_cgroup_protection based on the result. As a user, this can be a little unexpected: it appears to be a simple predicate function, if not for the big warning in the comment above about the order in which it must be executed. This change makes it so that we separate the state mutations from the actual protection checks, which makes it more obvious where we need to be careful mutating internal state, and where we are simply checking and don't need to worry about that. [mhocko@suse.com - don't check protection on root memcgs] Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Yafang Shao <laoar.shao@gmail.com> Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm, memcg: avoid stale protection values when cgroup is above protectionYafang Shao
Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4. This series contains a fix for a edge case in my earlier protection calculation patches, and a patch to make the area overall a little more robust to hopefully help avoid this in future. This patch (of 2): A cgroup can have both memory protection and a memory limit to isolate it from its siblings in both directions - for example, to prevent it from being shrunk below 2G under high pressure from outside, but also from growing beyond 4G under low pressure. Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") implemented proportional scan pressure so that multiple siblings in excess of their protection settings don't get reclaimed equally but instead in accordance to their unprotected portion. During limit reclaim, this proportionality shouldn't apply of course: there is no competition, all pressure is from within the cgroup and should be applied as such. Reclaim should operate at full efficiency. However, mem_cgroup_protected() never expected anybody to look at the effective protection values when it indicated that the cgroup is above its protection. As a result, a query during limit reclaim may return stale protection values that were calculated by a previous reclaim cycle in which the cgroup did have siblings. When this happens, reclaim is unnecessarily hesitant and potentially slow to meet the desired limit. In theory this could lead to premature OOM kills, although it's not obvious this has occurred in practice. Workaround the problem by special casing reclaim roots in mem_cgroup_protection. These memcgs are never participating in the reclaim protection because the reclaim is internal. We have to ignore effective protection values for reclaim roots because mem_cgroup_protected might be called from racing reclaim contexts with different roots. Calculation is relying on root -> leaf tree traversal therefore top-down reclaim protection invariants should hold. The only exception is the reclaim root which should have effective protection set to 0 but that would be problematic for the following setup: Let's have global and A's reclaim in parallel: | A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G) |\ | C (low = 1G, usage = 2.5G) B (low = 1G, usage = 0.5G) for A reclaim we have B.elow = B.low C.elow = C.low For the global reclaim A.elow = A.low B.elow = min(B.usage, B.low) because children_low_usage <= A.elow C.elow = min(C.usage, C.low) With the effective values resetting we have A reclaim A.elow = 0 B.elow = B.low C.elow = C.low and global reclaim could see the above and then B.elow = C.elow = 0 because children_low_usage > A.elow Which means that protected memcgs would get reclaimed. In future we would like to make mem_cgroup_protected more robust against racing reclaim contexts but that is likely more complex solution than this simple workaround. [hannes@cmpxchg.org - large part of the changelog] [mhocko@suse.com - workaround explanation] [chris@chrisdown.name - retitle] Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg: convert vmstat slab counters to bytesRoman Gushchin
In order to prepare for per-object slab memory accounting, convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes. To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB). Internally global and per-node counters are stored in pages, however memcg and lruvec counters are stored in bytes. This scheme may look weird, but only for now. As soon as slab pages will be shared between multiple cgroups, global and node counters will reflect the total number of slab pages. However memcg and lruvec counters will be used for per-memcg slab memory tracking, which will take separate kernel objects in the account. Keeping global and node counters in pages helps to avoid additional overhead. The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it will fit into atomic_long_t we use for vmstats. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26mm: workingset: age nonresident information alongside anonymous pagesJohannes Weiner
Patch series "fix for "mm: balance LRU lists based on relative thrashing" patchset" This patchset fixes some problems of the patchset, "mm: balance LRU lists based on relative thrashing", which is now merged on the mainline. Patch "mm: workingset: let cache workingset challenge anon fix" is the result of discussion with Johannes. See following link. http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.org And, the other two are minor things which are found when I try to rebase my patchset. This patch (of 3): After ("mm: workingset: let cache workingset challenge anon fix"), we compare refault distances to active_file + anon. But age of the non-resident information is only driven by the file LRU. As a result, we may overestimate the recency of any incoming refaults and activate them too eagerly, causing unnecessary LRU churn in certain situations. Make anon aging drive nonresident age as well to address that. Link: http://lkml.kernel.org/r/1592288204-27734-1-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1592288204-27734-2-git-send-email-iamjoonsoo.kim@lge.com Fixes: 34e58cac6d8f2a ("mm: workingset: let cache workingset challenge anon") Reported-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm/vmsan: fix some typos in commentEthon Paul
There are some typos, fix them. s/regsitration/registration s/santity/sanity s/decremeting/decrementing Signed-off-by: Ethon Paul <ethp@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200411071544.16222-1-ethp@qq.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: vmscan: limit the range of LRU type balancingJohannes Weiner
When LRU cost only shows up on one list, we abruptly stop scanning that list altogether. That's an extreme reaction: by the time the other list starts thrashing and the pendulum swings back, we may have no recent age information on the first list anymore, and we could have significant latencies until the scanner has caught up. Soften this change in the feedback system by ensuring that no list receives less than a third of overall pressure, and only distribute the other 66% according to LRU cost. This ensures that we maintain a minimum rate of aging on the entire workingset while it's being pressured, while still allowing a generous rate of convergence when the relative sizes of the lists need to adjust. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-15-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: vmscan: reclaim writepage is IO costJohannes Weiner
The VM tries to balance reclaim pressure between anon and file so as to reduce the amount of IO incurred due to the memory shortage. It already counts refaults and swapins, but in addition it should also count writepage calls during reclaim. For swap, this is obvious: it's IO that wouldn't have occurred if the anonymous memory hadn't been under memory pressure. From a relative balancing point of view this makes sense as well: even if anon is cold and reclaimable, a cache that isn't thrashing may have equally cold pages that don't require IO to reclaim. For file writeback, it's trickier: some of the reclaim writepage IO would have likely occurred anyway due to dirty expiration. But not all of it - premature writeback reduces batching and generates additional writes. Since the flushers are already woken up by the time the VM starts writing cache pages one by one, let's assume that we'e likely causing writes that wouldn't have happened without memory pressure. In addition, the per-page cost of IO would have probably been much cheaper if written in larger batches from the flusher thread rather than the single-page-writes from kswapd. For our purposes - getting the trend right to accelerate convergence on a stable state that doesn't require paging at all - this is sufficiently accurate. If we later wanted to optimize for sustained thrashing, we can still refine the measurements. Count all writepage calls from kswapd as IO cost toward the LRU that the page belongs to. Why do this dynamically? Don't we know in advance that anon pages require IO to reclaim, and so could build in a static bias? First, scanning is not the same as reclaiming. If all the anon pages are referenced, we may not swap for a while just because we're scanning the anon list. During this time, however, it's important that we age anonymous memory and the page cache at the same rate so that their hot-cold gradients are comparable. Everything else being equal, we still want to reclaim the coldest memory overall. Second, we keep copies in swap unless the page changes. If there is swap-backed data that's mostly read (tmpfs file) and has been swapped out before, we can reclaim it without incurring additional IO. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: vmscan: determine anon/file pressure balance at the reclaim rootJohannes Weiner
We split the LRU lists into anon and file, and we rebalance the scan pressure between them when one of them begins thrashing: if the file cache experiences workingset refaults, we increase the pressure on anonymous pages; if the workload is stalled on swapins, we increase the pressure on the file cache instead. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, LRU pressure balancing is done on an individual cgroup LRU level. As a result, when one cgroup is thrashing on the filesystem cache while a sibling may have cold anonymous pages, pressure doesn't get equalized between them. This patch moves LRU balancing decision to the root of reclaim - the same level where the LRU order is established. It does this by tracking LRU cost recursively, so that every level of the cgroup tree knows the aggregate LRU cost of all memory within its domain. When the page scanner calculates the scan balance for any given individual cgroup's LRU list, it uses the values from the ancestor cgroup that initiated the reclaim cycle. If one sibling is then thrashing on the cache, it will tip the pressure balance inside its ancestors, and the next hierarchical reclaim iteration will go more after the anon pages in the tree. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-13-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: balance LRU lists based on relative thrashingJohannes Weiner
Since the LRUs were split into anon and file lists, the VM has been balancing between page cache and anonymous pages based on per-list ratios of scanned vs. rotated pages. In most cases that tips page reclaim towards the list that is easier to reclaim and has the fewest actively used pages, but there are a few problems with it: 1. Refaults and LRU rotations are weighted the same way, even though one costs IO and the other costs a bit of CPU. 2. The less we scan an LRU list based on already observed rotations, the more we increase the sampling interval for new references, and rotations become even more likely on that list. This can enter a death spiral in which we stop looking at one list completely until the other one is all but annihilated by page reclaim. Since commit a528910e12ec ("mm: thrash detection-based file cache sizing") we have refault detection for the page cache. Along with swapin events, they are good indicators of when the file or anon list, respectively, is too small for its workingset and needs to grow. For example, if the page cache is thrashing, the cache pages need more time in memory, while there may be colder pages on the anonymous list. Likewise, if swapped pages are faulting back in, it indicates that we reclaim anonymous pages too aggressively and should back off. Replace LRU rotations with refaults and swapins as the basis for relative reclaim cost of the two LRUs. This will have the VM target list balances that incur the least amount of IO on aggregate. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: only count actual rotations as LRU reclaim costJohannes Weiner
When shrinking the active file list we rotate referenced pages only when they're in an executable mapping. The others get deactivated. When it comes to balancing scan pressure, though, we count all referenced pages as rotated, even the deactivated ones. Yet they do not carry the same cost to the system: the deactivated page *might* refault later on, but the deactivation is tangible progress toward freeing pages; rotations on the other hand cost time and effort without getting any closer to freeing memory. Don't treat both events as equal. The following patch will hook up LRU balancing to cache and anon refaults, which are a much more concrete cost signal for reclaiming one list over the other. Thus, remove the maybe-IO cost bias from page references, and only note the CPU cost for actual rotations that prevent the pages from getting reclaimed. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-11-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: base LRU balancing on an explicit cost modelJohannes Weiner
Currently, scan pressure between the anon and file LRU lists is balanced based on a mixture of reclaim efficiency and a somewhat vague notion of "value" of having certain pages in memory over others. That concept of value is problematic, because it has caused us to count any event that remotely makes one LRU list more or less preferrable for reclaim, even when these events are not directly comparable and impose very different costs on the system. One example is referenced file pages that we still deactivate and referenced anonymous pages that we actually rotate back to the head of the list. There is also conceptual overlap with the LRU algorithm itself. By rotating recently used pages instead of reclaiming them, the algorithm already biases the applied scan pressure based on page value. Thus, when rebalancing scan pressure due to rotations, we should think of reclaim cost, and leave assessing the page value to the LRU algorithm. Lastly, considering both value-increasing as well as value-decreasing events can sometimes cause the same type of event to be counted twice, i.e. how rotating a page increases the LRU value, while reclaiming it succesfully decreases the value. In itself this will balance out fine, but it quietly skews the impact of events that are only recorded once. The abstract metric of "value", the murky relationship with the LRU algorithm, and accounting both negative and positive events make the current pressure balancing model hard to reason about and modify. This patch switches to a balancing model of accounting the concrete, actually observed cost of reclaiming one LRU over another. For now, that cost includes pages that are scanned but rotated back to the list head. Subsequent patches will add consideration for IO caused by refaulting of recently evicted pages. Replace struct zone_reclaim_stat with two cost counters in the lruvec, and make everything that affects cost go through a new lru_note_cost() function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: vmscan: drop unnecessary div0 avoidance rounding in get_scan_count()Johannes Weiner
When we calculate the relative scan pressure between the anon and file LRU lists, we have to assume that reclaim_stat can contain zeroes. To avoid div0 crashes, we add 1 to all denominators like so: anon_prio = swappiness; file_prio = 200 - anon_prio; [...] /* * The amount of pressure on anon vs file pages is inversely * proportional to the fraction of recently scanned pages on * each list that were recently referenced and in active use. */ ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1); ap /= reclaim_stat->recent_rotated[0] + 1; fp = file_prio * (reclaim_stat->recent_scanned[1] + 1); fp /= reclaim_stat->recent_rotated[1] + 1; spin_unlock_irq(&pgdat->lru_lock); fraction[0] = ap; fraction[1] = fp; denominator = ap + fp + 1; While reclaim_stat can contain 0, it's not actually possible for ap + fp to be 0. One of anon_prio or file_prio could be zero, but they must still add up to 200. And the reclaim_stat fraction, due to the +1 in there, is always at least 1. So if one of the two numerators is 0, the other one can't be. ap + fp is always at least 1. Drop the + 1. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-8-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: allow swappiness that prefers reclaiming anon over the file workingsetJohannes Weiner
With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap devices such as zswap, it's possible for swap to be much faster than filesystems, and for swapping to be preferable over thrashing filesystem caches. Allow setting swappiness - which defines the rough relative IO cost of cache misses between page cache and swap-backed pages - to reflect such situations by making the swap-preferred range configurable. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-4-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: keep separate anon and file statistics on page reclaim activityJohannes Weiner
Having statistics on pages scanned and pages reclaimed for both anon and file pages makes it easier to evaluate changes to LRU balancing. While at it, clean up the stat-keeping mess for isolation, putback, reclaim stats etc. a bit: first the physical LRU operation (isolation and putback), followed by vmstats, reclaim_stats, and then vm events. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-3-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm/vmscan: update the comment of should_continue_reclaim()Qiwu Chen
try_to_compact_zone() has been replaced by try_to_compact_pages(), which is necessary to be updated in the comment of should_continue_reclaim(). Signed-off-by: Qiwu Chen <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200501034907.22991-1-chenqiwu@xiaomi.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm/vmscan.c: change prototype for shrink_page_listManinder Singh
commit 3c710c1ad11b ("mm, vmscan extract shrink_page_list reclaim counters into a struct") changed data type for the function, so changing return type for funciton and its caller. Signed-off-by: Vaneet Narang <v.narang@samsung.com> Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Amit Sahrawat <a.sahrawat@samsung.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/1588168259-25604-1-git-send-email-maninder1.s@samsung.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm/vmscan: count layzfree pages and fix nr_isolated_* mismatchJaewon Kim
Fix an nr_isolate_* mismatch problem between cma and dirty lazyfree pages. If try_to_unmap_one is used for reclaim and it detects a dirty lazyfree page, then the lazyfree page is changed to a normal anon page having SwapBacked by commit 802a3a92ad7a ("mm: reclaim MADV_FREE pages"). Even with the change, reclaim context correctly counts isolated files because it uses is_file_lru to distinguish file. And the change to anon is not happened if try_to_unmap_one is used for migration. So migration context like compaction also correctly counts isolated files even though it uses page_is_file_lru insted of is_file_lru. Recently page_is_file_cache was renamed to page_is_file_lru by commit 9de4f22a60f7 ("mm: code cleanup for MADV_FREE"). But the nr_isolate_* mismatch problem happens on cma alloc. There is reclaim_clean_pages_from_list which is being used only by cma. It was introduced by commit 02c6de8d757c ("mm: cma: discard clean pages during contiguous allocation instead of migration") to reclaim clean file pages without migration. The cma alloc uses both reclaim_clean_pages_from_list and migrate_pages, and it uses page_is_file_lru to count isolated files. If there are dirty lazyfree pages allocated from cma memory region, the pages are counted as isolated file at the beginging but are counted as isolated anon after finished. Mem-Info: Node 0 active_anon:3045904kB inactive_anon:611448kB active_file:14892kB inactive_file:205636kB unevictable:10416kB isolated(anon):0kB isolated(file):37664kB mapped:630216kB dirty:384kB writeback:0kB shmem:42576kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Like log above, there were too much isolated files, 37664kB, which triggers too_many_isolated in reclaim even when there is no actually isolated file in system wide. It could be reproducible by running two programs, writing on MADV_FREE page and doing cma alloc, respectively. Although isolated anon is 0, I found that the internal value of isolated anon was the negative value of isolated file. Fix this by compensating the isolated count for both LRU lists. Count non-discarded lazyfree pages in shrink_page_list, then compensate the counted number in reclaim_clean_pages_from_list. Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200426011718.30246-1-jaewon31.kim@samsung.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm/vmscan.c: use update_lru_size() in update_lru_sizes()Wei Yang
We already defined the helper update_lru_size(). Let's use this to reduce code duplication. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200331221550.1011-1-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: simplify calling a compound page destructorMatthew Wilcox (Oracle)
None of the three callers of get_compound_page_dtor() want to know the value; they just want to call the function. Replace it with destroy_compound_page() which calls the dtor for them. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200517105051.9352-1-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm/page_alloc: integrate classzone_idx and high_zoneidxJoonsoo Kim
classzone_idx is just different name for high_zoneidx now. So, integrate them and add some comment to struct alloc_context in order to reduce future confusion about the meaning of this variable. The accessor, ac_classzone_idx() is also removed since it isn't needed after integration. In addition to integration, this patch also renames high_zoneidx to highest_zoneidx since it represents more precise meaning. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ye Xiaolong <xiaolong.ye@intel.com> Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLENeilBrown
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a daemon needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. The purpose of this is to avoid deadlock that happen when the PF_LESS_THROTTLE process must write for any dirty pages to be freed, but it is being thottled and cannot write. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. The important issue is not that the daemon needs a *larger* dirty page allowance, but that it needs a *private* dirty page allowance, so that dirty pages for the "client" bdi that it is helping to clear (the bdi for an NFS filesystem or loop block device etc) do not affect the throttling of the daemon writing to the "final" bdi. This patch changes the heuristic so that the task is not throttled when the bdi it is writing to has a dirty page count below below (or equal to) the free-run threshold for that bdi. This ensures it will always be able to have some pages in flight, and so will not deadlock. In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might still be throttled by global threshold, but that is acceptable as it is only the deadlock state that is interesting for this flag. This approach of "only throttle when target bdi is busy" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - removes the 25% bonus that that flag gives, and - If PF_LOCAL_THROTTLE is set, don't delay at all unless the global and the local free-run thresholds are exceeded. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. [akpm@linux-foundation.org: coding style fixes] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Chuck Lever <chuck.lever@oracle.com> [nfsd] Cc: Christoph Hellwig <hch@lst.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07mm/vmscan: remove unnecessary argument description of isolate_lru_pages()Qiwu Chen
Since commit a9e7c39fa9fd9 ("mm/vmscan.c: remove 7th argument of isolate_lru_pages()"), the explanation of 'mode' argument has been unnecessary. Let's remove it. Signed-off-by: Qiwu Chen <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200501090346.2894-1-chenqiwu@xiaomi.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm: code cleanup for MADV_FREEHuang Ying
Some comments for MADV_FREE is revised and added to help people understand the MADV_FREE code, especially the page flag, PG_swapbacked. This makes page_is_file_cache() isn't consistent with its comments. So the function is renamed to page_is_file_lru() to make them consistent again. All these are put in one patch as one logical change. Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/vmscan.c: do_try_to_free_pages(): clean code by removing unnecessary ↵Mateusz Nosek
assignment sc->memcg_low_skipped resets skipped_deactivate to 0 but this is not needed as this code path is never reachable with skipped_deactivate != 0 due to previous sc->skipped_deactivate branch. [mhocko@kernel.org: rewrite changelog] Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200319165938.23354-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/vmscan.c: make may_enter_fs bool in shrink_page_list()Kirill Tkhai
This gives some size improvement: $size mm/vmscan.o (before) text data bss dec hex filename 53670 24123 12 77805 12fed mm/vmscan.o $size mm/vmscan.o (after) text data bss dec hex filename 53648 24123 12 77783 12fd7 mm/vmscan.o Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/Message-ID: Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/vmscan.c: clean code by removing unnecessary assignmentMateusz Nosek
Previously 0 was assigned to variable 'lruvec_size', but the variable was never read later. So the assignment can be removed. Fixes: f87bccde6a7d ("mm/vmscan: remove unused lru_pages argument") Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200229214022.11853-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/vmscan.c: fix data races using kswapd_classzone_idxQian Cai
pgdat->kswapd_classzone_idx could be accessed concurrently in wakeup_kswapd(). Plain writes and reads without any lock protection result in data races. Fix them by adding a pair of READ|WRITE_ONCE() as well as saving a branch (compilers might well optimize the original code in an unintentional way anyway). While at it, also take care of pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim(). The data races were reported by KCSAN, BUG: KCSAN: data-race in wakeup_kswapd / wakeup_kswapd write to 0xffff9f427ffff2dc of 4 bytes by task 7454 on cpu 13: wakeup_kswapd+0xf1/0x400 wakeup_kswapd at mm/vmscan.c:3967 wake_all_kswapds+0x59/0xc0 wake_all_kswapds at mm/page_alloc.c:4241 __alloc_pages_slowpath+0xdcc/0x1290 __alloc_pages_slowpath at mm/page_alloc.c:4512 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x16e/0x6f0 __handle_mm_fault+0xcd5/0xd40 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 1 lock held by mtest01/7454: #0: ffff9f425afe8808 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9 do_user_addr_fault at arch/x86/mm/fault.c:1405 (inlined by) do_page_fault at arch/x86/mm/fault.c:1539 irq event stamp: 6944085 count_memcg_event_mm+0x1a6/0x270 count_memcg_event_mm+0x119/0x270 __do_softirq+0x34c/0x57c irq_exit+0xa2/0xc0 read to 0xffff9f427ffff2dc of 4 bytes by task 7472 on cpu 38: wakeup_kswapd+0xc8/0x400 wake_all_kswapds+0x59/0xc0 __alloc_pages_slowpath+0xdcc/0x1290 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x16e/0x6f0 __handle_mm_fault+0xcd5/0xd40 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 1 lock held by mtest01/7472: #0: ffff9f425a9ac148 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9 irq event stamp: 6793561 count_memcg_event_mm+0x1a6/0x270 count_memcg_event_mm+0x119/0x270 __do_softirq+0x34c/0x57c irq_exit+0xa2/0xc0 BUG: KCSAN: data-race in kswapd / wakeup_kswapd write to 0xffff90973ffff2dc of 4 bytes by task 820 on cpu 6: kswapd+0x27c/0x8d0 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 read to 0xffff90973ffff2dc of 4 bytes by task 6299 on cpu 0: wakeup_kswapd+0xf3/0x450 wake_all_kswapds+0x59/0xc0 __alloc_pages_slowpath+0xdcc/0x1290 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x170/0x700 __handle_mm_fault+0xc9f/0xd00 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/1582749472-5171-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/vmscan.c: remove cpu online notification for nowWei Yang
kswapd kernel thread starts either with a CPU affinity set to the full cpu mask of its target node or without any affinity at all if the node is CPUless. There is a cpu hotplug callback (kswapd_cpu_online) that implements an elaborate way to update this mask when a cpu is onlined. It is not really clear whether there is any actual benefit from this scheme. Completely CPU-less NUMA nodes rarely gain a new CPU during runtime. Drop the code for that reason. If there is a real usecase then we can resurrect and simplify the code. [mhocko@suse.com rewrite changelog] Suggested-by: Michal Hocko <mhocko@suse.org> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20200218224422.3407-1-richardw.yang@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm: vmscan: replace open codings to NUMA_NO_NODEYang Shi
The commit 98fa15f34cb3 ("mm: replace all open encodings for NUMA_NO_NODE") did the replacement across the kernel tree, but we got some more in vmscan.c since then. Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1581568298-45317-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm: swap: make page_evictable() inlineYang Shi
When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs") to our 4.9 kernel, our test bench noticed around 10% down with a couple of vm-scalability's test cases (lru-file-readonce, lru-file-readtwice and lru-file-mmap-read). I didn't see that much down on my VM (32c-64g-2nodes). It might be caused by the test configuration, which is 32c-256g with NUMA disabled and the tests were run in root memcg, so the tests actually stress only one inactive and active lru. It sounds not very usual in mordern production environment. That commit did two major changes: 1. Call page_evictable() 2. Use smp_mb to force the PG_lru set visible It looks they contribute the most overhead. The page_evictable() is a function which does function prologue and epilogue, and that was used by page reclaim path only. However, lru add is a very hot path, so it sounds better to make it inline. However, it calls page_mapping() which is not inlined either, but the disassemble shows it doesn't do push and pop operations and it sounds not very straightforward to inline it. Other than this, it sounds smp_mb() is not necessary for x86 since SetPageLRU is atomic which enforces memory barrier already, replace it with smp_mb__after_atomic() in the following patch. With the two fixes applied, the tests can get back around 5% on that test bench and get back normal on my VM. Since the test bench configuration is not that usual and I also saw around 6% up on the latest upstream, so it sounds good enough IMHO. The below is test data (lru-file-readtwice throughput) against the v5.6-rc4: mainline w/ inline fix 150MB 154MB With this patch the throughput gets 2.67% up. The data with using smp_mb__after_atomic() is showed in the following patch. Shakeel Butt did the below test: On a real machine with limiting the 'dd' on a single node and reading 100 GiB sparse file (less than a single node). Just ran a single instance to not cause the lru lock contention. The cmdline used is "dd if=file-100GiB of=/dev/null bs=4k". Ran the cmd 10 times with drop_caches in between and measured the time it took. Without patch: 56.64143 +- 0.672 sec With patches: 56.10 +- 0.21 sec [akpm@linux-foundation.org: move page_evictable() to internal.h] Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-21mm/vmscan.c: don't round up scan size for online memory cgroupGavin Shan
Commit 68600f623d69 ("mm: don't miss the last page because of round-off error") makes the scan size round up to @denominator regardless of the memory cgroup's state, online or offline. This affects the overall reclaiming behavior: the corresponding LRU list is eligible for reclaiming only when its size logically right shifted by @sc->priority is bigger than zero in the former formula. For example, the inactive anonymous LRU list should have at least 0x4000 pages to be eligible for reclaiming when we have 60/12 for swappiness/priority and without taking scan/rotation ratio into account. After the roundup is applied, the inactive anonymous LRU list becomes eligible for reclaiming when its size is bigger than or equal to 0x1000 in the same condition. (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1 ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1 aarch64 has 512MB huge page size when the base page size is 64KB. The memory cgroup that has a huge page is always eligible for reclaiming in that case. The reclaiming is likely to stop after the huge page is reclaimed, meaing the further iteration on @sc->priority and the silbing and child memory cgroups will be skipped. The overall behaviour has been changed. This fixes the issue by applying the roundup to offlined memory cgroups only, to give more preference to reclaim memory from offlined memory cgroup. It sounds reasonable as those memory is unlikedly to be used by anyone. The issue was found by starting up 8 VMs on a Ampere Mustang machine, which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and 2GB memory. It took 264 seconds for all VMs to be completely up and 784MB swap is consumed after that. With this patch applied, it took 236 seconds and 60MB swap to do same thing. So there is 10% performance improvement for my case. Note that KSM is disable while THP is enabled in the testing. total used free shared buff/cache available Mem: 16196 10065 2049 16 4081 3749 Swap: 8175 784 7391 total used free shared buff/cache available Mem: 16196 11324 3656 24 1215 2936 Swap: 8175 60 8115 Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> [4.20+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONEAlex Shi
Commit 1b2ffb7896ad ("[PATCH] Zone reclaim: Allow modification of zone reclaim behavior")' defined RECLAIM_OFF/RECLAIM_ZONE, but never use them, so better to remove them. [dwagner@suse.de: fix sanity checks enabling] Link: http://lkml.kernel.org/r/20200116131642.642-1-dwagner@suse.de [akpm@linux-foundation.org: renumber the bits for neatness] Link: http://lkml.kernel.org/r/1579005573-58923-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Tobin C. Harding" <tobin@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>