summaryrefslogtreecommitdiff
path: root/include/linux/swap.h
AgeCommit message (Collapse)Author
2021-03-02swap: fix swapfile read/write offsetJens Axboe
We're not factoring in the start of the file for where to write and read the swapfile, which leads to very unfortunate side effects of writing where we should not be... Fixes: 48d15436fde6 ("mm: remove get_swap_bio") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24mm/vmscan: __isolate_lru_page_prepare() cleanupAlex Shi
The function just returns 2 results, so using a 'switch' to deal with its result is unnecessary. Also simplify it to a bool func as Vlastimil suggested. Also remove 'goto' by reusing list_move(), and take Matthew Wilcox's suggestion to update comments in function. Link: https://lkml.kernel.org/r/728874d7-2d93-4049-68c1-dcc3b2d52ccd@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24mm: memcg: add swapcache stat for memcg v2Shakeel Butt
This patch adds swapcache stat for the cgroup v2. The swapcache represents the memory that is accounted against both the memory and the swap limit of the cgroup. The main motivation behind exposing the swapcache stat is for enabling users to gracefully migrate from cgroup v1's memsw counter to cgroup v2's memory and swap counters. Cgroup v1's memsw limit allows users to limit the memory+swap usage of a workload but without control on the exact proportion of memory and swap. Cgroup v2 provides separate limits for memory and swap which enables more control on the exact usage of memory and swap individually for the workload. With some little subtleties, the v1's memsw limit can be switched with the sum of the v2's memory and swap limits. However the alternative for memsw usage is not yet available in cgroup v2. Exposing per-cgroup swapcache stat enables that alternative. Adding the memory usage and swap usage and subtracting the swapcache will approximate the memsw usage. This will help in the transparent migration of the workloads depending on memsw usage and limit to v2' memory and swap counters. The reasons these applications are still interested in this approximate memsw usage are: (1) these applications are not really interested in two separate memory and swap usage metrics. A single usage metric is more simple to use and reason about for them. (2) The memsw usage metric hides the underlying system's swap setup from the applications. Applications with multiple instances running in a datacenter with heterogeneous systems (some have swap and some don't) will keep seeing a consistent view of their usage. [akpm@linux-foundation.org: fix CONFIG_SWAP=n build] Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-27mm: remove get_swap_bioChristoph Hellwig
Just reuse the block_device and sector from the swap_info structure, just as used by the SWP_SYNCHRONOUS path. Also remove the checks for NULL returns from bio_alloc as that can't happen for sleeping allocations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-15mm/compaction: do page isolation first in compactionAlex Shi
Currently, compaction would get the lru_lock and then do page isolation which works fine with pgdat->lru_lock, since any page isoltion would compete for the lru_lock. If we want to change to memcg lru_lock, we have to isolate the page before getting lru_lock, thus isoltion would block page's memcg change which relay on page isoltion too. Then we could safely use per memcg lru_lock later. The new page isolation use previous introduced TestClearPageLRU() + pgdat lru locking which will be changed to memcg lru lock later. Hugh Dickins <hughd@google.com> fixed following bugs in this patch's early version: Fix lots of crashes under compaction load: isolate_migratepages_block() must clean up appropriately when rejecting a page, setting PageLRU again if it had been cleared; and a put_page() after get_page_unless_zero() cannot safely be done while holding locked_lruvec - it may turn out to be the final put_page(), which will take an lruvec lock when PageLRU. And move __isolate_lru_page_prepare back after get_page_unless_zero to make trylock_page() safe: trylock_page() is not safe to use at this time: its setting PG_locked can race with the page being freed or allocated ("Bad page"), and can also erase flags being set by one of those "sole owners" of a freshly allocated page who use non-atomic __SetPageFlag(). Link: https://lkml.kernel.org/r/1604566549-62481-16-git-send-email-alex.shi@linux.alibaba.com Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Chen, Rong A" <rong.a.chen@intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/thp: move lru_add_page_tail() to huge_memory.cAlex Shi
Patch series "per memcg lru lock", v21. This patchset includes 3 parts: 1) some code cleanup and minimum optimization as preparation 2) use TestCleanPageLRU as page isolation's precondition 3) replace per node lru_lock with per memcg per node lru_lock Current lru_lock is one for each of node, pgdat->lru_lock, that guard for lru lists, but now we had moved the lru lists into memcg for long time. Still using per node lru_lock is clearly unscalable, pages on each of memcgs have to compete each others for a whole lru_lock. This patchset try to use per lruvec/memcg lru_lock to repleace per node lru lock to guard lru lists, make it scalable for memcgs and get performance gain. Currently lru_lock still guards both lru list and page's lru bit, that's ok. but if we want to use specific lruvec lock on the page, we need to pin down the page's lruvec/memcg during locking. Just taking lruvec lock first may be undermined by the page's memcg charge/migration. To fix this problem, we could take out the page's lru bit clear and use it as pin down action to block the memcg changes. That's the reason for new atomic func TestClearPageLRU. So now isolating a page need both actions: TestClearPageLRU and hold the lru_lock. The typical usage of this is isolate_migratepages_block() in compaction.c we have to take lru bit before lru lock, that serialized the page isolation in memcg page charge/migration which will change page's lruvec and new lru_lock in it. The above solution suggested by Johannes Weiner, and based on his new memcg charge path, then have this patchset. (Hugh Dickins tested and contributed much code from compaction fix to general code polish, thanks a lot!). Daniel Jordan's testing show 62% improvement on modified readtwice case on his 2P * 10 core * 2 HT broadwell box on v18, which has no much different with this v20. https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/ Thanks to Hugh Dickins and Konstantin Khlebnikov, they both brought this idea 8 years ago, and others who gave comments as well: Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc. Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu, and Yun Wang. Hugh Dickins also shared his kbuild-swap case. This patch (of 19): lru_add_page_tail() is only used in huge_memory.c, defining it in other file with a CONFIG_TRANSPARENT_HUGEPAGE macro restrict just looks weird. Let's move it THP. And make it static as Hugh Dickins suggested. Link: https://lkml.kernel.org/r/1604566549-62481-1-git-send-email-alex.shi@linux.alibaba.com Link: https://lkml.kernel.org/r/1604566549-62481-2-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: "Chen, Rong A" <rong.a.chen@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm: remove activate_page() from unuse_pte()Yu Zhao
We don't initially add anon pages to active lruvec after commit b518154e59aa ("mm/vmscan: protect the workingset on anonymous LRU"). Remove activate_page() from unuse_pte(), which seems to be missed by the commit. And make the function static while we are at it. Before the commit, we called lru_cache_add_active_or_unevictable() to add new ksm pages to active lruvec. Therefore, activate_page() wasn't necessary for them in the first place. Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200818184704.3625199-1-yuzhao@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13swap: rename SWP_FS to SWAP_FS_OPS to avoid ambiguityGao Xiang
SWP_FS is used to make swap_{read,write}page() go through the filesystem, and it's only used for swap files over NFS for now. Otherwise it will directly submit IO to blockdev according to swapfile extents reported by filesystems in advance. As Matthew pointed out [1], SWP_FS naming is somewhat confusing, so let's rename to SWP_FS_OPS. [1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm: factor find_get_incore_page out of mincore_pageMatthew Wilcox (Oracle)
Patch series "Return head pages from find_*_entry", v2. This patch series started out as part of the THP patch set, but it has some nice effects along the way and it seems worth splitting it out and submitting separately. Currently find_get_entry() and find_lock_entry() return the page corresponding to the requested index, but the first thing most callers do is find the head page, which we just threw away. As part of auditing all the callers, I found some misuses of the APIs and some plain inefficiencies that I've fixed. The diffstat is unflattering, but I added more kernel-doc and a new wrapper. This patch (of 8); Provide this functionality from the swap cache. It's useful for more than just mincore(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200910183318.20139-1-willy@infradead.org Link: https://lkml.kernel.org/r/20200910183318.20139-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-23mm: split swap_type_ofChristoph Hellwig
swap_type_of is used for two entirely different purposes: (1) check what swap type a given device/offset corresponds to (2) find the first available swap device that can be written to Mixing both in a single function creates an unreadable mess. Create two separate functions instead, and switch both to pass a dev_t instead of a struct block_device to further simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-12mm/swap: implement workingset detection for anonymous LRUJoonsoo Kim
This patch implements workingset detection for anonymous LRU. All the infrastructure is implemented by the previous patches so this patch just activates the workingset detection by installing/retrieving the shadow entry and adding refault calculation. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/swapcache: support to handle the shadow entriesJoonsoo Kim
Workingset detection for anonymous page will be implemented in the following patch and it requires to store the shadow entries into the swapcache. This patch implements an infrastructure to store the shadow entry in the swapcache. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/vmscan: protect the workingset on anonymous LRUJoonsoo Kim
In current implementation, newly created or swap-in anonymous page is started on active list. Growing active list results in rebalancing active/inactive list so old pages on active list are demoted to inactive list. Hence, the page on active list isn't protected at all. Following is an example of this situation. Assume that 50 hot pages on active list. Numbers denote the number of pages on active/inactive list (active | inactive). 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(uo) | 50(h) 3. workload: another 50 newly created (used-once) pages 50(uo) | 50(uo), swap-out 50(h) This patch tries to fix this issue. Like as file LRU, newly created or swap-in anonymous pages will be inserted to the inactive list. They are promoted to active list if enough reference happens. This simple modification changes the above example as following. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(h) | 50(uo) 3. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(uo) As you can see, hot pages on active list would be protected. Note that, this implementation has a drawback that the page cannot be promoted and will be swapped-out if re-access interval is greater than the size of inactive list but less than the size of total(active+inactive). To solve this potential issue, following patch will apply workingset detection similar to the one that's already applied to file LRU. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/page_alloc: remove nr_free_pagecache_pages()David Hildenbrand
nr_free_pagecache_pages() isn't used outside page_alloc.c anymore - and the name does not really help to understand what's going on. Let's open-code it instead and add a comment. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Huang Ying <ying.huang@intel.com> Link: http://lkml.kernel.org/r/20200619132410.23859-3-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: remove vm_total_pagesDavid Hildenbrand
The global variable "vm_total_pages" is a relic from older days. There is only a single user that reads the variable - build_all_zonelists() - and the first thing it does is update it. Use a local variable in build_all_zonelists() instead and remove the global variable. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26mm: workingset: age nonresident information alongside anonymous pagesJohannes Weiner
Patch series "fix for "mm: balance LRU lists based on relative thrashing" patchset" This patchset fixes some problems of the patchset, "mm: balance LRU lists based on relative thrashing", which is now merged on the mainline. Patch "mm: workingset: let cache workingset challenge anon fix" is the result of discussion with Johannes. See following link. http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.org And, the other two are minor things which are found when I try to rebase my patchset. This patch (of 3): After ("mm: workingset: let cache workingset challenge anon fix"), we compare refault distances to active_file + anon. But age of the non-resident information is only driven by the file LRU. As a result, we may overestimate the recency of any incoming refaults and activate them too eagerly, causing unnecessary LRU churn in certain situations. Make anon aging drive nonresident age as well to address that. Link: http://lkml.kernel.org/r/1592288204-27734-1-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1592288204-27734-2-git-send-email-iamjoonsoo.kim@lge.com Fixes: 34e58cac6d8f2a ("mm: workingset: let cache workingset challenge anon") Reported-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: vmscan: reclaim writepage is IO costJohannes Weiner
The VM tries to balance reclaim pressure between anon and file so as to reduce the amount of IO incurred due to the memory shortage. It already counts refaults and swapins, but in addition it should also count writepage calls during reclaim. For swap, this is obvious: it's IO that wouldn't have occurred if the anonymous memory hadn't been under memory pressure. From a relative balancing point of view this makes sense as well: even if anon is cold and reclaimable, a cache that isn't thrashing may have equally cold pages that don't require IO to reclaim. For file writeback, it's trickier: some of the reclaim writepage IO would have likely occurred anyway due to dirty expiration. But not all of it - premature writeback reduces batching and generates additional writes. Since the flushers are already woken up by the time the VM starts writing cache pages one by one, let's assume that we'e likely causing writes that wouldn't have happened without memory pressure. In addition, the per-page cost of IO would have probably been much cheaper if written in larger batches from the flusher thread rather than the single-page-writes from kswapd. For our purposes - getting the trend right to accelerate convergence on a stable state that doesn't require paging at all - this is sufficiently accurate. If we later wanted to optimize for sustained thrashing, we can still refine the measurements. Count all writepage calls from kswapd as IO cost toward the LRU that the page belongs to. Why do this dynamically? Don't we know in advance that anon pages require IO to reclaim, and so could build in a static bias? First, scanning is not the same as reclaiming. If all the anon pages are referenced, we may not swap for a while just because we're scanning the anon list. During this time, however, it's important that we age anonymous memory and the page cache at the same rate so that their hot-cold gradients are comparable. Everything else being equal, we still want to reclaim the coldest memory overall. Second, we keep copies in swap unless the page changes. If there is swap-backed data that's mostly read (tmpfs file) and has been swapped out before, we can reclaim it without incurring additional IO. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: balance LRU lists based on relative thrashingJohannes Weiner
Since the LRUs were split into anon and file lists, the VM has been balancing between page cache and anonymous pages based on per-list ratios of scanned vs. rotated pages. In most cases that tips page reclaim towards the list that is easier to reclaim and has the fewest actively used pages, but there are a few problems with it: 1. Refaults and LRU rotations are weighted the same way, even though one costs IO and the other costs a bit of CPU. 2. The less we scan an LRU list based on already observed rotations, the more we increase the sampling interval for new references, and rotations become even more likely on that list. This can enter a death spiral in which we stop looking at one list completely until the other one is all but annihilated by page reclaim. Since commit a528910e12ec ("mm: thrash detection-based file cache sizing") we have refault detection for the page cache. Along with swapin events, they are good indicators of when the file or anon list, respectively, is too small for its workingset and needs to grow. For example, if the page cache is thrashing, the cache pages need more time in memory, while there may be colder pages on the anonymous list. Likewise, if swapped pages are faulting back in, it indicates that we reclaim anonymous pages too aggressively and should back off. Replace LRU rotations with refaults and swapins as the basis for relative reclaim cost of the two LRUs. This will have the VM target list balances that incur the least amount of IO on aggregate. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: base LRU balancing on an explicit cost modelJohannes Weiner
Currently, scan pressure between the anon and file LRU lists is balanced based on a mixture of reclaim efficiency and a somewhat vague notion of "value" of having certain pages in memory over others. That concept of value is problematic, because it has caused us to count any event that remotely makes one LRU list more or less preferrable for reclaim, even when these events are not directly comparable and impose very different costs on the system. One example is referenced file pages that we still deactivate and referenced anonymous pages that we actually rotate back to the head of the list. There is also conceptual overlap with the LRU algorithm itself. By rotating recently used pages instead of reclaiming them, the algorithm already biases the applied scan pressure based on page value. Thus, when rebalancing scan pressure due to rotations, we should think of reclaim cost, and leave assessing the page value to the LRU algorithm. Lastly, considering both value-increasing as well as value-decreasing events can sometimes cause the same type of event to be counted twice, i.e. how rotating a page increases the LRU value, while reclaiming it succesfully decreases the value. In itself this will balance out fine, but it quietly skews the impact of events that are only recorded once. The abstract metric of "value", the murky relationship with the LRU algorithm, and accounting both negative and positive events make the current pressure balancing model hard to reason about and modify. This patch switches to a balancing model of accounting the concrete, actually observed cost of reclaiming one LRU over another. For now, that cost includes pages that are scanned but rotated back to the list head. Subsequent patches will add consideration for IO caused by refaulting of recently evicted pages. Replace struct zone_reclaim_stat with two cost counters in the lruvec, and make everything that affects cost go through a new lru_note_cost() function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()Johannes Weiner
They're the same function, and for the purpose of all callers they are equivalent to lru_cache_add(). [akpm@linux-foundation.org: fix it for local_lock changes] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03mm: memcontrol: move out cgroup swaprate throttlingJohannes Weiner
The cgroup swaprate throttling is about matching new anon allocations to the rate of available IO when that is being throttled. It's the io controller hooking into the VM, rather than a memory controller thing. Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(), and drop the @memcg argument which is only used to check whether the preceding page charge has succeeded and the fault is proceeding. We could decouple the call from mem_cgroup_try_charge() here as well, but that would cause unnecessary churn: the following patches convert all callsites to a new charge API and we'll decouple as we go along. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-5-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: "A few little subsystems and a start of a lot of MM patches. Subsystems affected by this patch series: squashfs, ocfs2, parisc, vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup, swap, memcg, pagemap, memory-failure, vmalloc, kasan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits) kasan: move kasan_report() into report.c mm/mm_init.c: report kasan-tag information stored in page->flags ubsan: entirely disable alignment checks under UBSAN_TRAP kasan: fix clang compilation warning due to stack protector x86/mm: remove vmalloc faulting mm: remove vmalloc_sync_(un)mappings() x86/mm/32: implement arch_sync_kernel_mappings() x86/mm/64: implement arch_sync_kernel_mappings() mm/ioremap: track which page-table levels were modified mm/vmalloc: track which page-table levels were modified mm: add functions to track page directory modifications s390: use __vmalloc_node in stack_alloc powerpc: use __vmalloc_node in alloc_vm_stack arm64: use __vmalloc_node in arch_alloc_vmap_stack mm: remove vmalloc_user_node_flags mm: switch the test_vmalloc module to use __vmalloc_node mm: remove __vmalloc_node_flags_caller mm: remove both instances of __vmalloc_node_flags mm: remove the prot argument to __vmalloc_node mm: remove the pgprot argument to __vmalloc ...
2020-06-02include/linux/swap.h: delete meaningless __add_to_swap_cache() declarationMiaohe Lin
Since commit 8d93b41c09d1 ("mm: Convert add_to_swap_cache to XArray"), __add_to_swap_cache and add_to_swap_cache are combined into one function. There is no __add_to_swap_cache() anymore. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Link: http://lkml.kernel.org/r/1590810326-2493-1-git-send-email-linmiaohe@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02swap: reduce lock contention on swap cache from swap slots allocationHuang Ying
In some swap scalability test, it is found that there are heavy lock contention on swap cache even if we have split one swap cache radix tree per swap device to one swap cache radix tree every 64 MB trunk in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"). The reason is as follow. After the swap device becomes fragmented so that there's no free swap cluster, the swap device will be scanned linearly to find the free swap slots. swap_info_struct->cluster_next is the next scanning base that is shared by all CPUs. So nearby free swap slots will be allocated for different CPUs. The probability for multiple CPUs to operate on the same 64 MB trunk is high. This causes the lock contention on the swap cache. To solve the issue, in this patch, for SSD swap device, a percpu version next scanning base (cluster_next_cpu) is added. Every CPU will use its own per-cpu next scanning base. And after finishing scanning a 64MB trunk, the per-cpu scanning base will be changed to the beginning of another randomly selected 64MB trunk. In this way, the probability for multiple CPUs to operate on the same 64 MB trunk is reduced greatly. Thus the lock contention is reduced too. For HDD, because sequential access is more important for IO performance, the original shared next scanning base is used. To test the patch, we have run 16-process pmbench memory benchmark on a 2-socket server machine with 48 cores. One ram disk is configured as the swap device per socket. The pmbench working-set size is much larger than the available memory so that swapping is triggered. The memory read/write ratio is 80/20 and the accessing pattern is random. In the original implementation, the lock contention on the swap cache is heavy. The perf profiling data of the lock contention code path is as following, _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list: 7.91 _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list: 7.11 _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51 _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 1.29 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 0.93 After applying this patch, it becomes, _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 2.3 _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 1.8 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19 The lock contention on the swap cache is almost eliminated. And the pmbench score increases 18.5%. The swapin throughput increases 18.7% from 2.96 GB/s to 3.51 GB/s. While the swapout throughput increases 18.5% from 2.99 GB/s to 3.54 GB/s. We need really fast disk to show the benefit. I have tried this on 2 Intel P3600 NVMe disks. The performance improvement is only about 1%. The improvement should be better on the faster disks, such as Intel Optane disk. [ying.huang@intel.com: fix cluster_next_cpu allocation and freeing, per Daniel] Link: http://lkml.kernel.org/r/20200525002648.336325-1-ying.huang@intel.com [ying.huang@intel.com: v4] Link: http://lkml.kernel.org/r/20200529010840.928819-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200520031502.175659-1-ying.huang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: classify SWAP_MAP_XXX to make it more readableWei Yang
swap_info_struct->swap_map[] encodes some flag and count. And to do some condition check, it also introduces some special values. Currently those macros are defined with some magic order, which makes audience hard to understand the exact meaning. This patch split those macros into three categories: flag special value for first swap_map special value for continued swap_map May this help audiences a little. [akpm@linux-foundation.org: tweak capitalization in comments] Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200501015259.32237-1-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-28mm/swap: Use local_lock for protectionIngo Molnar
The various struct pagevec per CPU variables are protected by disabling either preemption or interrupts across the critical sections. Inside these sections spinlocks have to be acquired. These spinlocks are regular spinlock_t types which are converted to "sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping locks cannot be acquired in preemption or interrupt disabled sections. local locks provide a trivial way to substitute preempt and interrupt disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps to preempt_disable() and local_lock_irq() to local_irq_disable(). Create lru_rotate_pvecs containing the pagevec and the locallock. Create lru_pvecs containing the remaining pagevecs and the locallock. Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid exporting the pvec structure. Change the relevant call sites to acquire these locks instead of using preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() / local_irq_save(). There is neither a functional change nor a change in the generated binary code for non PREEMPT_RT enabled non-debug kernels. When lockdep is enabled local locks have lockdep maps embedded. These allow lockdep to validate the protections, i.e. inappropriate usage of a preemption only protected sections would result in a lockdep warning while the same problem would not be noticed with a plain preempt_disable() based protection. local locks also improve readability as they provide a named scope for the protections while preempt/interrupt disable are opaque scopeless. Finally local locks allow PREEMPT_RT to substitute them with real locking primitives to ensure the correctness of operation in a fully preemptible kernel. [ bigeasy: Adopted to use local_lock ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20200527201119.1692513-4-bigeasy@linutronix.de
2020-04-18swap.h: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2020-04-02mm: swap: make page_evictable() inlineYang Shi
When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs") to our 4.9 kernel, our test bench noticed around 10% down with a couple of vm-scalability's test cases (lru-file-readonce, lru-file-readtwice and lru-file-mmap-read). I didn't see that much down on my VM (32c-64g-2nodes). It might be caused by the test configuration, which is 32c-256g with NUMA disabled and the tests were run in root memcg, so the tests actually stress only one inactive and active lru. It sounds not very usual in mordern production environment. That commit did two major changes: 1. Call page_evictable() 2. Use smp_mb to force the PG_lru set visible It looks they contribute the most overhead. The page_evictable() is a function which does function prologue and epilogue, and that was used by page reclaim path only. However, lru add is a very hot path, so it sounds better to make it inline. However, it calls page_mapping() which is not inlined either, but the disassemble shows it doesn't do push and pop operations and it sounds not very straightforward to inline it. Other than this, it sounds smp_mb() is not necessary for x86 since SetPageLRU is atomic which enforces memory barrier already, replace it with smp_mb__after_atomic() in the following patch. With the two fixes applied, the tests can get back around 5% on that test bench and get back normal on my VM. Since the test bench configuration is not that usual and I also saw around 6% up on the latest upstream, so it sounds good enough IMHO. The below is test data (lru-file-readtwice throughput) against the v5.6-rc4: mainline w/ inline fix 150MB 154MB With this patch the throughput gets 2.67% up. The data with using smp_mb__after_atomic() is showed in the following patch. Shakeel Butt did the below test: On a real machine with limiting the 'dd' on a single node and reading 100 GiB sparse file (less than a single node). Just ran a single instance to not cause the lru lock contention. The cmdline used is "dd if=file-100GiB of=/dev/null bs=4k". Ran the cmd 10 times with drop_caches in between and measured the time it took. Without patch: 56.64143 +- 0.672 sec With patches: 56.10 +- 0.21 sec [akpm@linux-foundation.org: move page_evictable() to internal.h] Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01mm: vmscan: detect file thrashing at the reclaim rootJohannes Weiner
We use refault information to determine whether the cache workingset is stable or transitioning, and dynamically adjust the inactive:active file LRU ratio so as to maximize protection from one-off cache during stable periods, and minimize IO during transitions. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, refaults only affect the local LRU order in the cgroup in which they are occuring. As a result, cache transitions can take longer in a cgrouped system as the active pages of sibling cgroups aren't challenged when they should be. [ Right now, this is somewhat theoretical, because the siblings, under continued regular reclaim pressure, should eventually run out of inactive pages - and since inactive:active *size* balancing is also done on a cgroup-local level, we will challenge the active pages eventually in most cases. But the next patch will move that relative size enforcement to the reclaim root as well, and then this patch here will be necessary to propagate refault pressure to siblings. ] This patch moves refault detection to the root of reclaim. Instead of remembering the cgroup owner of an evicted page, remember the cgroup that caused the reclaim to happen. When refaults later occur, they'll correctly influence the cross-cgroup LRU order that reclaim follows. I.e. if global reclaim kicked out pages in some subgroup A/B/C, the refault of those pages will challenge the global LRU order, and not just the local order down inside C. [hannes@cmpxchg.org: use page_memcg() instead of another lookup] Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: introduce MADV_PAGEOUTMinchan Kim
When a process expects no accesses to a certain memory range for a long time, it could hint kernel that the pages can be reclaimed instantly but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall. MADV_PAGEOUT can be used by a process to mark a memory range as not expected to be used for a long time so that kernel reclaims *any LRU* pages instantly. The hint can help kernel in deciding which pages to evict proactively. A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit intentionally because it's automatically bounded by PMD size. If PMD size(e.g., 256) makes some trouble, we could fix it later by limit it to SWAP_CLUSTER_MAX[1]. - man-page material MADV_PAGEOUT (since Linux x.x) Do not expect access in the near future so pages in the specified regions could be reclaimed instantly regardless of memory pressure. Thus, access in the range after successful operation could cause major page fault but never lose the up-to-date contents unlike MADV_DONTNEED. Pages belonging to a shared mapping are only processed if a write access is allowed for the calling process. MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/ [minchan@kernel.org: clear PG_active on MADV_PAGEOUT] Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: introduce MADV_COLDMinchan Kim
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, swap: use rbtree for swap_extentAaron Lu
swap_extent is used to map swap page offset to backing device's block offset. For a continuous block range, one swap_extent is used and all these swap_extents are managed in a linked list. These swap_extents are used by map_swap_entry() during swap's read and write path. To find out the backing device's block offset for a page offset, the swap_extent list will be traversed linearly, with curr_swap_extent being used as a cache to speed up the search. This works well as long as swap_extents are not huge or when the number of processes that access swap device are few, but when the swap device has many extents and there are a number of processes accessing the swap device concurrently, it can be a problem. On one of our servers, the disk's remaining size is tight: $df -h Filesystem Size Used Avail Use% Mounted on ... ... /dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4 When creating a 80G swapfile there, there are as many as 84656 swap extents. The end result is, kernel spends abou 30% time in map_swap_entry() and swap throughput is only 70MB/s. As a comparison, when I used smaller sized swapfile, like 4G whose swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and map_swap_entry() is about 3%. One downside of using rbtree for swap_extent is, 'struct rbtree' takes 24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more for each swap_extent. For a swapfile that has 80k swap_extents, that means 625KiB more memory consumed. Test: Since it's not possible to reboot that server, I can not test this patch diretly there. Instead, I tested it on another server with NVMe disk. I created a 20G swapfile on an NVMe backed XFS fs. By default, the filesystem is quite clean and the created swapfile has only 2 extents. Testing vanilla and this patch shows no obvious performance difference when swapfile is not fragmented. To see the patch's effects, I used some tweaks to manually fragment the swapfile by breaking the extent at 1M boundary. This made the swapfile have 20K extents. nr_task=4 kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf) vanilla 165191 90.77% 171798 90.21% patched 858993 +420% 2.16% 715827 +317% 0.77% nr_task=8 kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf) vanilla 306783 92.19% 318145 87.76% patched 954437 +211% 2.35% 1073741 +237% 1.57% swapout: the throughput of swap out, in KB/s, higher is better 1st map_swap_entry: cpu cycles percent sampled by perf swapin: the throughput of swap in, in KB/s, higher is better. 2nd map_swap_entry: cpu cycles percent sampled by perf nr_task=1 doesn't show any difference, this is due to the curr_swap_extent can be effectively used to cache the correct swap extent for single task workload. [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/] Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, swap: fix race between swapoff and some swap operationsHuang Ying
When swapin is performed, after getting the swap entry information from the page table, system will swap in the swap entry, without any lock held to prevent the swap device from being swapoff. This may cause the race like below, CPU 1 CPU 2 ----- ----- do_swap_page swapin_readahead __read_swap_cache_async swapoff swapcache_prepare p->swap_map = NULL __swap_duplicate p->swap_map[?] /* !!! NULL pointer access */ Because swapoff is usually done when system shutdown only, the race may not hit many people in practice. But it is still a race need to be fixed. To fix the race, get_swap_device() is added to check whether the specified swap entry is valid in its swap device. If so, it will keep the swap entry valid via preventing the swap device from being swapoff, until put_swap_device() is called. Because swapoff() is very rare code path, to make the normal path runs as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of reference count is used to implement get/put_swap_device(). >From get_swap_device() to put_swap_device(), RCU reader side is locked, so synchronize_rcu() in swapoff() will wait until put_swap_device() is called. In addition to swap_map, cluster_info, etc. data structure in the struct swap_info_struct, the swap cache radix tree will be freed after swapoff, so this patch fixes the race between swap cache looking up and swapoff too. Races between some other swap cache usages and swapoff are fixed too via calling synchronize_rcu() between clearing PageSwapCache() and freeing swap cache data structure. Another possible method to fix this is to use preempt_off() + stop_machine() to prevent the swap device from being swapoff when its data structure is being accessed. The overhead in hot-path of both methods is similar. The advantages of RCU based method are, 1. stop_machine() may disturb the normal execution code path on other CPUs. 2. File cache uses RCU to protect its radix tree. If the similar mechanism is used for swap cache too, it is easier to share code between them. 3. RCU is used to protect swap cache in total_swapcache_pages() and exit_swap_address_space() already. The two mechanisms can be merged to simplify the logic. Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com Fixes: 235b62176712 ("mm/swap: add cluster lock") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Not-nacked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14include/linux/swap.h: use offsetof() instead of custom __swapoffset macroPi-Hsun Shih
Use offsetof() to calculate offset of a field to take advantage of compiler built-in version when possible, and avoid UBSAN warning when compiling with Clang: UBSAN: Undefined behaviour in mm/swapfile.c:3010:38 member access within null pointer of type 'union swap_header' CPU: 6 PID: 1833 Comm: swapon Tainted: G S 4.19.23 #43 Call trace: dump_backtrace+0x0/0x194 show_stack+0x20/0x2c __dump_stack+0x20/0x28 dump_stack+0x70/0x94 ubsan_epilogue+0x14/0x44 ubsan_type_mismatch_common+0xf4/0xfc __ubsan_handle_type_mismatch_v1+0x34/0x54 __se_sys_swapon+0x654/0x1084 __arm64_sys_swapon+0x1c/0x24 el0_svc_common+0xa8/0x150 el0_svc_compat_handler+0x2c/0x38 el0_svc_compat+0x8/0x18 Link: http://lkml.kernel.org/r/20190312081902.223764-1-pihsun@chromium.org Signed-off-by: Pi-Hsun Shih <pihsun@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm/workingset: remove unused @mapping argument in workingset_eviction()Andrey Ryabinin
workingset_eviction() doesn't use and never did use the @mapping argument. Remove it. Link: http://lkml.kernel.org/r/20190228083329.31892-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: swap: use mem_cgroup_is_root() instead of deferencing css->parentYang Shi
mem_cgroup_is_root() is the preferred API to check if memcg is root or not. Use it instead of deferencing css->parent. Link: http://lkml.kernel.org/r/1547232913-118148-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/swap: use nr_node_ids for avail_lists in swap_info_structAaron Lu
Since a2468cc9bfdf ("swap: choose swap device according to numa node"), avail_lists field of swap_info_struct is changed to an array with MAX_NUMNODES elements. This made swap_info_struct size increased to 40KiB and needs an order-4 page to hold it. This is not optimal in that: 1 Most systems have way less than MAX_NUMNODES(1024) nodes so it is a waste of memory; 2 It could cause swapon failure if the swap device is swapped on after system has been running for a while, due to no order-4 page is available as pointed out by Vasily Averin. Solve the above two issues by using nr_node_ids(which is the actual possible node number the running system has) for avail_lists instead of MAX_NUMNODES. nr_node_ids is unknown at compile time so can't be directly used when declaring this array. What I did here is to declare avail_lists as zero element array and allocate space for it when allocating space for swap_info_struct. The reason why keep using array but not pointer is plist_for_each_entry needs the field to be part of the struct, so pointer will not work. This patch is on top of Vasily Averin's fix commit. I think the use of kvzalloc for swap_info_struct is still needed in case nr_node_ids is really big on some systems. Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is nWei Yang
Commit fa5e084e43eb ("vmscan: do not unconditionally treat zones that fail zone_reclaim() as full") changed the return value of node_reclaim(). The original return value 0 means NODE_RECLAIM_SOME after this commit. While the return value of node_reclaim() when CONFIG_NUMA is n is not changed. This will leads to call zone_watermark_ok() again. This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN. Since node_reclaim() is only called in page_alloc.c, move it to mm/internal.h. Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: convert totalram_pages and totalhigh_pages variables to atomicArun KS
totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-20Merge drm/drm-next into drm-intel-next-queuedJani Nikula
Pull in v4.20-rc3 via drm-next. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2018-11-07mm, drm/i915: mark pinned shmemfs pages as unevictableKuo-Hsin Yang
The i915 driver uses shmemfs to allocate backing storage for gem objects. These shmemfs pages can be pinned (increased ref count) by shmem_read_mapping_page_gfp(). When a lot of pages are pinned, vmscan wastes a lot of time scanning these pinned pages. In some extreme case, all pages in the inactive anon lru are pinned, and only the inactive anon lru is scanned due to inactive_ratio, the system cannot swap and invokes the oom-killer. Mark these pinned pages as unevictable to speed up vmscan. Export pagevec API check_move_unevictable_pages(). This patch was inspired by Chris Wilson's change [1]. [1]: https://patchwork.kernel.org/patch/9768741/ Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> # mm part Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Link: https://patchwork.freedesktop.org/patch/msgid/20181106132324.17390-1-chris@chris-wilson.co.uk Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
2018-10-28Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds
Pull XArray conversion from Matthew Wilcox: "The XArray provides an improved interface to the radix tree data structure, providing locking as part of the API, specifying GFP flags at allocation time, eliminating preloading, less re-walking the tree, more efficient iterations and not exposing RCU-protected pointers to its users. This patch set 1. Introduces the XArray implementation 2. Converts the pagecache to use it 3. Converts memremap to use it The page cache is the most complex and important user of the radix tree, so converting it was most important. Converting the memremap code removes the only other user of the multiorder code, which allows us to remove the radix tree code that supported it. I have 40+ followup patches to convert many other users of the radix tree over to the XArray, but I'd like to get this part in first. The other conversions haven't been in linux-next and aren't suitable for applying yet, but you can see them in the xarray-conv branch if you're interested" * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits) radix tree: Remove multiorder support radix tree test: Convert multiorder tests to XArray radix tree tests: Convert item_delete_rcu to XArray radix tree tests: Convert item_kill_tree to XArray radix tree tests: Move item_insert_order radix tree test suite: Remove multiorder benchmarking radix tree test suite: Remove __item_insert memremap: Convert to XArray xarray: Add range store functionality xarray: Move multiorder_check to in-kernel tests xarray: Move multiorder_shrink to kernel tests xarray: Move multiorder account test in-kernel radix tree test suite: Convert iteration test to XArray radix tree test suite: Convert tag_tagged_items to XArray radix tree: Remove radix_tree_clear_tags radix tree: Remove radix_tree_maybe_preload_order radix tree: Remove split/join code radix tree: Remove radix_tree_update_node_t page cache: Finish XArray conversion dax: Convert page fault handlers to XArray ...
2018-10-26mm: split SWP_FILE into SWP_ACTIVATED and SWP_FSOmar Sandoval
The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go through the filesystem, and to make swapoff() call ->swap_deactivate(). For Btrfs, we want the latter but not the former, so split this flag into two. This makes us always call ->swap_deactivate() if ->swap_activate() succeeded, not just if it didn't add any swap extents itself. This also resolves the issue of the very misleading name of SWP_FILE, which is only used for swap files over NFS. Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: workingset: tell cache transitions from workingset thrashingJohannes Weiner
Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-21mm: Convert delete_from_swap_cache to XArrayMatthew Wilcox
Both callers of __delete_from_swap_cache have the swp_entry_t already, so pass that in to make constructing the XA_STATE easier. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21mm: Convert workingset to XArrayMatthew Wilcox
We construct an XA_STATE and use it to delete the node with xas_store() rather than adding a special function for this unique use case. Includes a test that simulates this usage for the test suite. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21page cache: Add and replace pages using the XArrayMatthew Wilcox
Use the XArray APIs to add and replace pages in the page cache. This removes two uses of the radix tree preload API and is significantly shorter code. It also removes the last user of __radix_tree_create() outside radix-tree.c itself, so make it static. Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-22mm, swap, get_swap_pages: use entry_size instead of cluster in parameterHuang Ying
As suggested by Matthew Wilcox, it is better to use "int entry_size" instead of "bool cluster" as parameter to specify whether to operate for huge or normal swap entries. Because this improve the flexibility to support other swap entry size. And Dave Hansen thinks that this improves code readability too. So in this patch, the "bool cluster" parameter of get_swap_pages() is replaced by "int entry_size". And nr_swap_entries() trick is used to reduce the binary size when !CONFIG_TRANSPARENT_HUGE_PAGE. text data bss dec hex filename base 24215 2028 340 26583 67d7 mm/swapfile.o head 24123 2004 340 26467 6763 mm/swapfile.o Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-09memcontrol: schedule throttling if we are congestedTejun Heo
Memory allocations can induce swapping via kswapd or direct reclaim. If we are having IO done for us by kswapd and don't actually go into direct reclaim we may never get scheduled for throttling. So instead check to see if our cgroup is congested, and if so schedule the throttling. Before we return to user space the throttling stuff will only throttle if we actually required it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-16Merge branch 'mm-rst' into docs-nextJonathan Corbet
Mike Rapoport says: These patches convert files in Documentation/vm to ReST format, add an initial index and link it to the top level documentation. There are no contents changes in the documentation, except few spelling fixes. The relatively large diffstat stems from the indentation and paragraph wrapping changes. I've tried to keep the formatting as consistent as possible, but I could miss some places that needed markup and add some markup where it was not necessary. [jc: significant conflicts in vm/hmm.rst]