From a2c43eed8334e878702fca713b212ae2a11d84b9 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:36 -0800 Subject: mm: try_to_free_swap replaces remove_exclusive_swap_page remove_exclusive_swap_page(): its problem is in living up to its name. It doesn't matter if someone else has a reference to the page (raised page_count); it doesn't matter if the page is mapped into userspace (raised page_mapcount - though that hints it may be worth keeping the swap): all that matters is that there be no more references to the swap (and no writeback in progress). swapoff (try_to_unuse) has been removing pages from swapcache for years, with no concern for page count or page mapcount, and we used to have a comment in lookup_swap_cache() recognizing that: if you go for a page of swapcache, you'll get the right page, but it could have been removed from swapcache by the time you get page lock. So, give up asking for exclusivity: get rid of remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and remove_exclusive_swap_page_count() which were spawned for the recent LRU work: replace them by the simpler try_to_free_swap() which just checks page_swapcount(). Similarly, remove the page_count limitation from free_swap_and_count(), but assume that it's worth holding on to the swap if page is mapped and swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()? It would be consistent, but I think we probably have enough for now. Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Cc: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index d196f46c8808..c8601dd36603 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -759,7 +759,7 @@ cull_mlocked: activate_locked: /* Not a candidate for swapping, so reclaim swap space. */ if (PageSwapCache(page) && vm_swap_full()) - remove_exclusive_swap_page_ref(page); + try_to_free_swap(page); VM_BUG_ON(PageActive(page)); SetPageActive(page); pgactivate++; -- cgit From 63d6c5ad7fc27455ce5cb4706884671fb7e0df08 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:38 -0800 Subject: mm: remove try_to_munlock from vmscan An unfortunate feature of the Unevictable LRU work was that reclaiming an anonymous page involved an extra scan through the anon_vma: to check that the page is evictable before allocating swap, because the swap could not be freed reliably soon afterwards. Now try_to_free_swap() has replaced remove_exclusive_swap_page(), that's not an issue any more: remove try_to_munlock() call from shrink_page_list(), leaving it to try_to_munmap() to discover if the page is one to be culled to the unevictable list - in which case then try_to_free_swap(). Update unevictable-lru.txt to remove comments on the try_to_munlock() in shrink_page_list(), and shorten some lines over 80 columns. Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Acked-by: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index c8601dd36603..74f875733e2b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -625,15 +625,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (PageAnon(page) && !PageSwapCache(page)) { if (!(sc->gfp_mask & __GFP_IO)) goto keep_locked; - switch (try_to_munlock(page)) { - case SWAP_FAIL: /* shouldn't happen */ - case SWAP_AGAIN: - goto keep_locked; - case SWAP_MLOCK: - goto cull_mlocked; - case SWAP_SUCCESS: - ; /* fall thru'; add to swap cache */ - } if (!add_to_swap(page, GFP_ATOMIC)) goto activate_locked; may_enter_fs = 1; @@ -752,6 +743,8 @@ free_it: continue; cull_mlocked: + if (PageSwapCache(page)) + try_to_free_swap(page); unlock_page(page); putback_lru_page(page); continue; -- cgit From ac47b003d03c2a4f28aef1d505b66d24ad191c4f Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:39 -0800 Subject: mm: remove gfp_mask from add_to_swap Remove gfp_mask argument from add_to_swap(): it's misleading because its only caller, shrink_page_list(), is not atomic at that point; and in due course (implementing discard) we'll sometimes want to allocate some memory with GFP_NOIO (as is used in swap_writepage) when allocating swap. No change to the gfp_mask passed down to add_to_swap_cache(): still use __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before): though it's not obvious if that's the best combination to ask for here. Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Cc: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 74f875733e2b..cc7401571cba 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -625,7 +625,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (PageAnon(page) && !PageSwapCache(page)) { if (!(sc->gfp_mask & __GFP_IO)) goto keep_locked; - if (!add_to_swap(page, GFP_ATOMIC)) + if (!add_to_swap(page)) goto activate_locked; may_enter_fs = 1; } -- cgit From 60371d971a3d01afd102f0bbf2681f32ecc31d78 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:40 -0800 Subject: mm: add add_to_swap stub If we add a failing stub for add_to_swap(), then we can remove the #ifdef CONFIG_SWAP from mm/vmscan.c. This was intended as a source cleanup, but looking more closely, it turns out that the !CONFIG_SWAP case was going to keep_locked for an anonymous page, whereas now it goes to the more suitable activate_locked, like the CONFIG_SWAP nr_swap_pages 0 case. Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Acked-by: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index cc7401571cba..f350523a8eee 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -617,7 +617,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, referenced && page_mapping_inuse(page)) goto activate_locked; -#ifdef CONFIG_SWAP /* * Anonymous process memory has backing store? * Try to allocate it some swap space here. @@ -629,7 +628,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, goto activate_locked; may_enter_fs = 1; } -#endif /* CONFIG_SWAP */ mapping = page_mapping(page); -- cgit From b962716b459505a8d83aea313fea0abe76749f42 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:41 -0800 Subject: mm: optimize get_scan_ratio for no swap Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP. Yes, the gcc optimizer gives us that, when nr_swap_pages is #defined as 0L. Move usual declaration to swapfile.c: it never belonged in page_alloc.c. Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Acked-by: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index f350523a8eee..d500b906746c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1327,12 +1327,6 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, unsigned long anon_prio, file_prio; unsigned long ap, fp; - anon = zone_page_state(zone, NR_ACTIVE_ANON) + - zone_page_state(zone, NR_INACTIVE_ANON); - file = zone_page_state(zone, NR_ACTIVE_FILE) + - zone_page_state(zone, NR_INACTIVE_FILE); - free = zone_page_state(zone, NR_FREE_PAGES); - /* If we have no swap space, do not bother scanning anon pages. */ if (nr_swap_pages <= 0) { percent[0] = 0; @@ -1340,6 +1334,12 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, return; } + anon = zone_page_state(zone, NR_ACTIVE_ANON) + + zone_page_state(zone, NR_INACTIVE_ANON); + file = zone_page_state(zone, NR_ACTIVE_FILE) + + zone_page_state(zone, NR_INACTIVE_FILE); + free = zone_page_state(zone, NR_FREE_PAGES); + /* If we have very few page cache pages, force-scan anon pages. */ if (unlikely(file + free <= zone->pages_high)) { percent[0] = 100; -- cgit From 077cbc5864cd9188fa4c4e181e48ff58317e6400 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:42 -0800 Subject: memcg: reclaim shouldn't change zone->recent_rotated statistics memcg reclaim shouldn't change zone->recent_rotated statistics. If memcgroup reclaim changes zone statistics, global reclaim can get a bit confused. Signed-off-by: KOSAKI Motohiro Acked-by: Rik van Riel Cc: Balbir Singh Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index d500b906746c..da7c3a2304a7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1246,7 +1246,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, * This helps balance scan pressure between file and anonymous * pages in get_scan_ratio. */ - zone->recent_rotated[!!file] += pgmoved; + if (scan_global_lru(sc)) + zone->recent_rotated[!!file] += pgmoved; /* * Move the pages to the [file or anon] inactive list. -- cgit From ff30153bf9647c8646538810d4c01015a5e44787 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:44 -0800 Subject: mm: make scan_all_zones_unevictable_pages() static sparse output following warning. mm/vmscan.c:2549:6: warning: symbol 'scan_all_zones_unevictable_pages' was not declared. Should it be static? cleanup here. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index da7c3a2304a7..062767d159da 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2506,7 +2506,7 @@ void scan_zone_unevictable_pages(struct zone *zone) * that has possibly/probably made some previously unevictable pages * evictable. */ -void scan_all_zones_unevictable_pages(void) +static void scan_all_zones_unevictable_pages(void) { struct zone *zone; -- cgit From 14b90b22ec0f359ef4791033ab386b2b627bae07 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:45 -0800 Subject: mm: make scan_zone_unevictable_pages() static sparse output following warning mm/vmscan.c:2507:6: warning: symbol 'scan_zone_unevictable_pages' was not declared. Should it be static? cleanup here. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 062767d159da..1ef5a2eeb298 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2464,7 +2464,7 @@ void scan_mapping_unevictable_pages(struct address_space *mapping) * back onto @zone's unevictable list. */ #define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */ -void scan_zone_unevictable_pages(struct zone *zone) +static void scan_zone_unevictable_pages(struct zone *zone) { struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list; unsigned long scan; -- cgit From a79311c14eae4bb946a97af25f3e1b17d625985d Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Tue, 6 Jan 2009 14:40:01 -0800 Subject: vmscan: bail out of direct reclaim after swap_cluster_max pages When the VM is under pressure, it can happen that several direct reclaim processes are in the pageout code simultaneously. It also happens that the reclaiming processes run into mostly referenced, mapped and dirty pages in the first round. This results in multiple direct reclaim processes having a lower pageout priority, which corresponds to a higher target of pages to scan. This in turn can result in each direct reclaim process freeing many pages. Together, they can end up freeing way too many pages. This kicks useful data out of memory (in some cases more than half of all memory is swapped out). It also impacts performance by keeping tasks stuck in the pageout code for too long. A 30% improvement in hackbench has been observed with this patch. The fix is relatively simple: in shrink_zone() we can check how many pages we have already freed, direct reclaim tasks break out of the scanning loop if they have already freed enough pages and have reached a lower priority level. We do not break out of shrink_zone() when priority == DEF_PRIORITY, to ensure that equal pressure is applied to every zone in the common case. However, in order to do this we do need to know how many pages we already freed, so move nr_reclaimed into scan_control. akpm: a historical interlude... We tried this in 2004: :commit e468e46a9bea3297011d5918663ce6d19094cf87 :Author: akpm :Date: Thu Jun 24 15:53:52 2004 +0000 : :[PATCH] vmscan.c: dont reclaim too many pages : : The shrink_zone() logic can, under some circumstances, cause far too many : pages to be reclaimed. Say, we're scanning at high priority and suddenly hit : a large number of reclaimable pages on the LRU. : Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed. And we reverted it in 2006: :commit 210fe530305ee50cd889fe9250168228b2994f32 :Author: Andrew Morton :Date: Fri Jan 6 00:11:14 2006 -0800 : : [PATCH] vmscan: balancing fix : : Revert a patch which went into 2.6.8-rc1. The changelog for that patch was: : : The shrink_zone() logic can, under some circumstances, cause far too many : pages to be reclaimed. Say, we're scanning at high priority and suddenly : hit a large number of reclaimable pages on the LRU. : : Change things so we bale out when SWAP_CLUSTER_MAX pages have been : reclaimed. : : Problem is, this change caused significant imbalance in inter-zone scan : balancing by truncating scans of larger zones. : : Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone : balancing algorithm would require that if we're scanning 100 pages of : ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will : cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are : reclaimed. Thus effectively causing smaller zones to be scanned relatively : harder than large ones. : : Now I need to remember what the workload was which caused me to write this : patch originally, then fix it up in a different way... And we haven't demonstrated that whatever problem caused that reversion is not being reintroduced by this change in 2008. Signed-off-by: Rik van Riel Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 62 ++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 33 insertions(+), 29 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 1ef5a2eeb298..5faa7739487f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -52,6 +52,9 @@ struct scan_control { /* Incremented by the number of inactive pages that were scanned */ unsigned long nr_scanned; + /* Number of pages freed so far during a call to shrink_zones() */ + unsigned long nr_reclaimed; + /* This context's GFP mask */ gfp_t gfp_mask; @@ -1400,12 +1403,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, /* * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. */ -static unsigned long shrink_zone(int priority, struct zone *zone, +static void shrink_zone(int priority, struct zone *zone, struct scan_control *sc) { unsigned long nr[NR_LRU_LISTS]; unsigned long nr_to_scan; - unsigned long nr_reclaimed = 0; unsigned long percent[2]; /* anon @ 0; file @ 1 */ enum lru_list l; @@ -1446,10 +1448,21 @@ static unsigned long shrink_zone(int priority, struct zone *zone, (unsigned long)sc->swap_cluster_max); nr[l] -= nr_to_scan; - nr_reclaimed += shrink_list(l, nr_to_scan, + sc->nr_reclaimed += shrink_list(l, nr_to_scan, zone, sc, priority); } } + /* + * On large memory systems, scan >> priority can become + * really large. This is fine for the starting priority; + * we want to put equal scanning pressure on each zone. + * However, if the VM has a harder time of freeing pages, + * with multiple processes reclaiming pages, the total + * freeing target can get unreasonably large. + */ + if (sc->nr_reclaimed > sc->swap_cluster_max && + priority < DEF_PRIORITY && !current_is_kswapd()) + break; } /* @@ -1462,7 +1475,6 @@ static unsigned long shrink_zone(int priority, struct zone *zone, shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0); throttle_vm_writeout(sc->gfp_mask); - return nr_reclaimed; } /* @@ -1476,16 +1488,13 @@ static unsigned long shrink_zone(int priority, struct zone *zone, * b) The zones may be over pages_high but they must go *over* pages_high to * satisfy the `incremental min' zone defense algorithm. * - * Returns the number of reclaimed pages. - * * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ -static unsigned long shrink_zones(int priority, struct zonelist *zonelist, +static void shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); - unsigned long nr_reclaimed = 0; struct zoneref *z; struct zone *zone; @@ -1516,10 +1525,8 @@ static unsigned long shrink_zones(int priority, struct zonelist *zonelist, priority); } - nr_reclaimed += shrink_zone(priority, zone, sc); + shrink_zone(priority, zone, sc); } - - return nr_reclaimed; } /* @@ -1544,7 +1551,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, int priority; unsigned long ret = 0; unsigned long total_scanned = 0; - unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; struct zoneref *z; @@ -1572,7 +1578,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, sc->nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zonelist, sc); + shrink_zones(priority, zonelist, sc); /* * Don't shrink slabs when reclaiming memory from * over limit cgroups @@ -1580,13 +1586,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, if (scan_global_lru(sc)) { shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages); if (reclaim_state) { - nr_reclaimed += reclaim_state->reclaimed_slab; + sc->nr_reclaimed += reclaim_state->reclaimed_slab; reclaim_state->reclaimed_slab = 0; } } total_scanned += sc->nr_scanned; - if (nr_reclaimed >= sc->swap_cluster_max) { - ret = nr_reclaimed; + if (sc->nr_reclaimed >= sc->swap_cluster_max) { + ret = sc->nr_reclaimed; goto out; } @@ -1609,7 +1615,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, } /* top priority shrink_zones still had more to do? don't OOM, then */ if (!sc->all_unreclaimable && scan_global_lru(sc)) - ret = nr_reclaimed; + ret = sc->nr_reclaimed; out: /* * Now that we've scanned all the zones at this priority level, note @@ -1704,7 +1710,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order) int priority; int i; unsigned long total_scanned; - unsigned long nr_reclaimed; struct reclaim_state *reclaim_state = current->reclaim_state; struct scan_control sc = { .gfp_mask = GFP_KERNEL, @@ -1723,7 +1728,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order) loop_again: total_scanned = 0; - nr_reclaimed = 0; + sc.nr_reclaimed = 0; sc.may_writepage = !laptop_mode; count_vm_event(PAGEOUTRUN); @@ -1809,11 +1814,11 @@ loop_again: */ if (!zone_watermark_ok(zone, order, 8*zone->pages_high, end_zone, 0)) - nr_reclaimed += shrink_zone(priority, zone, &sc); + shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages); - nr_reclaimed += reclaim_state->reclaimed_slab; + sc.nr_reclaimed += reclaim_state->reclaimed_slab; total_scanned += sc.nr_scanned; if (zone_is_all_unreclaimable(zone)) continue; @@ -1827,7 +1832,7 @@ loop_again: * even in laptop mode */ if (total_scanned > SWAP_CLUSTER_MAX * 2 && - total_scanned > nr_reclaimed + nr_reclaimed / 2) + total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) sc.may_writepage = 1; } if (all_zones_ok) @@ -1845,7 +1850,7 @@ loop_again: * matches the direct reclaim path behaviour in terms of impact * on zone->*_priority. */ - if (nr_reclaimed >= SWAP_CLUSTER_MAX) + if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) break; } out: @@ -1867,7 +1872,7 @@ out: goto loop_again; } - return nr_reclaimed; + return sc.nr_reclaimed; } /* @@ -2219,7 +2224,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) struct task_struct *p = current; struct reclaim_state reclaim_state; int priority; - unsigned long nr_reclaimed = 0; struct scan_control sc = { .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), .may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP), @@ -2252,9 +2256,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) priority = ZONE_RECLAIM_PRIORITY; do { note_zone_scanning_priority(zone, priority); - nr_reclaimed += shrink_zone(priority, zone, &sc); + shrink_zone(priority, zone, &sc); priority--; - } while (priority >= 0 && nr_reclaimed < nr_pages); + } while (priority >= 0 && sc.nr_reclaimed < nr_pages); } slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE); @@ -2278,13 +2282,13 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) * Update nr_reclaimed by the number of slab pages we * reclaimed from this zone. */ - nr_reclaimed += slab_reclaimable - + sc.nr_reclaimed += slab_reclaimable - zone_page_state(zone, NR_SLAB_RECLAIMABLE); } p->reclaim_state = NULL; current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE); - return nr_reclaimed >= nr_pages; + return sc.nr_reclaimed >= nr_pages; } int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) -- cgit From 01dbe5c9b1004dab045cb7f38428258ca9cddc02 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:02 -0800 Subject: vmscan: improve reclaim throughput to bail out patch The vmscan bail out patch move nr_reclaimed variable to struct scan_control. Unfortunately, indirect access can easily happen cache miss. if heavy memory pressure happend, that's ok. cache miss already plenty. it is not observable. but, if memory pressure is lite, performance degression is obserbable. I compared following three pattern (it was mesured 10 times each) hackbench 125 process 3000 hackbench 130 process 3000 hackbench 135 process 3000 2.6.28-rc6 bail-out 125 130 135 125 130 135 ============================================================== 71.866 75.86 81.274 93.414 73.254 193.382 74.145 78.295 77.27 74.897 75.021 80.17 70.305 77.643 75.855 70.134 77.571 79.896 74.288 73.986 75.955 77.222 78.48 80.619 72.029 79.947 78.312 75.128 82.172 79.708 71.499 77.615 77.042 74.177 76.532 77.306 76.188 74.471 83.562 73.839 72.43 79.833 73.236 75.606 78.743 76.001 76.557 82.726 69.427 77.271 76.691 76.236 79.371 103.189 72.473 76.978 80.643 69.128 78.932 75.736 avg 72.545 76.767 78.534 76.017 77.03 93.256 std 1.89 1.71 2.41 6.29 2.79 34.16 min 69.427 73.986 75.855 69.128 72.43 75.736 max 76.188 79.947 83.562 93.414 82.172 193.382 about 4-5% degression. Then, this patch introduces a temporary local variable. result: 2.6.28-rc6 this patch num 125 130 135 125 130 135 ============================================================== 71.866 75.86 81.274 67.302 68.269 77.161 74.145 78.295 77.27 72.616 72.712 79.06 70.305 77.643 75.855 72.475 75.712 77.735 74.288 73.986 75.955 69.229 73.062 78.814 72.029 79.947 78.312 71.551 74.392 78.564 71.499 77.615 77.042 69.227 74.31 78.837 76.188 74.471 83.562 70.759 75.256 76.6 73.236 75.606 78.743 69.966 76.001 78.464 69.427 77.271 76.691 69.068 75.218 80.321 72.473 76.978 80.643 72.057 77.151 79.068 avg 72.545 76.767 78.534 70.425 74.2083 78.462 std 1.89 1.71 2.41 1.66 2.34 1.00 min 69.427 73.986 75.855 67.302 68.269 76.6 max 76.188 79.947 83.562 72.616 77.151 80.321 OK. the degression is disappeared. Signed-off-by: KOSAKI Motohiro Acked-by: Rik van Riel Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 5faa7739487f..13f050d667e9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1410,6 +1410,8 @@ static void shrink_zone(int priority, struct zone *zone, unsigned long nr_to_scan; unsigned long percent[2]; /* anon @ 0; file @ 1 */ enum lru_list l; + unsigned long nr_reclaimed = sc->nr_reclaimed; + unsigned long swap_cluster_max = sc->swap_cluster_max; get_scan_ratio(zone, sc, percent); @@ -1425,7 +1427,7 @@ static void shrink_zone(int priority, struct zone *zone, } zone->lru[l].nr_scan += scan; nr[l] = zone->lru[l].nr_scan; - if (nr[l] >= sc->swap_cluster_max) + if (nr[l] >= swap_cluster_max) zone->lru[l].nr_scan = 0; else nr[l] = 0; @@ -1444,12 +1446,11 @@ static void shrink_zone(int priority, struct zone *zone, nr[LRU_INACTIVE_FILE]) { for_each_evictable_lru(l) { if (nr[l]) { - nr_to_scan = min(nr[l], - (unsigned long)sc->swap_cluster_max); + nr_to_scan = min(nr[l], swap_cluster_max); nr[l] -= nr_to_scan; - sc->nr_reclaimed += shrink_list(l, nr_to_scan, - zone, sc, priority); + nr_reclaimed += shrink_list(l, nr_to_scan, + zone, sc, priority); } } /* @@ -1460,11 +1461,13 @@ static void shrink_zone(int priority, struct zone *zone, * with multiple processes reclaiming pages, the total * freeing target can get unreasonably large. */ - if (sc->nr_reclaimed > sc->swap_cluster_max && + if (nr_reclaimed > swap_cluster_max && priority < DEF_PRIORITY && !current_is_kswapd()) break; } + sc->nr_reclaimed = nr_reclaimed; + /* * Even if we did not try to evict anon pages at all, we want to * rebalance the anon lru active/inactive ratio. -- cgit From 09f445e7f5107c91be12ed386350de6cd055e0a4 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:03 -0800 Subject: mm: kill zone_is_near_oom() zone_is_near_oom() is unused. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 13f050d667e9..466a36b3bada 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1167,11 +1167,6 @@ static inline void note_zone_scanning_priority(struct zone *zone, int priority) zone->prev_priority = priority; } -static inline int zone_is_near_oom(struct zone *zone) -{ - return zone->pages_scanned >= (zone_lru_pages(zone) * 3); -} - /* * This moves pages from the active list to the inactive list. * -- cgit From b555749aac87d7c2637f153e44bd77c7fdf4c65b Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Tue, 6 Jan 2009 14:40:13 -0800 Subject: vmscan: shrink_active_list(): reduce lru_lock hold time These three statements manipulate local variables and do not need the lock coverage. Cc: Johannes Weiner Cc: Lee Schermerhorn Cc: Rik van Riel Signed-off-by: Linus Torvalds --- mm/vmscan.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 466a36b3bada..5daf606e0a35 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1237,6 +1237,13 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, list_add(&page->lru, &l_inactive); } + /* + * Move the pages to the [file or anon] inactive list. + */ + pagevec_init(&pvec, 1); + pgmoved = 0; + lru = LRU_BASE + file * LRU_FILE; + spin_lock_irq(&zone->lru_lock); /* * Count referenced pages from currently used mappings as @@ -1247,13 +1254,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, if (scan_global_lru(sc)) zone->recent_rotated[!!file] += pgmoved; - /* - * Move the pages to the [file or anon] inactive list. - */ - pagevec_init(&pvec, 1); - - pgmoved = 0; - lru = LRU_BASE + file * LRU_FILE; while (!list_empty(&l_inactive)) { page = lru_to_page(&l_inactive); prefetchw_prev_lru_page(page, &l_inactive, flags); -- cgit From 73ce02e96fe34a983199a9855b2ae738f960a6ee Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:33 -0800 Subject: mm: stop kswapd's infinite loop at high order allocation Wassim Dagash reported following kswapd infinite loop problem. kswapd runs in some infinite loop trying to swap until order 10 of zone highmem is OK.... kswapd will continue to try to balance order 10 of zone highmem forever (or until someone release a very large chunk of highmem). For non order-0 allocations, the system may never be balanced due to fragmentation but kswapd should not infinitely loop as a result. Instead, recheck all watermarks at order-0 as they are the most important. If watermarks are ok, kswapd will go back to sleep. [akpm@linux-foundation.org: fix comment] Reported-by: wassim dagash Signed-off-by: KOSAKI Motohiro Reviewed-by: Nick Piggin Signed-off-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 5daf606e0a35..b07c48b09a93 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1867,6 +1867,23 @@ out: try_to_freeze(); + /* + * Fragmentation may mean that the system cannot be + * rebalanced for high-order allocations in all zones. + * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX, + * it means the zones have been fully scanned and are still + * not balanced. For high-order allocations, there is + * little point trying all over again as kswapd may + * infinite loop. + * + * Instead, recheck all watermarks at order-0 as they + * are the most important. If watermarks are ok, kswapd will go + * back to sleep. High-order users can still perform direct + * reclaim if they wish. + */ + if (sc.nr_reclaimed < SWAP_CLUSTER_MAX) + order = sc.order = 0; + goto loop_again; } -- cgit