summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2014-10-09mm: dmapool: add/remove sysfs file outside of the pool lock lockSebastian Andrzej Siewior
cat /sys/.../pools followed by removal the device leads to: |====================================================== |[ INFO: possible circular locking dependency detected ] |3.17.0-rc4+ #1498 Not tainted |------------------------------------------------------- |rmmod/2505 is trying to acquire lock: | (s_active#28){++++.+}, at: [<c017f754>] kernfs_remove_by_name_ns+0x3c/0x88 | |but task is already holding lock: | (pools_lock){+.+.+.}, at: [<c011494c>] dma_pool_destroy+0x18/0x17c | |which lock already depends on the new lock. |the existing dependency chain (in reverse order) is: | |-> #1 (pools_lock){+.+.+.}: | [<c0114ae8>] show_pools+0x30/0xf8 | [<c0313210>] dev_attr_show+0x1c/0x48 | [<c0180e84>] sysfs_kf_seq_show+0x88/0x10c | [<c017f960>] kernfs_seq_show+0x24/0x28 | [<c013efc4>] seq_read+0x1b8/0x480 | [<c011e820>] vfs_read+0x8c/0x148 | [<c011ea10>] SyS_read+0x40/0x8c | [<c000e960>] ret_fast_syscall+0x0/0x48 | |-> #0 (s_active#28){++++.+}: | [<c017e9ac>] __kernfs_remove+0x258/0x2ec | [<c017f754>] kernfs_remove_by_name_ns+0x3c/0x88 | [<c0114a7c>] dma_pool_destroy+0x148/0x17c | [<c03ad288>] hcd_buffer_destroy+0x20/0x34 | [<c03a4780>] usb_remove_hcd+0x110/0x1a4 The problem is the lock order of pools_lock and kernfs_mutex in dma_pool_destroy() vs show_pools() call path. This patch breaks out the creation of the sysfs file outside of the pools_lock mutex. The newly added pools_reg_lock ensures that there is no race of create vs destroy code path in terms whether or not the sysfs file has to be deleted (and was it deleted before we try to create a new one) and what to do if device_create_file() failed. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09memcg: move memcg_update_cache_size() to slab_common.cVladimir Davydov
`While growing per memcg caches arrays, we jump between memcontrol.c and slab_common.c in a weird way: memcg_alloc_cache_id - memcontrol.c memcg_update_all_caches - slab_common.c memcg_update_cache_size - memcontrol.c There's absolutely no reason why memcg_update_cache_size can't live on the slab's side though. So let's move it there and settle it comfortably amid per-memcg cache allocation functions. Besides, this patch cleans this function up a bit, removing all the useless comments from it, and renames it to memcg_update_cache_params to conform to memcg_alloc/free_cache_params, which we already have in slab_common.c. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09memcg: don't call memcg_update_all_caches if new cache id fitsVladimir Davydov
memcg_update_all_caches grows arrays of per-memcg caches, so we only need to call it when memcg_limited_groups_array_size is increased. However, currently we invoke it each time a new kmem-active memory cgroup is created. Then it just iterates over all slab_caches and does nothing (memcg_update_cache_size returns immediately). This patch fixes this insanity. In the meantime it moves the code dealing with id allocations to separate functions, memcg_alloc_cache_id and memcg_free_cache_id. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09memcg: move memcg_{alloc,free}_cache_params to slab_common.cVladimir Davydov
The only reason why they live in memcontrol.c is that we get/put css reference to the owner memory cgroup in them. However, we can do that in memcg_{un,}register_cache. OTOH, there are several reasons to move them to slab_common.c. First, I think that the less public interface functions we have in memcontrol.h the better. Since the functions I move don't depend on memcontrol, I think it's worth making them private to slab, especially taking into account that the arrays are defined on the slab's side too. Second, the way how per-memcg arrays are updated looks rather awkward: it proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c (memcg_update_all_caches) and back to memcontrol.c again (memcg_update_array_size). In the following patches I move the function relocating the arrays (memcg_update_array_size) to slab_common.c and therefore get rid this circular call path. I think we should have the cache allocation stuff in the same place where we have relocation, because it's easier to follow the code then. So I move arrays alloc/free functions to slab_common.c too. The third point isn't obvious. I'm going to make the list_lru structure per-memcg to allow targeted kmem reclaim. That means we will have per-memcg arrays in list_lrus too. It turns out that it's much easier to update these arrays in list_lru.c rather than in memcontrol.c, because all the stuff we need is defined there. This patch makes memcg caches arrays allocation path conform that of the upcoming list_lru. So let's move these functions to slab_common.c and make them static. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm/debug.c: use pr_emerg()Andrew Morton
- s/KERN_ALERT/pr_emerg/: we're going BUG so let's maximize the changes of getting the message out. - convert debug.c to pr_foo() Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: use VM_BUG_ON_MM where possibleSasha Levin
Dump the contents of the relevant struct_mm when we hit the bug condition. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: introduce VM_BUG_ON_MMSasha Levin
Very similar to VM_BUG_ON_PAGE and VM_BUG_ON_VMA, dump struct_mm when the bug is hit. [akpm@linux-foundation.org: coding-style fixes] [mhocko@suse.cz: fix build] [mhocko@suse.cz: fix build some more] [akpm@linux-foundation.org: do strange things to avoid doing strange things for the comma separators] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: move debug code out of page_alloc.cSasha Levin
dump_page() and dump_vma() are not specific to page_alloc.c, move them out so page_alloc.c won't turn into the unofficial debug repository. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: page_alloc: default node-ordering on 64-bit NUMA, zone-ordering on 32-bitMel Gorman
Zones are allocated by the page allocator in either node or zone order. Node ordering is preferred in terms of locality and is applied automatically in one of three cases: 1. If a node has only low memory 2. If DMA/DMA32 is a high percentage of memory 3. If low memory on a single node is greater than 70% of the node size Otherwise zone ordering is used to preserve low memory for devices that require it. Unfortunately a consequence of this is that applications running on a machine with balanced NUMA nodes will experience different performance characteristics depending on which node they happen to start from. The point of zone ordering is to protect lower zones for devices that require DMA/DMA32 memory. When NUMA was first introduced, this was critical as 32-bit NUMA machines existed and exhausting low memory triggered OOMs easily as so many allocations required low memory. On 64-bit machines the primary concern is devices that are 32-bit only which is less severe than the low memory exhaustion problem on 32-bit NUMA. It seems there are really few devices that depends on it. AGP -- I assume this is getting more rare but even then I think the allocations happen early in boot time where lowmem pressure is less of a problem DRM -- If the device is 32-bit only then there may be low pressure. I didn't evaluate these in detail but it looks like some of these are mobile graphics card. Not many NUMA laptops out there. DRM folk should know better though. Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines? B43 wireless card -- again not really a NUMA thing. I cannot find a good reason to incur a performance penalty on all 64-bit NUMA machines in case someone throws a brain damanged TV or graphics card in there. This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted to make it default everywhere but I understand that some embedded arches may be using 32-bit NUMA where I cannot predict the consequences. The performance impact depends on the workload and the characteristics of the machine and the machine I tested on had a large Normal zone on node 0 so the impact is within the noise for the majority of tests. The allocation stats show more allocation requests were from DMA32 and local node. Running SpecJBB with multiple JVMs and automatic NUMA balancing disabled the results were specjbb 3.17.0-rc2 3.17.0-rc2 vanilla nodeorder-v1r1 Min 1 29534.00 ( 0.00%) 30020.00 ( 1.65%) Min 10 115717.00 ( 0.00%) 134038.00 ( 15.83%) Min 19 109718.00 ( 0.00%) 114186.00 ( 4.07%) Min 28 104459.00 ( 0.00%) 103639.00 ( -0.78%) Min 37 98245.00 ( 0.00%) 103756.00 ( 5.61%) Min 46 97198.00 ( 0.00%) 96197.00 ( -1.03%) Mean 1 30953.25 ( 0.00%) 31917.75 ( 3.12%) Mean 10 124432.50 ( 0.00%) 140904.00 ( 13.24%) Mean 19 116033.50 ( 0.00%) 119294.75 ( 2.81%) Mean 28 108365.25 ( 0.00%) 106879.50 ( -1.37%) Mean 37 102984.75 ( 0.00%) 106924.25 ( 3.83%) Mean 46 100783.25 ( 0.00%) 105368.50 ( 4.55%) Stddev 1 1260.38 ( 0.00%) 1109.66 ( 11.96%) Stddev 10 7434.03 ( 0.00%) 5171.91 ( 30.43%) Stddev 19 8453.84 ( 0.00%) 5309.59 ( 37.19%) Stddev 28 4184.55 ( 0.00%) 2906.63 ( 30.54%) Stddev 37 5409.49 ( 0.00%) 3192.12 ( 40.99%) Stddev 46 4521.95 ( 0.00%) 7392.52 (-63.48%) Max 1 32738.00 ( 0.00%) 32719.00 ( -0.06%) Max 10 136039.00 ( 0.00%) 148614.00 ( 9.24%) Max 19 130566.00 ( 0.00%) 127418.00 ( -2.41%) Max 28 115404.00 ( 0.00%) 111254.00 ( -3.60%) Max 37 112118.00 ( 0.00%) 111732.00 ( -0.34%) Max 46 108541.00 ( 0.00%) 116849.00 ( 7.65%) TPut 1 123813.00 ( 0.00%) 127671.00 ( 3.12%) TPut 10 497730.00 ( 0.00%) 563616.00 ( 13.24%) TPut 19 464134.00 ( 0.00%) 477179.00 ( 2.81%) TPut 28 433461.00 ( 0.00%) 427518.00 ( -1.37%) TPut 37 411939.00 ( 0.00%) 427697.00 ( 3.83%) TPut 46 403133.00 ( 0.00%) 421474.00 ( 4.55%) 3.17.0-rc2 3.17.0-rc2 vanillanodeorder-v1r1 DMA allocs 0 0 DMA32 allocs 57 1491992 Normal allocs 32543566 30026383 Movable allocs 0 0 Direct pages scanned 0 0 Kswapd pages scanned 0 0 Kswapd pages reclaimed 0 0 Direct pages reclaimed 0 0 Kswapd efficiency 100% 100% Kswapd velocity 0.000 0.000 Direct efficiency 100% 100% Direct velocity 0.000 0.000 Percentage direct scans 0% 0% Zone normal velocity 0.000 0.000 Zone dma32 velocity 0.000 0.000 Zone dma velocity 0.000 0.000 THP fault alloc 55164 52987 THP collapse alloc 139 147 THP splits 26 21 NUMA alloc hit 4169066 4250692 NUMA alloc miss 0 0 Note that there were more DMA32 allocations with the patch applied. In this particular case there was no difference in numa_hit and numa_miss. The expectation is that DMA32 was being used at the low watermark instead of falling into the slow path. kswapd was not woken but it's not worken for THP allocations. On 32-bit, this patch defaults to zone-ordering as low memory depletion can be a serious problem on 32-bit large memory machines. If the default ordering was node then processes on node 0 will deplete the Normal zone due to normal activity. The problem is worse if CONFIG_HIGHPTE is not set. If combined with large amounts of dirty/writeback pages in Normal zone then there is also a high risk of OOM. The heuristics are removed as it's not clear they were ever important on 32-bit. They were only relevant for setting node-ordering on 64-bit. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: page_alloc: Make paranoid check in move_freepages a VM_BUG_ONMel Gorman
Since 2.6.24 there has been a paranoid check in move_freepages that looks up the zone of two pages. This is a very slow path and the only time I've seen this bug trigger recently is when memory initialisation was broken during patch development. Despite the fact it's a slow path, this patch converts the check to a VM_BUG_ON anyway as it has served its purpose by now. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm/compaction.c: fix warning of 'flags' may be used uninitializedXiubo Li
C mm/compaction.o mm/compaction.c: In function isolate_freepages_block: mm/compaction.c:364:37: warning: flags may be used uninitialized in this function [-Wmaybe-uninitialized] && compact_unlock_should_abort(&cc->zone->lock, flags, ^ Signed-off-by: Xiubo Li <Li.Xiubo@freescale.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm/mmap.c: clean up CONFIG_DEBUG_VM_RB checksAndrew Morton
- be consistent in printing the test which failed - one message was actually wrong (a<b != b>a) - don't print second bogus warning if browse_rb() failed Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: clean up zone flagsJohannes Weiner
Page reclaim tests zone_is_reclaim_dirty(), but the site that actually sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the reader through layers indirection just to track down a simple bit. Remove all zone flag wrappers and just use bitops against zone->flags directly. It's just as readable and the lines are barely any longer. Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and remove the zone_flags_t typedef. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm/page-writeback.c: use min3/max3 macros to avoid shadow warningsMark Rustad
Nested calls to min/max functions result in shadow warnings in W=2 builds. Avoid the warning by using the min3 and max3 macros to get the min/max of 3 values instead of nested calls. Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: page_alloc: avoid wakeup kswapd on the unintended nodeWeijie Yang
When entering the page_alloc slowpath, we wakeup kswapd on every pgdat according to the zonelist and high_zoneidx. However, this doesn't take nodemask into account, and could prematurely wakeup kswapd on some unintended nodes. This patch uses for_each_zone_zonelist_nodemask() instead of for_each_zone_zonelist() in wake_all_kswapds() to avoid the above situation. Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: convert a few VM_BUG_ON callers to VM_BUG_ON_VMASasha Levin
Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract more information when they trigger. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: introduce dump_vmaSasha Levin
Introduce a helper to dump information about a VMA, this also makes dump_page_flags more generic and re-uses that so the output looks very similar to dump_page: [ 61.903437] vma ffff88070f88be00 start 00007fff25970000 end 00007fff25992000 [ 61.903437] next ffff88070facd600 prev ffff88070face400 mm ffff88070fade000 [ 61.903437] prot 8000000000000025 anon_vma ffff88070fa1e200 vm_ops (null) [ 61.903437] pgoff 7ffffffdd file (null) private_data (null) [ 61.909129] flags: 0x100173(read|write|mayread|maywrite|mayexec|growsdown|account) [akpm@linux-foundation.org: make dump_vma() require CONFIG_DEBUG_VM] [swarren@nvidia.com: fix dump_vma() compilation] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Stephen Warren <swarren@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm/slab.c: use __seq_open_private() instead of seq_open()Rob Jones
Using __seq_open_private() removes boilerplate code from slabstats_open() The resultant code is shorter and easier to follow. This patch does not change any functionality. Signed-off-by: Rob Jones <rob.jones@codethink.co.uk> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm/vmalloc.c: use seq_open_private() instead of seq_open()Rob Jones
Using seq_open_private() removes boilerplate code from vmalloc_open(). The resultant code is shorter and easier to follow. However, please note that seq_open_private() call kzalloc() rather than kmalloc() which may affect timing due to the memory initialisation overhead. Signed-off-by: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09include/linux/migrate.h: remove migrate_page #defineAndrew Morton
This is designed to avoid a few ifdefs in .c files but it's obnoxious because it can cause unsuspecting "migrate_page" symbols to get turned into "NULL". Just nuke it and use the ifdefs. Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mempolicy: unexport get_vma_policy() and remove its "task" argOleg Nesterov
- get_vma_policy(task) is not safe if task != current, remove this argument. - get_vma_policy() no longer has callers outside of mempolicy.c, make it static. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mempolicy: kill do_set_mempolicy()->down_write(&mm->mmap_sem)Oleg Nesterov
Remove down_write(&mm->mmap_sem) in do_set_mempolicy(). This logic was never correct and it is no longer needed, see the previous patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mempolicy: introduce __get_vma_policy(), export get_task_policy()Oleg Nesterov
Extract the code which looks for vma's policy from get_vma_policy() into the new helper, __get_vma_policy(). Export get_task_policy(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mempolicy: remove the "task" arg of vma_policy_mof() and simplify itOleg Nesterov
1. vma_policy_mof(task) is simply not safe unless task == current, it can race with do_exit()->mpol_put(). Remove this arg and update its single caller. 2. vma can not be NULL, remove this check and simplify the code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mempolicy: sanitize the usage of get_task_policy()Oleg Nesterov
Cleanup + preparation. Every user of get_task_policy() calls it unconditionally, even if it is not going to use the result. get_task_policy() is cheap but still this does not look clean, plus the code looks simpler if get_task_policy() is called only when this is really needed. Note: I hope this is correct, but it is not clear why vma_policy_mof() doesn't fall back to get_task_policy() if ->get_policy() returns NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mempolicy: change get_task_policy() to return default_policy rather than NULLOleg Nesterov
Every caller of get_task_policy() falls back to default_policy if it returns NULL. Change get_task_policy() to do this. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mempolicy: change alloc_pages_vma() to use mpol_cond_put()Oleg Nesterov
Trivial cleanup. alloc_pages_vma() can use mpol_cond_put(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: remove noisy remainder of the scan_unevictable interfaceJohannes Weiner
The deprecation warnings for the scan_unevictable interface triggers by scripts doing `sysctl -a | grep something else'. This is annoying and not helpful. The interface has been defunct since 264e56d8247e ("mm: disable user interface to manually rescue unevictable pages"), which was in 2011, and there haven't been any reports of usecases for it, only reports that the deprecation warnings are annying. It's unlikely that anybody is using this interface specifically at this point, so remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: use may_adjust_brk helperCyrill Gorcunov
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: pass gfp mask to compact_controlDavid Rientjes
struct compact_control currently converts the gfp mask to a migratetype, but we need the entire gfp mask in a follow-up patch. Pass the entire gfp mask as part of struct compact_control. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: rename allocflags_to_migratetype for clarityDavid Rientjes
The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like ALLOC_CPUSET) that have separate semantics. The function allocflags_to_migratetype() actually takes gfp flags, not alloc flags, and returns a migratetype. Rename it to gfpflags_to_migratetype(). Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: skip buddy pages by their order in the migrate scannerVlastimil Babka
The migration scanner skips PageBuddy pages, but does not consider their order as checking page_order() is generally unsafe without holding the zone->lock, and acquiring the lock just for the check wouldn't be a good tradeoff. Still, this could avoid some iterations over the rest of the buddy page, and if we are careful, the race window between PageBuddy() check and page_order() is small, and the worst thing that can happen is that we skip too much and miss some isolation candidates. This is not that bad, as compaction can already fail for many other reasons like parallel allocations, and those have much larger race window. This patch therefore makes the migration scanner obtain the buddy page order and use it to skip the whole buddy page, if the order appears to be in the valid range. It's important that the page_order() is read only once, so that the value used in the checks and in the pfn calculation is the same. But in theory the compiler can replace the local variable by multiple inlines of page_order(). Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to prevent this. Testing with stress-highalloc from mmtests shows a 15% reduction in number of pages scanned by migration scanner. The reduction is >60% with __GFP_NO_KSWAPD allocations, along with success rates better by few percent. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: remember position within pageblock in free pages scannerVlastimil Babka
Unlike the migration scanner, the free scanner remembers the beginning of the last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages uselessly when called several times during single compaction. This might have been useful when pages were returned to the buddy allocator after a failed migration, but this is no longer the case. This patch changes the meaning of cc->free_pfn so that if it points to a middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the end. isolate_freepages_block() will record the pfn of the last page it looked at, which is then used to update cc->free_pfn. In the mmtests stress-highalloc benchmark, this has resulted in lowering the ratio between pages scanned by both scanners, from 2.5 free pages per migrate page, to 2.25 free pages per migrate page, without affecting success rates. With __GFP_NO_KSWAPD allocations, this appears to result in a worse ratio (2.1 instead of 1.8), but page migration successes increased by 10%, so this could mean that more useful work can be done until need_resched() aborts this kind of compaction. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: skip rechecks when lock was already heldVlastimil Babka
Compaction scanners try to lock zone locks as late as possible by checking many page or pageblock properties opportunistically without lock and skipping them if not unsuitable. For pages that pass the initial checks, some properties have to be checked again safely under lock. However, if the lock was already held from a previous iteration in the initial checks, the rechecks are unnecessary. This patch therefore skips the rechecks when the lock was already held. This is now possible to do, since we don't (potentially) drop and reacquire the lock between the initial checks and the safe rechecks anymore. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: periodically drop lock and restore IRQs in scannersVlastimil Babka
Compaction scanners regularly check for lock contention and need_resched() through the compact_checklock_irqsave() function. However, if there is no contention, the lock can be held and IRQ disabled for potentially long time. This has been addressed by commit b2eef8c0d091 ("mm: compaction: minimise the time IRQs are disabled while isolating pages for migration") for the migration scanner. However, the refactoring done by commit 2a1402aa044b ("mm: compaction: acquire the zone->lru_lock as late as possible") has changed the conditions so that the lock is dropped only when there's contention on the lock or need_resched() is true. Also, need_resched() is checked only when the lock is already held. The comment "give a chance to irqs before checking need_resched" is therefore misleading, as IRQs remain disabled when the check is done. This patch restores the behavior intended by commit b2eef8c0d091 and also tries to better balance and make more deterministic the time spent by checking for contention vs the time the scanners might run between the checks. It also avoids situations where checking has not been done often enough before. The result should be avoiding both too frequent and too infrequent contention checking, and especially the potentially long-running scans with IRQs disabled and no checking of need_resched() or for fatal signal pending, which can happen when many consecutive pages or pageblocks fail the preliminary tests and do not reach the later call site to compact_checklock_irqsave(), as explained below. Before the patch: In the migration scanner, compact_checklock_irqsave() was called each loop, if reached. If not reached, some lower-frequency checking could still be done if the lock was already held, but this would not result in aborting contended async compaction until reaching compact_checklock_irqsave() or end of pageblock. In the free scanner, it was similar but completely without the periodical checking, so lock can be potentially held until reaching the end of pageblock. After the patch, in both scanners: The periodical check is done as the first thing in the loop on each SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort() function, which always unlocks the lock (if locked) and aborts async compaction if scheduling is needed. It also aborts any type of compaction when a fatal signal is pending. The compact_checklock_irqsave() function is replaced with a slightly different compact_trylock_irqsave(). The biggest difference is that the function is not called at all if the lock is already held. The periodical need_resched() checking is left solely to compact_unlock_should_abort(). The lock contention avoidance for async compaction is achieved by the periodical unlock by compact_unlock_should_abort() and by using trylock in compact_trylock_irqsave() and aborting when trylock fails. Sync compaction does not use trylock. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: khugepaged should not give up due to need_resched()Vlastimil Babka
Async compaction aborts when it detects zone lock contention or need_resched() is true. David Rientjes has reported that in practice, most direct async compactions for THP allocation abort due to need_resched(). This means that a second direct compaction is never attempted, which might be OK for a page fault, but khugepaged is intended to attempt a sync compaction in such case and in these cases it won't. This patch replaces "bool contended" in compact_control with an int that distinguishes between aborting due to need_resched() and aborting due to lock contention. This allows propagating the abort through all compaction functions as before, but passing the abort reason up to __alloc_pages_slowpath() which decides when to continue with direct reclaim and another compaction attempt. Another problem is that try_to_compact_pages() did not act upon the reported contention (both need_resched() or lock contention) immediately and would proceed with another zone from the zonelist. When need_resched() is true, that means initializing another zone compaction, only to check again need_resched() in isolate_migratepages() and aborting. For zone lock contention, the unintended consequence is that the lock contended status reported back to the allocator is detrmined from the last zone where compaction was attempted, which is rather arbitrary. This patch fixes the problem in the following way: - async compaction of a zone aborting due to need_resched() or fatal signal pending means that further zones should not be tried. We report COMPACT_CONTENDED_SCHED to the allocator. - aborting zone compaction due to lock contention means we can still try another zone, since it has different set of locks. We report back COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted, it was aborted due to lock contention. As a result of these fixes, khugepaged will proceed with second sync compaction as intended, when the preceding async compaction aborted due to need_resched(). Page fault compactions aborting due to need_resched() will spare some cycles previously wasted by initializing another zone compaction only to abort again. Lock contention will be reported only when compaction in all zones aborted due to lock contention, and therefore it's not a good idea to try again after reclaim. In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this has improved number of THP collapse allocations by 10%, which shows positive effect on khugepaged. The benchmark's success rates are unchanged as it is not recognized as khugepaged. Numbers of compact_stall and compact_fail events have however decreased by 20%, with compact_success still a bit improved, which is good. With benchmark configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP collapse allocations, and only slight improvement in stalls and failures. [akpm@linux-foundation.org: fix warnings] Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: reduce zone checking frequency in the migration scannerVlastimil Babka
The unification of the migrate and free scanner families of function has highlighted a difference in how the scanners ensure they only isolate pages of the intended zone. This is important for taking zone lock or lru lock of the correct zone. Due to nodes overlapping, it is however possible to encounter a different zone within the range of the zone being compacted. The free scanner, since its inception by commit 748446bb6b5a ("mm: compaction: memory compaction core"), has been checking the zone of the first valid page in a pageblock, and skipping the whole pageblock if the zone does not match. This checking was completely missing from the migration scanner at first, and later added by commit dc9086004b3d ("mm: compaction: check for overlapping nodes during isolation for migration") in a reaction to a bug report. But the zone comparison in migration scanner is done once per a single scanned page, which is more defensive and thus more costly than a check per pageblock. This patch unifies the checking done in both scanners to once per pageblock, through a new pageblock_pfn_to_page() function, which also includes pfn_valid() checks. It is more defensive than the current free scanner checks, as it checks both the first and last page of the pageblock, but less defensive by the migration scanner per-page checks. It assumes that node overlapping may result (on some architecture) in a boundary between two nodes falling into the middle of a pageblock, but that there cannot be a node0 node1 node0 interleaving within a single pageblock. The result is more code being shared and a bit less per-page CPU cost in the migration scanner. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: move pageblock checks up from isolate_migratepages_range()Vlastimil Babka
isolate_migratepages_range() is the main function of the compaction scanner, called either on a single pageblock by isolate_migratepages() during regular compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range(). It currently perfoms two pageblock-wide compaction suitability checks, and because of the CMA callpath, it tracks if it crossed a pageblock boundary in order to repeat those checks. However, closer inspection shows that those checks are always true for CMA: - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true - migrate_async_suitable() check is skipped because CMA uses sync compaction We can therefore move the compaction-specific checks to isolate_migratepages() and simplify isolate_migratepages_range(). Furthermore, we can mimic the freepage scanner family of functions, which has isolate_freepages_block() function called both by compaction from isolate_freepages() and by CMA from isolate_freepages_range(), where each use-case adds own specific glue code. This allows further code simplification. Thus, we rename isolate_migratepages_range() to isolate_migratepages_block() and limit its functionality to a single pageblock (or its subset). For CMA, a new different isolate_migratepages_range() is created as a CMA-specific wrapper for the _block() function. The checks specific to compaction are moved to isolate_migratepages(). As part of the unification of these two families of functions, we remove the redundant zone parameter where applicable, since zone pointer is already passed in cc->zone. Furthermore, going back to compact_zone() and compact_finished() when pageblock is found unsuitable (now by isolate_migratepages()) is wasteful - the checks are meant to skip pageblocks quickly. The patch therefore also introduces a simple loop into isolate_migratepages() so that it does not return immediately on failed pageblock checks, but keeps going until isolate_migratepages_range() gets called once. Similarily to isolate_freepages(), the function periodically checks if it needs to reschedule or abort async compaction. [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: do not recheck suitable_migration_target under lockVlastimil Babka
isolate_freepages_block() rechecks if the pageblock is suitable to be a target for migration after it has taken the zone->lock. However, the check has been optimized to occur only once per pageblock, and compact_checklock_irqsave() might be dropping and reacquiring lock, which means somebody else might have changed the pageblock's migratetype meanwhile. Furthermore, nothing prevents the migratetype to change right after isolate_freepages_block() has finished isolating. Given how imperfect this is, it's simpler to just rely on the check done in isolate_freepages() without lock, and not pretend that the recheck under lock guarantees anything. It is just a heuristic after all. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: do not count compact_stall if all zones skipped compactionVlastimil Babka
The compact_stall vmstat counter counts the number of allocations stalled by direct compaction. It does not count when all attempted zones had deferred compaction, but it does count when all zones skipped compaction. The skipping is decided based on very early check of compaction_suitable(), based on watermarks and memory fragmentation. Therefore it makes sense not to count skipped compactions as stalls. Moreover, compact_success or compact_fail is also already not being counted when compaction was skipped, so this patch changes the compact_stall counting to match the other two. Additionally, restructure __alloc_pages_direct_compact() code for better readability. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, compaction: defer each zone individually instead of preferred zoneVlastimil Babka
When direct sync compaction is often unsuccessful, it may become deferred for some time to avoid further useless attempts, both sync and async. Successful high-order allocations un-defer compaction, while further unsuccessful compaction attempts prolong the compaction deferred period. Currently the checking and setting deferred status is performed only on the preferred zone of the allocation that invoked direct compaction. But compaction itself is attempted on all eligible zones in the zonelist, so the behavior is suboptimal and may lead both to scenarios where 1) compaction is attempted uselessly, or 2) where it's not attempted despite good chances of succeeding, as shown on the examples below: 1) A direct compaction with Normal preferred zone failed and set deferred compaction for the Normal zone. Another unrelated direct compaction with DMA32 as preferred zone will attempt to compact DMA32 zone even though the first compaction attempt also included DMA32 zone. In another scenario, compaction with Normal preferred zone failed to compact Normal zone, but succeeded in the DMA32 zone, so it will not defer compaction. In the next attempt, it will try Normal zone which will fail again, instead of skipping Normal zone and trying DMA32 directly. 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks looking good. A direct compaction with preferred Normal zone will skip compaction of all zones including DMA32 because Normal was still deferred. The allocation might have succeeded in DMA32, but won't. This patch makes compaction deferring work on individual zone basis instead of preferred zone. For each zone, it checks compaction_deferred() to decide if the zone should be skipped. If watermarks fail after compacting the zone, defer_compaction() is called. The zone where watermarks passed can still be deferred when the allocation attempt is unsuccessful. When allocation is successful, compaction_defer_reset() is called for the zone containing the allocated page. This approach should approximate calling defer_compaction() only on zones where compaction was attempted and did not yield allocated page. There might be corner cases but that is inevitable as long as the decision to stop compacting dues not guarantee that a page will be allocated. Due to a new COMPACT_DEFERRED return value, some functions relying implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made more accurate. The did_some_progress output parameter of __alloc_pages_direct_compact() is removed completely, as the caller actually does not use it after compaction sets it - it is only considered when direct reclaim sets it. During testing on a two-node machine with a single very small Normal zone on node 1, this patch has improved success rates in stress-highalloc mmtests benchmark. The success here were previously made worse by commit 3a025760fc15 ("mm: page_alloc: spill to remote nodes before waking kswapd") as kswapd was no longer resetting often enough the deferred compaction for the Normal zone, and DMA32 zones on both nodes were thus not considered for compaction. On different machine, success rates were improved with __GFP_NO_KSWAPD allocations. [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm, THP: don't hold mmap_sem in khugepaged when allocating THPVlastimil Babka
When allocating huge page for collapsing, khugepaged currently holds mmap_sem for reading on the mm where collapsing occurs. Afterwards the read lock is dropped before write lock is taken on the same mmap_sem. Holding mmap_sem during whole huge page allocation is therefore useless, the vma needs to be rechecked after taking the write lock anyway. Furthemore, huge page allocation might involve a rather long sync compaction, and thus block any mmap_sem writers and i.e. affect workloads that perform frequent m(un)map or mprotect oterations. This patch simply releases the read lock before allocating a huge page. It also deletes an outdated comment that assumed vma must be stable, as it was using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node"). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: page_alloc: determine migratetype only onceVlastimil Babka
The check for ALLOC_CMA in __alloc_pages_nodemask() derives migratetype from gfp_mask in each retry pass, although the migratetype variable already has the value determined and it does not change. Use the variable and perform the check only once. Also convert #ifdef CONFIG_CMA to IS_ENABLED. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: cma: adjust address limit to avoid hitting low/high memory boundaryMarek Szyprowski
Russell King recently noticed that limiting default CMA region only to low memory on ARM architecture causes serious memory management issues with machines having a lot of memory (which is mainly available as high memory). More information can be found the following thread: http://thread.gmane.org/gmane.linux.ports.arm.kernel/348441/ Those two patches removes this limit letting kernel to put default CMA region into high memory when this is possible (there is enough high memory available and architecture specific DMA limit fits). This should solve strange OOM issues on systems with lots of RAM (i.e. >1GiB) and large (>256M) CMA area. This patch (of 2): Automatically allocated regions should not cross low/high memory boundary, because such regions cannot be later correctly initialized due to spanning across two memory zones. This patch adds a check for this case and a simple code for moving region to low memory if automatically selected address might not fit completely into high memory. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Daniel Drake <drake@endlessm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09memory-hotplug: add sysfs valid_zones attributeZhang Zhen
Currently memory-hotplug has two limits: 1. If the memory block is in ZONE_NORMAL, you can change it to ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE. 2. If the memory block is in ZONE_MOVABLE, you can change it to ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL. With this patch, we can easy to know a memory block can be onlined to which zone, and don't need to know the above two limits. Updated the related Documentation. [akpm@linux-foundation.org: use conventional comment layout] [akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n] [akpm@linux-foundation.org: remove unused local zone_prev] Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Wang Nan <wangnan0@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm/mmap.c: whitespace fixesvishnu.ps
Signed-off-by: vishnu.ps <vishnu.ps@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm/slab: use percpu allocator for cpu cacheJoonsoo Kim
Because of chicken and egg problem, initialization of SLAB is really complicated. We need to allocate cpu cache through SLAB to make the kmem_cache work, but before initialization of kmem_cache, allocation through SLAB is impossible. On the other hand, SLUB does initialization in a more simple way. It uses percpu allocator to allocate cpu cache so there is no chicken and egg problem. So, this patch try to use percpu allocator in SLAB. This simplifies the initialization step in SLAB so that we could maintain SLAB code more easily. In my testing there is no performance difference. This implementation relies on percpu allocator. Because percpu allocator uses vmalloc address space, vmalloc address space could be exhausted by this change on many cpu system with *32 bit* kernel. This implementation can cover 1024 cpus in worst case by following calculation. Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches * 120 objects per cpu_cache = 140 MB Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) * 80 objects per cpu_cache = 46 MB Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Jeremiah Mahler <jmmahler@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm/slab: support slab mergeJoonsoo Kim
Slab merge is good feature to reduce fragmentation. If new creating slab have similar size and property with exsitent slab, this feature reuse it rather than creating new one. As a result, objects are packed into fewer slabs so that fragmentation is reduced. Below is result of my testing. * After boot, sleep 20; cat /proc/meminfo | grep Slab <Before> Slab: 25136 kB <After> Slab: 24364 kB We can save 3% memory used by slab. For supporting this feature in SLAB, we need to implement SLAB specific kmem_cache_flag() and __kmem_cache_alias(), because SLUB implements some SLUB specific processing related to debug flag and object size change on these functions. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm/slab_common: commonize slab merge logicJoonsoo Kim
Slab merge is good feature to reduce fragmentation. Now, it is only applied to SLUB, but, it would be good to apply it to SLAB. This patch is preparation step to apply slab merge to SLAB by commonizing slab merge logic. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09slab: fix for_each_kmem_cache_node()Mikulas Patocka
Fix a bug (discovered with kmemcheck) in for_each_kmem_cache_node(). The for loop reads the array "node" before verifying that the index is within the range. This results in kmemcheck warning. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>