From a50527b19c62c808a7fca022816fff88a50b948d Mon Sep 17 00:00:00 2001
From: Jan Kara
Date: Fri, 2 Dec 2011 09:17:02 +0800
Subject: fs: Make write(2) interruptible by a fatal signal

Currently write(2) to a file is not interruptible by any signal.
Sometimes this is desirable, e.g. when you want to quickly kill a
process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make
task in balance_dirty_pages() killable"), it's necessary to abort the
current write accordingly to avoid it quickly dirtying lots more pages
at an unthrottled rate.

This patch makes write interruptible by SIGKILL. We do not allow write
to be interruptible by any other signal because that has a larger
potential of breaking some badly written applications.

Reported-by: Kazuya Mio
Tested-by: Kazuya Mio
Acked-by: Matthew Wilcox
Signed-off-by: Jan Kara
Signed-off-by: Wu Fengguang
---
 mm/filemap.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'mm')

diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2d50e0..c106d3b3cc64 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
 						iov_iter_count(i));
 
 again:
-
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -2463,7 +2462,10 @@ again:
 		written += copied;
 
 		balance_dirty_pages_ratelimited(mapping);
-
+		if (fatal_signal_pending(current)) {
+			status = -EINTR;
+			break;
+		}
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
--
cgit
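What callers observe after this change: a write(2) that is cut short by a
fatal signal returns the bytes written so far, or -EINTR if nothing was
copied yet. The snippet below is an illustrative userspace sketch, not part
of the patch; write_all() is a made-up helper name showing the conventional
way applications cope with short writes and EINTR in general.

  /* Hedged sketch: portable POSIX C, no kernel internals assumed. */
  #include <errno.h>
  #include <stddef.h>
  #include <unistd.h>

  static ssize_t write_all(int fd, const char *buf, size_t count)
  {
  	size_t done = 0;

  	while (done < count) {
  		ssize_t n = write(fd, buf + done, count - done);

  		if (n < 0) {
  			if (errno == EINTR)	/* interrupted before any progress */
  				continue;
  			return -1;		/* real error */
  		}
  		done += n;			/* short write: resume where we stopped */
  	}
  	return (ssize_t)done;
  }

  int main(void)
  {
  	static const char msg[] = "hello\n";

  	return write_all(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0;
  }

(Under SIGKILL the interrupted task never returns to userspace at all; the
point here is only the short-count/-EINTR contract that the
"return written ? written : status" line above implements.)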
From aed21ad28b1323b2807faea019e5ac388a7bc837 Mon Sep 17 00:00:00 2001
From: Wu Fengguang
Date: Wed, 23 Nov 2011 11:44:41 -0600
Subject: writeback: comment on the bdi dirty threshold

We do "floating proportions" to let active devices grow their target
share of dirty pages and stalled/inactive devices decrease their target
share over time.

It works well except in the case of "an inactive disk suddenly goes
busy", where the initial target share may be too small. To mitigate
this, bdi_position_ratio() has the below line to raise a small
bdi_thresh when it's safe to do so, so that the disk is fed with enough
dirty pages for efficient IO and, in turn, a fast ramp-up of
bdi_thresh:

	bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);

balance_dirty_pages() normally does negative feedback control which
adjusts the ratelimit to balance the bdi dirty pages around the target.
In some extreme cases when that is not enough, it will have to block
the tasks completely until the bdi dirty pages drop below bdi_thresh.

Acked-by: Jan Kara
Acked-by: Peter Zijlstra
Signed-off-by: Wu Fengguang
---
 mm/page-writeback.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

(limited to 'mm')

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 71252486bc6f..155efca4c123 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -411,8 +411,13 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
  *
  * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
  * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
- * And the "limit" in the name is not seriously taken as hard limit in
- * balance_dirty_pages().
+ *
+ * Note that balance_dirty_pages() will only seriously take it as a hard limit
+ * when sleeping max_pause per page is not enough to keep the dirty pages under
+ * control. For example, when the device is completely stalled due to some error
+ * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
+ * In the other normal situations, it acts more gently by throttling the tasks
+ * more (rather than completely block them) when the bdi dirty pages go high.
  *
  * It allocates high/low dirty limits to fast/slow devices, in order to prevent
  * - starving fast devices
@@ -594,6 +599,13 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
 	 */
 	if (unlikely(bdi_thresh > thresh))
 		bdi_thresh = thresh;
+	/*
+	 * It's very possible that bdi_thresh is close to 0 not because the
+	 * device is slow, but that it has remained inactive for long time.
+	 * Honour such devices a reasonable good (hopefully IO efficient)
+	 * threshold, so that the occasional writes won't be blocked and active
+	 * writes can rampup the threshold quickly.
+	 */
 	bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
 	/*
 	 * scale global setpoint to bdi's:
--
cgit
From c5c6343c4d75f9d3226e05a72e7861e967fc8099 Mon Sep 17 00:00:00 2001
From: Wu Fengguang
Date: Fri, 2 Dec 2011 10:21:33 -0600
Subject: writeback: permit through good bdi even when global dirty exceeded

On a system with 1 local mount and 1 NFS mount, if the NFS server stops
responding while dd is writing to the NFS mount, the NFS dirty pages may
exceed the global dirty limit and _every_ task involving writing will be
blocked. The whole system appears unresponsive.

The workaround is to permit through the bdi's that only have a small
number of dirty pages. The number chosen (bdi_stat_error pages) is not
enough to let the local disk run at optimal throughput, but it is enough
to make the system responsive on a broken NFS mount. The user can then
kill the dirtiers on the NFS mount and increase the global dirty limit
to bring up the local disk's throughput.

It risks allowing dirty pages to grow much larger than the global dirty
limit when there are 1000+ mounts; however, that's very unlikely to
happen, especially in low memory profiles.

Signed-off-by: Wu Fengguang
---
 mm/page-writeback.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

(limited to 'mm')

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 155efca4c123..17403e3a7c89 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1148,6 +1148,19 @@ pause:
 		if (task_ratelimit)
 			break;
 
+		/*
+		 * In the case of an unresponding NFS server and the NFS dirty
+		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
+		 * to go through, so that tasks on them still remain responsive.
+		 *
+		 * In theory 1 page is enough to keep the comsumer-producer
+		 * pipe going: the flusher cleans 1 page => the task dirties 1
+		 * more page. However bdi_dirty has accounting errors. So use
+		 * the larger and more IO friendly bdi_stat_error.
+		 */
+		if (bdi_dirty <= bdi_stat_error(bdi))
+			break;
+
 		if (fatal_signal_pending(current))
 			break;
 	}
--
cgit
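The key arithmetic in the first of these two patches is the floor
bdi_thresh = max(bdi_thresh, (limit - dirty) / 8). The toy program below
(illustrative numbers only, not kernel code) shows how a long-idle device
whose proportional share has decayed to almost nothing still gets a usable
threshold, so its first writes are not throttled to a crawl while its share
ramps back up.

  #include <stdio.h>

  static unsigned long max_ul(unsigned long a, unsigned long b)
  {
  	return a > b ? a : b;
  }

  int main(void)
  {
  	unsigned long limit = 100000;	/* global dirty limit, in pages */
  	unsigned long dirty = 60000;	/* dirty pages currently in the system */
  	unsigned long bdi_thresh = 50;	/* decayed share of a long-idle bdi */

  	bdi_thresh = max_ul(bdi_thresh, (limit - dirty) / 8);
  	printf("raised bdi_thresh = %lu pages\n", bdi_thresh);	/* 5000 */
  	return 0;
  }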
From 82e230a07de3812a5e87a27979f033dad59172e3 Mon Sep 17 00:00:00 2001
From: Wu Fengguang
Date: Fri, 2 Dec 2011 18:21:51 -0600
Subject: writeback: set max_pause to lowest value on zero bdi_dirty

Some traces show lots of bdi_dirty=0 lines where bdi_dirty would really
be some small value if not for the accounting errors in the per-cpu bdi
stats. In this case the max pause time should be set to the smallest
(non-zero) value to avoid IO queue underruns and improve throughput.

Signed-off-by: Wu Fengguang
---
 mm/page-writeback.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

(limited to 'mm')

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 17403e3a7c89..50f08241f981 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -989,8 +989,7 @@ static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
 	 *
 	 * 8 serves as the safety ratio.
 	 */
-	if (bdi_dirty)
-		t = min(t, bdi_dirty * HZ / (8 * bw + 1));
+	t = min(t, bdi_dirty * HZ / (8 * bw + 1));
 
 	/*
 	 * The pause time will be settled within range (max_pause/4, max_pause).
--
cgit
From 635697c663f38106063d5659f0cf2e45afcd4bb5 Mon Sep 17 00:00:00 2001
From: Konstantin Khlebnikov
Date: Thu, 8 Dec 2011 14:33:51 -0800
Subject: vmscan: fix initial shrinker size handling

A shrinker function can return -1, meaning that it cannot do anything
without a risk of deadlock. For example prune_super() does this if it
cannot grab a superblock reference, even if nr_to_scan=0.

Currently we interpret this -1 as a ULONG_MAX size shrinker and evaluate
`total_scan' according to this. So the next time around this shrinker
can cause really big pressure. Let's skip such shrinkers instead.

Also make total_scan signed, otherwise the check (total_scan < 0) below
never works.

Signed-off-by: Konstantin Khlebnikov
Cc: Dave Chinner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

(limited to 'mm')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a1893c050795..f5255442ae2b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -247,14 +247,18 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
-		unsigned long total_scan;
-		unsigned long max_pass;
+		long total_scan;
+		long max_pass;
 		int shrink_ret = 0;
 		long nr;
 		long new_nr;
 		long batch_size = shrinker->batch ? shrinker->batch
 						  : SHRINK_BATCH;
 
+		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
+		if (max_pass <= 0)
+			continue;
+
 		/*
 		 * copy the current shrinker scan count into a local variable
 		 * and zero it so that other concurrent shrinker invocations
@@ -265,7 +269,6 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
 
 		total_scan = nr;
-		max_pass = do_shrinker_shrink(shrinker, shrink, 0);
 		delta = (4 * nr_pages_scanned) / shrinker->seeks;
 		delta *= max_pass;
 		do_div(delta, lru_pages + 1);
--
cgit
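A standalone C illustration (not kernel code) of the two problems the
shrinker patch fixes: a -1 return stored into an unsigned variable turns
into ULONG_MAX and looks like a gigantic cache, and a "< 0" test on an
unsigned type can never fire (most compilers warn about exactly that).

  #include <limits.h>
  #include <stdio.h>

  /* stand-in for a shrinker callback that answers "can't do anything now" */
  static long shrinker_count(void)
  {
  	return -1;
  }

  int main(void)
  {
  	unsigned long unsigned_max_pass = shrinker_count(); /* wraps to ULONG_MAX */
  	long signed_max_pass = shrinker_count();            /* stays -1 */

  	printf("unsigned: %lu (ULONG_MAX? %d)\n",
  	       unsigned_max_pass, unsigned_max_pass == ULONG_MAX);
  	printf("signed:   %ld -> skip this shrinker: %d\n",
  	       signed_max_pass, signed_max_pass <= 0);

  	if (unsigned_max_pass < 0)	/* always false for an unsigned type */
  		puts("never reached");
  	return 0;
  }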
From 83aeeada7c69f35e5100b27ec354335597a7a488 Mon Sep 17 00:00:00 2001
From: Konstantin Khlebnikov
Date: Thu, 8 Dec 2011 14:33:54 -0800
Subject: vmscan: use atomic-long for shrinker batching

Use atomic-long operations instead of looping around cmpxchg().

[akpm@linux-foundation.org: massage atomic.h inclusions]
Signed-off-by: Konstantin Khlebnikov
Cc: Dave Chinner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

(limited to 'mm')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f5255442ae2b..f54a05b7a61d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -183,7 +183,7 @@ static unsigned long zone_nr_lru_pages(struct zone *zone,
  */
 void register_shrinker(struct shrinker *shrinker)
 {
-	shrinker->nr = 0;
+	atomic_long_set(&shrinker->nr_in_batch, 0);
 	down_write(&shrinker_rwsem);
 	list_add_tail(&shrinker->list, &shrinker_list);
 	up_write(&shrinker_rwsem);
@@ -264,9 +264,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		 * and zero it so that other concurrent shrinker invocations
 		 * don't also do this scanning work.
 		 */
-		do {
-			nr = shrinker->nr;
-		} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
+		nr = atomic_long_xchg(&shrinker->nr_in_batch, 0);
 
 		total_scan = nr;
 		delta = (4 * nr_pages_scanned) / shrinker->seeks;
@@ -328,12 +326,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		 * manner that handles concurrent updates. If we exhausted the
 		 * scan, there is no need to do an update.
 		 */
-		do {
-			nr = shrinker->nr;
-			new_nr = total_scan + nr;
-			if (total_scan <= 0)
-				break;
-		} while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
+		if (total_scan > 0)
+			new_nr = atomic_long_add_return(total_scan,
+					&shrinker->nr_in_batch);
+		else
+			new_nr = atomic_long_read(&shrinker->nr_in_batch);
 
 		trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
 	}
--
cgit
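The same simplification can be sketched in portable C11 atomics (an
assumption of this example; the kernel uses its own atomic_long_t API, not
<stdatomic.h>): one unconditional exchange or add-return replaces a manual
compare-and-swap retry loop.

  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_long nr_in_batch;		/* leftover work carried between calls */

  /* old style: retry loop around compare-and-swap */
  static long take_batch_cmpxchg(void)
  {
  	long nr;

  	do {
  		nr = atomic_load(&nr_in_batch);
  	} while (!atomic_compare_exchange_weak(&nr_in_batch, &nr, 0));
  	return nr;
  }

  /* new style: a single exchange, like atomic_long_xchg() */
  static long take_batch_xchg(void)
  {
  	return atomic_exchange(&nr_in_batch, 0);
  }

  /* like atomic_long_add_return(): add and observe the new total */
  static long give_back(long leftover)
  {
  	return atomic_fetch_add(&nr_in_batch, leftover) + leftover;
  }

  int main(void)
  {
  	atomic_store(&nr_in_batch, 128);
  	printf("cmpxchg-style take: %ld\n", take_batch_cmpxchg());
  	printf("give back 32, now:  %ld\n", give_back(32));
  	printf("xchg-style take:    %ld\n", take_batch_xchg());
  	return 0;
  }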
From 1dfb059b9438633b0546c5431538a47f6ed99028 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Thu, 8 Dec 2011 14:33:57 -0800
Subject: thp: reduce khugepaged freezing latency

khugepaged can sometimes cause suspend to fail, requiring that the user
retry the suspend operation.

Use wait_event_freezable_timeout() instead of
schedule_timeout_interruptible() to avoid missing freezer wakeups. A
try_to_freeze() would have been needed in the khugepaged_alloc_hugepage
tight loop too in case of the allocation failing repeatedly, and
wait_event_freezable_timeout will provide it too.

khugepaged would still freeze just fine by trying again the next minute
but it's better if it freezes immediately.

Reported-by: Jiri Slaby
Signed-off-by: Andrea Arcangeli
Tested-by: Jiri Slaby
Cc: Tejun Heo
Cc: Oleg Nesterov
Cc: "Srivatsa S. Bhat"
Cc: "Rafael J. Wysocki"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/huge_memory.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

(limited to 'mm')

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4298abaae153..36b3d988b4ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2259,12 +2259,8 @@ static void khugepaged_do_scan(struct page **hpage)
 
 static void khugepaged_alloc_sleep(void)
 {
-	DEFINE_WAIT(wait);
-	add_wait_queue(&khugepaged_wait, &wait);
-	schedule_timeout_interruptible(
-		msecs_to_jiffies(
-			khugepaged_alloc_sleep_millisecs));
-	remove_wait_queue(&khugepaged_wait, &wait);
+	wait_event_freezable_timeout(khugepaged_wait, false,
+			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
 }
 
 #ifndef CONFIG_NUMA
@@ -2313,14 +2309,10 @@ static void khugepaged_loop(void)
 		if (unlikely(kthread_should_stop()))
 			break;
 		if (khugepaged_has_work()) {
-			DEFINE_WAIT(wait);
 			if (!khugepaged_scan_sleep_millisecs)
 				continue;
-			add_wait_queue(&khugepaged_wait, &wait);
-			schedule_timeout_interruptible(
-				msecs_to_jiffies(
-					khugepaged_scan_sleep_millisecs));
-			remove_wait_queue(&khugepaged_wait, &wait);
+			wait_event_freezable_timeout(khugepaged_wait, false,
+			    msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
 		} else if (khugepaged_enabled())
 			wait_event_freezable(khugepaged_wait,
 					     khugepaged_wait_event());
--
cgit
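A loose userspace analogue of what the switch to
wait_event_freezable_timeout() buys (assumption: POSIX threads stand in for
the kernel's freezer and wait queue): sleep with a timeout, but wake up at
once when a request is raised instead of ignoring it until the timer
expires. Build with -lpthread.

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
  static bool stop_requested;

  /* sleep up to `secs`, but return early if stop_requested is raised */
  static void interruptible_sleep(int secs)
  {
  	struct timespec deadline;

  	clock_gettime(CLOCK_REALTIME, &deadline);
  	deadline.tv_sec += secs;

  	pthread_mutex_lock(&lock);
  	while (!stop_requested &&
  	       pthread_cond_timedwait(&cond, &lock, &deadline) == 0)
  		;	/* woken up: re-check the condition */
  	pthread_mutex_unlock(&lock);
  }

  static void *requester(void *arg)
  {
  	(void)arg;
  	sleep(1);			/* raise the request after one second */
  	pthread_mutex_lock(&lock);
  	stop_requested = true;
  	pthread_cond_signal(&cond);
  	pthread_mutex_unlock(&lock);
  	return NULL;
  }

  int main(void)
  {
  	pthread_t t;

  	pthread_create(&t, NULL, requester, NULL);
  	interruptible_sleep(10);	/* returns after ~1s, not 10s */
  	puts(stop_requested ? "woke early for the request" : "timed out");
  	pthread_join(t, NULL);
  	return 0;
  }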
From 58a84aa92723d1ac3e1cc4e3b0ff49291663f7e1 Mon Sep 17 00:00:00 2001
From: Youquan Song
Date: Thu, 8 Dec 2011 14:34:18 -0800
Subject: thp: set compound tail page _count to zero

Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times. But the current kernel does not
set page_tail->_count to zero if a 1GB page is utilized. So when an
IOMMU 1GB page is used by KVM, it will result in a kernel oops because
a tail page's _count does not equal zero.

  kernel BUG at include/linux/mm.h:386!
  invalid opcode: 0000 [#1] SMP
  Call Trace:
   gup_pud_range+0xb8/0x19d
   get_user_pages_fast+0xcb/0x192
   ? trace_hardirqs_off+0xd/0xf
   hva_to_pfn+0x119/0x2f2
   gfn_to_pfn_memslot+0x2c/0x2e
   kvm_iommu_map_pages+0xfd/0x1c1
   kvm_iommu_map_memslots+0x7c/0xbd
   kvm_iommu_map_guest+0xaa/0xbf
   kvm_vm_ioctl_assigned_device+0x2ef/0xa47
   kvm_vm_ioctl+0x36c/0x3a2
   do_vfs_ioctl+0x49e/0x4e4
   sys_ioctl+0x5a/0x7c
   system_call_fastpath+0x16/0x1b
  RIP gup_huge_pud+0xf2/0x159

Signed-off-by: Youquan Song
Reviewed-by: Andrea Arcangeli
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/hugetlb.c    | 1 +
 mm/page_alloc.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb28a5f9db8d..73f17c0293c0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -576,6 +576,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order)
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		__SetPageTail(p);
+		set_page_count(p, 0);
 		p->first_page = page;
 	}
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d89d8b..850009a7101e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -356,8 +356,8 @@ void prep_compound_page(struct page *page, unsigned long order)
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++) {
 		struct page *p = page + i;
-
 		__SetPageTail(p);
+		set_page_count(p, 0);
 		p->first_page = page;
 	}
 }
--
cgit
From 09761333ed47e899cc1482c13090b95f3f711971 Mon Sep 17 00:00:00 2001
From: Hillf Danton
Date: Thu, 8 Dec 2011 14:34:20 -0800
Subject: mm/migrate.c: pair unlock_page() and lock_page() when migrating huge pages

Avoid unlocking an unlocked page: if we failed to lock the page, do not
unlock it on the way out.

Signed-off-by: Hillf Danton
Cc: Naoya Horiguchi
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/migrate.c b/mm/migrate.c
index 578e29174fa6..177aca424a06 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -871,9 +871,9 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 
 	if (anon_vma)
 		put_anon_vma(anon_vma);
-out:
 	unlock_page(hpage);
 
+out:
 	if (rc != -EAGAIN) {
 		list_del(&hpage->lru);
 		put_page(hpage);
--
cgit
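The migrate.c fix above is an instance of a general rule: only unlock what
you actually locked, so an error label that skips the lock must also skip
the unlock. A minimal standalone C sketch of the pattern (pthread mutexes
standing in for the page lock; process_page() is a made-up name):

  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

  static int process_page(int force)
  {
  	int rc = -1;

  	if (pthread_mutex_trylock(&page_lock) != 0) {
  		if (!force)
  			goto out;	/* lock not taken: must not unlock */
  		pthread_mutex_lock(&page_lock);
  	}

  	rc = 0;				/* ... work on the page under the lock ... */

  	pthread_mutex_unlock(&page_lock);	/* pairs with the successful lock */
  out:
  	return rc;
  }

  int main(void)
  {
  	printf("uncontended caller: %d\n", process_page(0));	/* 0: got the lock */
  	pthread_mutex_lock(&page_lock);				/* simulate contention */
  	printf("contended caller:   %d\n", process_page(0));	/* -1: skipped, no bogus unlock */
  	pthread_mutex_unlock(&page_lock);
  	return 0;
  }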
From d021563888312018ca65681096f62e36c20e63cc Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Thu, 8 Dec 2011 14:34:27 -0800
Subject: mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocks

setup_zone_migrate_reserve() expects that zone->start_pfn starts at a
pageblock_nr_pages aligned pfn; otherwise we could access beyond an
existing memblock, resulting in the following panic if
CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

  IP: [] setup_zone_migrate_reserve+0xcd/0x180
  *pdpt = 0000000000000000 *pde = f000ff53f000ff53
  Oops: 0000 [#1] SMP
  Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
  EIP: 0060:[] EFLAGS: 00010006 CPU: 0
  EIP is at setup_zone_migrate_reserve+0xcd/0x180
  EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
  ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
  Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
  Call Trace:
   [] __setup_per_zone_wmarks+0xec/0x160
   [] setup_per_zone_wmarks+0xf/0x20
   [] init_per_zone_wmark_min+0x27/0x86
   [] do_one_initcall+0x2b/0x160
   [] kernel_init+0xbe/0x157
   [] kernel_thread_helper+0x6/0xd
  Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
  EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
  CR2: 00000000000012b4

We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
highstart_pfn = 0x36ffe.

The issue was introduced in 3.0-rc1 by commit 6d3163ce ("mm: check if
any page in a pageblock is reserved before marking it MIGRATE_RESERVE").

Make sure that start_pfn is always aligned to pageblock_nr_pages to
ensure that pfn_valid is always called at the start of each pageblock.
Architectures with holes in pageblocks will be correctly handled by
pfn_valid_within in pageblock_is_reserved.

Signed-off-by: Michal Hocko
Signed-off-by: Mel Gorman
Tested-by: Dang Bo
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli
Cc: David Rientjes
Cc: Arve Hjønnevåg
Cc: KOSAKI Motohiro
Cc: John Stultz
Cc: Dave Hansen
Cc: [3.0+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/page_alloc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 850009a7101e..2b8ba3aebf6e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3377,9 +3377,15 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 	unsigned long block_migratetype;
 	int reserve;
 
-	/* Get the start pfn, end pfn and the number of blocks to reserve */
+	/*
+	 * Get the start pfn, end pfn and the number of blocks to reserve
+	 * We have to be careful to be aligned to pageblock_nr_pages to
+	 * make sure that we always check pfn_valid for the first page in
+	 * the block.
+	 */
 	start_pfn = zone->zone_start_pfn;
 	end_pfn = start_pfn + zone->spanned_pages;
+	start_pfn = roundup(start_pfn, pageblock_nr_pages);
 	reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
 							pageblock_order;
--
cgit
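The fix itself is one line of alignment arithmetic. A standalone check with
illustrative values (512-page pageblocks and the unaligned zone start from
the oops report):

  #include <stdio.h>

  #define ROUNDUP(x, y) ((((x) + (y) - 1) / (y)) * (y))

  int main(void)
  {
  	unsigned long pageblock_nr_pages = 512;	/* 2MB pageblocks of 4KB pages */
  	unsigned long zone_start_pfn = 0x36ffe;	/* unaligned zone start */
  	unsigned long start_pfn = ROUNDUP(zone_start_pfn, pageblock_nr_pages);

  	printf("zone_start_pfn    = %#lx\n", zone_start_pfn);
  	printf("aligned start_pfn = %#lx\n", start_pfn);	/* 0x37000 */
  	return 0;
  }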
From 1368edf0647ac112d8cfa6ce47257dc950c50f5c Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Thu, 8 Dec 2011 14:34:30 -0800
Subject: mm: vmalloc: check for page allocation failure before vmlist insertion

Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
/proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
they are fully initialised. Unfortunately, it did not check that
__vmalloc_area_node() successfully populated the area. In the event of
allocation failure, the vmalloc area is freed but the pointer to freed
memory is inserted into the vmlist, leading to a crash later in
get_vmalloc_info().

This patch adds a check for __vmalloc_area_node() failure within
__vmalloc_node_range. It does not use "goto fail" as in the previous
error path because a warning was already displayed by
__vmalloc_area_node() before it called vfree in its failure path.

Credit goes to Luciano Chavez for doing all the real work of identifying
exactly where the problem was.

Signed-off-by: Mel Gorman
Reported-by: Luciano Chavez
Tested-by: Luciano Chavez
Reviewed-by: Rik van Riel
Acked-by: David Rientjes
Cc: [3.1.x+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmalloc.c | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'mm')

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3231bf332878..1d8b32f07139 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1633,6 +1633,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		goto fail;
 
 	addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
+	if (!addr)
+		return NULL;
 
 	/*
 	 * In this function, newly allocated vm_struct is not added
--
cgit
From 9f57bd4d6dc69a4e3bf43044fa00fcd24dd363e3 Mon Sep 17 00:00:00 2001
From: Eugene Surovegin
Date: Thu, 15 Dec 2011 11:25:59 -0800
Subject: percpu: fix per_cpu_ptr_to_phys() handling of non-page-aligned addresses

per_cpu_ptr_to_phys() incorrectly rounds its result down to the page
boundary for the non-kmalloc case, which is bogus for any
non-page-aligned address.

This affects the only in-tree user of this function - the sysfs handler
for the per-cpu 'crash_notes' physical address.

The trouble is that the crash_notes per-cpu variable is not
page-aligned:

  crash_notes = 0xc08e8ed4
  PER-CPU OFFSET VALUES:
   CPU 0: 3711f000
   CPU 1: 37129000
   CPU 2: 37133000
   CPU 3: 3713d000

So, the per-cpu addresses are:

   crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
   crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
   crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
   crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4

However, /sys/devices/system/cpu/cpu*/crash_notes says:

   /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
   /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
   /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
   /sys/devices/system/cpu/cpu3/crash_notes: 36b39000

As you can see, all values are rounded down to a page boundary.
Consequently, this is where kexec sets up the NOTE segments, and thus
where the secondary kernel is looking for them. However, when the first
kernel crashes, it saves the notes to the unaligned addresses, where
they are not found.

Fix it by adding offset_in_page() to the translated page address.

-tj: Combined Eugene's and Petr's commit messages.

Signed-off-by: Eugene Surovegin
Signed-off-by: Tejun Heo
Reported-by: Petr Tesarik
Cc: stable@kernel.org
---
 mm/percpu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'mm')

diff --git a/mm/percpu.c b/mm/percpu.c
index 3bb810a72006..716eb4acf2fc 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1023,9 +1023,11 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr)
 		if (!is_vmalloc_addr(addr))
 			return __pa(addr);
 		else
-			return page_to_phys(vmalloc_to_page(addr));
+			return page_to_phys(vmalloc_to_page(addr)) +
+			       offset_in_page(addr);
 	} else
-		return page_to_phys(pcpu_addr_to_page(addr));
+		return page_to_phys(pcpu_addr_to_page(addr)) +
+		       offset_in_page(addr);
 }
 
 /**
--
cgit
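A quick userspace model of the percpu fix, using the CPU 0 numbers from the
changelog (the PAGE_SIZE and offset_in_page() definitions here are local
stand-ins, not the kernel's):

  #include <stdio.h>

  #define PAGE_SIZE		4096UL
  #define offset_in_page(a)	((unsigned long)(a) & (PAGE_SIZE - 1))

  int main(void)
  {
  	unsigned long vaddr = 0xf7a07ed4UL;	/* crash_notes on CPU 0 */
  	unsigned long page_phys = 0x36b57000UL;	/* what page_to_phys() yields */

  	unsigned long broken = page_phys;			/* old: offset dropped */
  	unsigned long fixed = page_phys + offset_in_page(vaddr);/* new: offset restored */

  	printf("broken: %#lx\n", broken);	/* 0x36b57000 */
  	printf("fixed:  %#lx\n", fixed);	/* 0x36b57ed4 */
  	return 0;
  }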
From a41c58a6665cc995e237303b05db42100b71b65e Mon Sep 17 00:00:00 2001
From: Hillf Danton
Date: Mon, 19 Dec 2011 17:11:57 -0800
Subject: memcg: keep root group unchanged if creation fails

If the request is to create a non-root group and we fail to meet it, we
should leave the root unchanged.

Signed-off-by: Hillf Danton
Acked-by: Hugh Dickins
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Michal Hocko
Cc: Balbir Singh
Cc: David Rientjes
Cc: Andrea Arcangeli
Cc: Johannes Weiner
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/memcontrol.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

(limited to 'mm')

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6aff93c98aca..b63f5f7dfa07 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4907,9 +4907,9 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		int cpu;
 		enable_swap_cgroup();
 		parent = NULL;
-		root_mem_cgroup = memcg;
 		if (mem_cgroup_soft_limit_tree_init())
 			goto free_out;
+		root_mem_cgroup = memcg;
 		for_each_possible_cpu(cpu) {
 			struct memcg_stock_pcp *stock =
 						&per_cpu(memcg_stock, cpu);
@@ -4948,7 +4948,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	return &memcg->css;
 free_out:
 	__mem_cgroup_free(memcg);
-	root_mem_cgroup = NULL;
 	return ERR_PTR(error);
 }
--
cgit
From ff05b6f7ae762b6eb464183eec994b28ea09f6dd Mon Sep 17 00:00:00 2001
From: Frantisek Hrbata
Date: Mon, 19 Dec 2011 17:11:59 -0800
Subject: oom: fix integer overflow of points in oom_badness

An integer overflow will happen on 64bit archs if a task's sum of rss,
swapents and nr_ptes exceeds (2^31)/1000. This was introduced by commit
f755a04 ("oom: use pte pages in OOM score"), where the oom score
computation was divided into several steps and is no longer computed as
one expression in unsigned long (rss, swapents and nr_ptes are unsigned
long), where the result value assigned to points (int) is in the range
(1..1000). So there could be an int overflow while computing

  176	points *= 1000;

and points may end up with a negative value, meaning the oom score for
a mem hog task will be one:

  196	if (points <= 0)
  197		return 1;

For example:

  [ 3366]     0  3366 35390480 24303939       5       0             0 oom01
  Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child

Here the oom01 process consumes more than 24303939 (rss) * 4096 ~= 92GB
of physical memory, but its oom score is one.

In this situation the mem hog task is skipped and the oom killer kills
another, most probably innocent, task with an oom score greater than
one.

The points variable should be of type long instead of int to prevent
the int overflow.

Signed-off-by: Frantisek Hrbata
Acked-by: KOSAKI Motohiro
Acked-by: Oleg Nesterov
Acked-by: David Rientjes
Cc: [2.6.36+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/oom_kill.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 76f2c5ae908e..069b64e521fc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,7 +176,7 @@ static bool oom_unkillable_task(struct task_struct *p,
 unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 			 const nodemask_t *nodemask, unsigned long totalpages)
 {
-	int points;
+	long points;
 
 	if (oom_unkillable_task(p, mem, nodemask))
 		return 0;
--
cgit
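The overflow is easy to reproduce in isolation. The snippet below uses the
rss value from the changelog example; the truncation to a 32-bit int is
what the old `int points; points *= 1000;` amounted to, and it flips the
score negative so it gets clamped to 1.

  #include <stdio.h>

  int main(void)
  {
  	long pages = 24303939;		/* rss of the memory hog from the example */

  	int points_int = (int)(pages * 1000);	/* wraps: the old `int points` */
  	long points_long = pages * 1000;	/* fits comfortably in a 64-bit long */

  	printf("int points  after *1000: %d\n", points_int);	/* negative */
  	printf("long points after *1000: %ld\n", points_long);
  	printf("old badness score: %d\n", points_int <= 0 ? 1 : points_int);
  	return 0;
  }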
From 0006526d78e93c3684c806bf7cf3f67dfa49c3c8 Mon Sep 17 00:00:00 2001
From: Kautuk Consul
Date: Mon, 19 Dec 2011 17:12:04 -0800
Subject: mm/vmalloc.c: remove static declaration of va from __get_vm_area_node

Static storage is not required for the struct vmap_area in
__get_vm_area_node. Remove "static" so the variable is stored on the
stack instead.

Signed-off-by: Kautuk Consul
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1d8b32f07139..27be2f0d4cb7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1290,7 +1290,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 		unsigned long align, unsigned long flags, unsigned long start,
 		unsigned long end, int node, gfp_t gfp_mask, void *caller)
 {
-	static struct vmap_area *va;
+	struct vmap_area *va;
 	struct vm_struct *area;
 
 	BUG_ON(in_interrupt());
--
cgit
From e6f67b8c05f5e129e126f4409ddac6f25f58ffcb Mon Sep 17 00:00:00 2001
From: Dave Kleikamp
Date: Wed, 21 Dec 2011 11:05:48 -0600
Subject: vfs: __read_cache_page should use gfp argument rather than GFP_KERNEL

lockdep reports a deadlock in jfs because a special inode's rw semaphore
is taken recursively. The mapping's gfp mask is GFP_NOFS, but it is not
used when __read_cache_page() calls add_to_page_cache_lru().

Signed-off-by: Dave Kleikamp
Acked-by: Hugh Dickins
Acked-by: Al Viro
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds
---
 mm/filemap.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

(limited to 'mm')

diff --git a/mm/filemap.c b/mm/filemap.c
index c106d3b3cc64..5f0a3c91fdac 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1828,7 +1828,7 @@ repeat:
 		page = __page_cache_alloc(gfp | __GFP_COLD);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
-		err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+		err = add_to_page_cache_lru(page, mapping, index, gfp);
 		if (unlikely(err)) {
 			page_cache_release(page);
 			if (err == -EEXIST)
@@ -1925,10 +1925,7 @@ static struct page *wait_on_page_read(struct page *page)
  * @gfp: the page allocator flags to use if allocating
  *
  * This is the same as "read_mapping_page(mapping, index, NULL)", but with
- * any new page allocations done using the specified allocation flags. Note
- * that the Radix tree operations will still use GFP_KERNEL, so you can't
- * expect to do this atomically or anything like that - but you can pass in
- * other page requirements.
+ * any new page allocations done using the specified allocation flags.
  *
  * If the page does not get brought uptodate, return -EIO.
  */
--
cgit
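As an aside on the __get_vm_area_node() patch above: a `static` local is a
single shared slot, so every caller (and every CPU) writes through the same
storage, which is exactly what an automatic variable avoids. A tiny
illustrative-only C program (shared_slot() is a made-up name):

  #include <stdio.h>

  static int *shared_slot(void)
  {
  	static int value;	/* one instance shared by every call and thread */
  	return &value;
  }

  int main(void)
  {
  	int *a = shared_slot();	/* "first caller" */
  	int *b = shared_slot();	/* "second caller" */

  	*a = 1;
  	*b = 2;			/* clobbers what the first caller stored */
  	printf("same storage? %s, *a = %d\n", a == b ? "yes" : "no", *a);
  	return 0;
  }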