From e020d5bd8a730757b565b18d620240f71c3e21fe Mon Sep 17 00:00:00 2001 From: Steven Miao Date: Mon, 23 Jun 2014 13:22:02 -0700 Subject: mm: nommu: per-thread vma cache fix mm could be removed from current task struct, using previous vma->vm_mm It will crash on blackfin after updated to Linux 3.15. The commit "mm: per-thread vma caching" caused the crash. mm could be removed from current task struct before mmput()-> exit_mmap()-> delete_vma_from_mm() the detailed fault information: NULL pointer access Kernel OOPS in progress Deferred Exception context CURRENT PROCESS: COMM=modprobe PID=278 CPU=0 invalid mm return address: [0x000531de]; contents of: 0x000531b0: c727 acea 0c42 181d 0000 0000 0000 a0a8 0x000531c0: b090 acaa 0c42 1806 0000 0000 0000 a0e8 0x000531d0: b0d0 e801 0000 05b3 0010 e522 0046 [a090] 0x000531e0: 6408 b090 0c00 17cc 3042 e3ff f37b 2fc8 CPU: 0 PID: 278 Comm: modprobe Not tainted 3.15.0-ADI-2014R1-pre-00345-gea9f446 #25 task: 0572b720 ti: 0569e000 task.ti: 0569e000 Compiled for cpu family 0x27fe (Rev 0), but running on:0x0000 (Rev 0) ADSP-BF609-0.0 500(MHz CCLK) 125(MHz SCLK) (mpu off) Linux version 3.15.0-ADI-2014R1-pre-00345-gea9f446 (steven@steven-OptiPlex-390) (gcc version 4.3.5 (ADI-trunk/svn-5962) ) #25 Tue Jun 10 17:47:46 CST 2014 SEQUENCER STATUS: Not tainted SEQSTAT: 00000027 IPEND: 8008 IMASK: ffff SYSCFG: 2806 EXCAUSE : 0x27 physical IVG3 asserted : <0xffa00744> { _trap + 0x0 } physical IVG15 asserted : <0xffa00d68> { _evt_system_call + 0x0 } logical irq 6 mapped : <0xffa003bc> { _bfin_coretmr_interrupt + 0x0 } logical irq 7 mapped : <0x00008828> { _bfin_fault_routine + 0x0 } logical irq 11 mapped : <0x00007724> { _l2_ecc_err + 0x0 } logical irq 13 mapped : <0x00008828> { _bfin_fault_routine + 0x0 } logical irq 39 mapped : <0x00150788> { _bfin_twi_interrupt_entry + 0x0 } logical irq 40 mapped : <0x00150788> { _bfin_twi_interrupt_entry + 0x0 } RETE: <0x00000000> /* Maybe null pointer? */ RETN: <0x0569fe50> /* kernel dynamic memory (maybe user-space) */ RETX: <0x00000480> /* Maybe fixed code section */ RETS: <0x00053384> { _exit_mmap + 0x28 } PC : <0x000531de> { _delete_vma_from_mm + 0x92 } DCPLB_FAULT_ADDR: <0x00000008> /* Maybe null pointer? */ ICPLB_FAULT_ADDR: <0x000531de> { _delete_vma_from_mm + 0x92 } PROCESSOR STATE: R0 : 00000004 R1 : 0569e000 R2 : 00bf3db4 R3 : 00000000 R4 : 057f9800 R5 : 00000001 R6 : 0569ddd0 R7 : 0572b720 P0 : 0572b854 P1 : 00000004 P2 : 00000000 P3 : 0569dda0 P4 : 0572b720 P5 : 0566c368 FP : 0569fe5c SP : 0569fd74 LB0: 057f523f LT0: 057f523e LC0: 00000000 LB1: 0005317c LT1: 00053172 LC1: 00000002 B0 : 00000000 L0 : 00000000 M0 : 0566f5bc I0 : 00000000 B1 : 00000000 L1 : 00000000 M1 : 00000000 I1 : ffffffff B2 : 00000001 L2 : 00000000 M2 : 00000000 I2 : 00000000 B3 : 00000000 L3 : 00000000 M3 : 00000000 I3 : 057f8000 A0.w: 00000000 A0.x: 00000000 A1.w: 00000000 A1.x: 00000000 USP : 056ffcf8 ASTAT: 02003024 Hardware Trace: 0 Target : <0x00003fb8> { _trap_c + 0x0 } Source : <0xffa006d8> { _exception_to_level5 + 0xa0 } JUMP.L 1 Target : <0xffa00638> { _exception_to_level5 + 0x0 } Source : <0xffa004f2> { _bfin_return_from_exception + 0x6 } RTX 2 Target : <0xffa004ec> { _bfin_return_from_exception + 0x0 } Source : <0xffa00590> { _ex_trap_c + 0x70 } JUMP.S 3 Target : <0xffa00520> { _ex_trap_c + 0x0 } Source : <0xffa0076e> { _trap + 0x2a } JUMP (P4) 4 Target : <0xffa00744> { _trap + 0x0 } FAULT : <0x000531de> { _delete_vma_from_mm + 0x92 } P0 = W[P2 + 2] Source : <0x000531da> { _delete_vma_from_mm + 0x8e } P2 = [P4 + 0x18] 5 Target : <0x000531da> { _delete_vma_from_mm + 0x8e } Source : <0x00053176> { _delete_vma_from_mm + 0x2a } IF CC JUMP pcrel 6 Target : <0x0005314c> { _delete_vma_from_mm + 0x0 } Source : <0x00053380> { _exit_mmap + 0x24 } JUMP.L 7 Target : <0x00053378> { _exit_mmap + 0x1c } Source : <0x00053394> { _exit_mmap + 0x38 } IF !CC JUMP pcrel (BP) 8 Target : <0x00053390> { _exit_mmap + 0x34 } Source : <0xffa020e0> { __cond_resched + 0x20 } RTS 9 Target : <0xffa020c0> { __cond_resched + 0x0 } Source : <0x0005338c> { _exit_mmap + 0x30 } JUMP.L 10 Target : <0x0005338c> { _exit_mmap + 0x30 } Source : <0x0005333a> { _delete_vma + 0xb2 } RTS 11 Target : <0x00053334> { _delete_vma + 0xac } Source : <0x0005507a> { _kmem_cache_free + 0xba } RTS 12 Target : <0x00055068> { _kmem_cache_free + 0xa8 } Source : <0x0005505e> { _kmem_cache_free + 0x9e } IF !CC JUMP pcrel (BP) 13 Target : <0x00055052> { _kmem_cache_free + 0x92 } Source : <0x0005501a> { _kmem_cache_free + 0x5a } IF CC JUMP pcrel 14 Target : <0x00054ff4> { _kmem_cache_free + 0x34 } Source : <0x00054fce> { _kmem_cache_free + 0xe } IF CC JUMP pcrel (BP) 15 Target : <0x00054fc0> { _kmem_cache_free + 0x0 } Source : <0x00053330> { _delete_vma + 0xa8 } JUMP.L Kernel Stack Stack info: SP: [0x0569ff24] <0x0569ff24> /* kernel dynamic memory (maybe user-space) */ Memory from 0x0569ff20 to 056a0000 0569ff20: 00000001 [04e8da5a] 00008000 00000000 00000000 056a0000 04e8da5a 04e8da5a 0569ff40: 04eb9eea ffa00dce 02003025 04ea09c5 057f523f 04ea09c4 057f523e 00000000 0569ff60: 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000000 0569ff80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0569ffa0: 0566f5bc 057f8000 057f8000 00000001 04ec0170 056ffcf8 056ffd04 057f9800 0569ffc0: 04d1d498 057f9800 057f8fe4 057f8ef0 00000001 057f928c 00000001 00000001 0569ffe0: 057f9800 00000000 00000008 00000007 00000001 00000001 00000001 <00002806> Return addresses in stack: address : <0x00002806> { _show_cpuinfo + 0x2d2 } Modules linked in: Kernel panic - not syncing: Kernel exception [ end Kernel panic - not syncing: Kernel exception Signed-off-by: Steven Miao Acked-by: Davidlohr Bueso Reviewed-by: Rik van Riel Acked-by: David Rientjes Cc: [3.15.x] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/nommu.c b/mm/nommu.c index b78e3a8f5ee7..4a852f6c5709 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -786,7 +786,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma) for (i = 0; i < VMACACHE_SIZE; i++) { /* if the vma is cached, invalidate the entire cache */ if (curr->vmacache[i] == vma) { - vmacache_invalidate(curr->mm); + vmacache_invalidate(mm); break; } } -- cgit From 13ace4d0d9db40e10ecd66dfda14e297571be813 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 23 Jun 2014 13:22:03 -0700 Subject: tmpfs: ZERO_RANGE and COLLAPSE_RANGE not currently supported I was well aware of FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE support being added to fallocate(); but didn't realize until now that I had been too stupid to future-proof shmem_fallocate() against new additions. -EOPNOTSUPP instead of going on to ordinary fallocation. Signed-off-by: Hugh Dickins Reviewed-by: Lukas Czerner Cc: [3.15] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/shmem.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'mm') diff --git a/mm/shmem.c b/mm/shmem.c index f484c276e994..91b912b37b8c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1724,6 +1724,9 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, pgoff_t start, index, end; int error; + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) + return -EOPNOTSUPP; + mutex_lock(&inode->i_mutex); if (mode & FALLOC_FL_PUNCH_HOLE) { -- cgit From 4a705fef986231a3e7a6b1a6d3c37025f021f49f Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Mon, 23 Jun 2014 13:22:03 -0700 Subject: hugetlb: fix copy_hugetlb_page_range() to handle migration/hwpoisoned entry There's a race between fork() and hugepage migration, as a result we try to "dereference" a swap entry as a normal pte, causing kernel panic. The cause of the problem is that copy_hugetlb_page_range() can't handle "swap entry" family (migration entry and hwpoisoned entry) so let's fix it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Naoya Horiguchi Acked-by: Hugh Dickins Cc: Christoph Lameter Cc: [2.6.37+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 71 ++++++++++++++++++++++++++++++++++++------------------------ 1 file changed, 43 insertions(+), 28 deletions(-) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 226910cb7c9b..2024bbd573d2 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2520,6 +2520,31 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma, update_mmu_cache(vma, address, ptep); } +static int is_hugetlb_entry_migration(pte_t pte) +{ + swp_entry_t swp; + + if (huge_pte_none(pte) || pte_present(pte)) + return 0; + swp = pte_to_swp_entry(pte); + if (non_swap_entry(swp) && is_migration_entry(swp)) + return 1; + else + return 0; +} + +static int is_hugetlb_entry_hwpoisoned(pte_t pte) +{ + swp_entry_t swp; + + if (huge_pte_none(pte) || pte_present(pte)) + return 0; + swp = pte_to_swp_entry(pte); + if (non_swap_entry(swp) && is_hwpoison_entry(swp)) + return 1; + else + return 0; +} int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma) @@ -2559,10 +2584,26 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, dst_ptl = huge_pte_lock(h, dst, dst_pte); src_ptl = huge_pte_lockptr(h, src, src_pte); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); - if (!huge_pte_none(huge_ptep_get(src_pte))) { + entry = huge_ptep_get(src_pte); + if (huge_pte_none(entry)) { /* skip none entry */ + ; + } else if (unlikely(is_hugetlb_entry_migration(entry) || + is_hugetlb_entry_hwpoisoned(entry))) { + swp_entry_t swp_entry = pte_to_swp_entry(entry); + + if (is_write_migration_entry(swp_entry) && cow) { + /* + * COW mappings require pages in both + * parent and child to be set to read. + */ + make_migration_entry_read(&swp_entry); + entry = swp_entry_to_pte(swp_entry); + set_huge_pte_at(src, addr, src_pte, entry); + } + set_huge_pte_at(dst, addr, dst_pte, entry); + } else { if (cow) huge_ptep_set_wrprotect(src, addr, src_pte); - entry = huge_ptep_get(src_pte); ptepage = pte_page(entry); get_page(ptepage); page_dup_rmap(ptepage); @@ -2578,32 +2619,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, return ret; } -static int is_hugetlb_entry_migration(pte_t pte) -{ - swp_entry_t swp; - - if (huge_pte_none(pte) || pte_present(pte)) - return 0; - swp = pte_to_swp_entry(pte); - if (non_swap_entry(swp) && is_migration_entry(swp)) - return 1; - else - return 0; -} - -static int is_hugetlb_entry_hwpoisoned(pte_t pte) -{ - swp_entry_t swp; - - if (huge_pte_none(pte) || pte_present(pte)) - return 0; - swp = pte_to_swp_entry(pte); - if (non_swap_entry(swp) && is_hwpoison_entry(swp)) - return 1; - else - return 0; -} - void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long start, unsigned long end, struct page *ref_page) -- cgit From 7cd2b0a34ab8e4db971920eef8982f985441adfb Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Mon, 23 Jun 2014 13:22:04 -0700 Subject: mm, pcp: allow restoring percpu_pagelist_fraction default Oleg reports a division by zero error on zero-length write() to the percpu_pagelist_fraction sysctl: divide error: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000 RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120 RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246 RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010 RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50 R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060 R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800 FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0 Call Trace: proc_sys_call_handler+0xb3/0xc0 proc_sys_write+0x14/0x20 vfs_write+0xba/0x1e0 SyS_write+0x46/0xb0 tracesys+0xe1/0xe6 However, if the percpu_pagelist_fraction sysctl is set by the user, it is also impossible to restore it to the kernel default since the user cannot write 0 to the sysctl. This patch allows the user to write 0 to restore the default behavior. It still requires a fraction equal to or larger than 8, however, as stated by the documentation for sanity. If a value in the range [1, 7] is written, the sysctl will return EINVAL. This successfully solves the divide by zero issue at the same time. Signed-off-by: David Rientjes Reported-by: Oleg Drokin Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 40 ++++++++++++++++++++++++++++------------ 1 file changed, 28 insertions(+), 12 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4f59fa29eda8..20d17f8266fe 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -69,6 +69,7 @@ /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ static DEFINE_MUTEX(pcp_batch_high_lock); +#define MIN_PERCPU_PAGELIST_FRACTION (8) #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID DEFINE_PER_CPU(int, numa_node); @@ -4145,7 +4146,7 @@ static void __meminit zone_init_free_lists(struct zone *zone) memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY) #endif -static int __meminit zone_batchsize(struct zone *zone) +static int zone_batchsize(struct zone *zone) { #ifdef CONFIG_MMU int batch; @@ -4261,8 +4262,8 @@ static void pageset_set_high(struct per_cpu_pageset *p, pageset_update(&p->pcp, high, batch); } -static void __meminit pageset_set_high_and_batch(struct zone *zone, - struct per_cpu_pageset *pcp) +static void pageset_set_high_and_batch(struct zone *zone, + struct per_cpu_pageset *pcp) { if (percpu_pagelist_fraction) pageset_set_high(pcp, @@ -5881,23 +5882,38 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { struct zone *zone; - unsigned int cpu; + int old_percpu_pagelist_fraction; int ret; + mutex_lock(&pcp_batch_high_lock); + old_percpu_pagelist_fraction = percpu_pagelist_fraction; + ret = proc_dointvec_minmax(table, write, buffer, length, ppos); - if (!write || (ret < 0)) - return ret; + if (!write || ret < 0) + goto out; + + /* Sanity checking to avoid pcp imbalance */ + if (percpu_pagelist_fraction && + percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION) { + percpu_pagelist_fraction = old_percpu_pagelist_fraction; + ret = -EINVAL; + goto out; + } + + /* No change? */ + if (percpu_pagelist_fraction == old_percpu_pagelist_fraction) + goto out; - mutex_lock(&pcp_batch_high_lock); for_each_populated_zone(zone) { - unsigned long high; - high = zone->managed_pages / percpu_pagelist_fraction; + unsigned int cpu; + for_each_possible_cpu(cpu) - pageset_set_high(per_cpu_ptr(zone->pageset, cpu), - high); + pageset_set_high_and_batch(zone, + per_cpu_ptr(zone->pageset, cpu)); } +out: mutex_unlock(&pcp_batch_high_lock); - return 0; + return ret; } int hashdist = HASHDIST_DEFAULT; -- cgit From 5338a9372234f8b782c7d78f0355e1cb21d02468 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 23 Jun 2014 13:22:05 -0700 Subject: mm: thp: fix DEBUG_PAGEALLOC oops in copy_page_rep() Trinity has for over a year been reporting a CONFIG_DEBUG_PAGEALLOC oops in copy_page_rep() called from copy_user_huge_page() called from do_huge_pmd_wp_page(). I believe this is a DEBUG_PAGEALLOC false positive, due to the source page being split, and a tail page freed, while copy is in progress; and not a problem without DEBUG_PAGEALLOC, since the pmd_same() check will prevent a miscopy from being made visible. Fix by adding get_user_huge_page() and put_user_huge_page(): reducing to the usual get_page() and put_page() on head page in the usual config; but get and put references to all of the tail pages when DEBUG_PAGEALLOC. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hugh Dickins Acked-by: Kirill A. Shutemov Reported-by: Sasha Levin Tested-by: Sasha Levin Cc: Dave Jones Cc: Andrea Arcangeli Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 39 +++++++++++++++++++++++++++++++++++---- 1 file changed, 35 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e60837dc785c..bade35ef563b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -941,6 +941,37 @@ unlock: spin_unlock(ptl); } +/* + * Save CONFIG_DEBUG_PAGEALLOC from faulting falsely on tail pages + * during copy_user_huge_page()'s copy_page_rep(): in the case when + * the source page gets split and a tail freed before copy completes. + * Called under pmd_lock of checked pmd, so safe from splitting itself. + */ +static void get_user_huge_page(struct page *page) +{ + if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) { + struct page *endpage = page + HPAGE_PMD_NR; + + atomic_add(HPAGE_PMD_NR, &page->_count); + while (++page < endpage) + get_huge_page_tail(page); + } else { + get_page(page); + } +} + +static void put_user_huge_page(struct page *page) +{ + if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) { + struct page *endpage = page + HPAGE_PMD_NR; + + while (page < endpage) + put_page(page++); + } else { + put_page(page); + } +} + static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, @@ -1074,7 +1105,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, ret |= VM_FAULT_WRITE; goto out_unlock; } - get_page(page); + get_user_huge_page(page); spin_unlock(ptl); alloc: if (transparent_hugepage_enabled(vma) && @@ -1095,7 +1126,7 @@ alloc: split_huge_page(page); ret |= VM_FAULT_FALLBACK; } - put_page(page); + put_user_huge_page(page); } count_vm_event(THP_FAULT_FALLBACK); goto out; @@ -1105,7 +1136,7 @@ alloc: put_page(new_page); if (page) { split_huge_page(page); - put_page(page); + put_user_huge_page(page); } else split_huge_page_pmd(vma, address, pmd); ret |= VM_FAULT_FALLBACK; @@ -1127,7 +1158,7 @@ alloc: spin_lock(ptl); if (page) - put_page(page); + put_user_huge_page(page); if (unlikely(!pmd_same(*pmd, orig_pmd))) { spin_unlock(ptl); mem_cgroup_uncharge_page(new_page); -- cgit From f72e7dcdd25229446b102e587ef2f826f76bff28 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 23 Jun 2014 13:22:05 -0700 Subject: mm: let mm_find_pmd fix buggy race with THP fault Trinity has reported: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1)) CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W 3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398 lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151) remove_migration_pte (mm/migrate.c:137) rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699) remove_migration_ptes (mm/migrate.c:224) migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126) migrate_misplaced_page (mm/migrate.c:1733) __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925) handle_mm_fault (mm/memory.c:3948) __get_user_pages (mm/memory.c:1851) __mlock_vma_pages_range (mm/mlock.c:255) __mm_populate (mm/mlock.c:711) SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791) I believe this comes about because, whereas collapsing and splitting THP functions take anon_vma lock in write mode (which excludes concurrent rmap walks), faulting THP functions (write protection and misplaced NUMA) do not - and mostly they do not need to. But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for an instant (indeed, for a long instant, given the inter-CPU TLB flush in there), leaves *pmd neither present not trans_huge. Which can confuse a concurrent rmap walk, as when removing migration ptes, seen in the dumped trace. Although that rmap walk has a 4k page to insert, anon_vmas containing THPs are in no way segregated from 4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with that instant when a trans_huge pmd is temporarily absent. I don't think we need strengthen the locking at the THP end: it's easily handled with an ACCESS_ONCE() before testing both conditions. And since mm_find_pmd() had only one caller who wanted a THP rather than a pmd, let's slightly repurpose it to fail when it hits a THP or non-present pmd, and open code split_huge_page_address() again. Signed-off-by: Hugh Dickins Reported-by: Sasha Levin Acked-by: Kirill A. Shutemov Cc: Konstantin Khlebnikov Cc: Mel Gorman Cc: Bob Liu Cc: Christoph Lameter Cc: Dave Jones Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 18 ++++++++++++------ mm/ksm.c | 1 - mm/migrate.c | 2 -- mm/rmap.c | 12 ++++++++---- 4 files changed, 20 insertions(+), 13 deletions(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bade35ef563b..33514d88fef9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2423,8 +2423,6 @@ static void collapse_huge_page(struct mm_struct *mm, pmd = mm_find_pmd(mm, address); if (!pmd) goto out; - if (pmd_trans_huge(*pmd)) - goto out; anon_vma_lock_write(vma->anon_vma); @@ -2523,8 +2521,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, pmd = mm_find_pmd(mm, address); if (!pmd) goto out; - if (pmd_trans_huge(*pmd)) - goto out; memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load)); pte = pte_offset_map_lock(mm, pmd, address, &ptl); @@ -2877,12 +2873,22 @@ void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address, static void split_huge_page_address(struct mm_struct *mm, unsigned long address) { + pgd_t *pgd; + pud_t *pud; pmd_t *pmd; VM_BUG_ON(!(address & ~HPAGE_PMD_MASK)); - pmd = mm_find_pmd(mm, address); - if (!pmd) + pgd = pgd_offset(mm, address); + if (!pgd_present(*pgd)) + return; + + pud = pud_offset(pgd, address); + if (!pud_present(*pud)) + return; + + pmd = pmd_offset(pud, address); + if (!pmd_present(*pmd)) return; /* * Caller holds the mmap_sem write mode, so a huge pmd cannot diff --git a/mm/ksm.c b/mm/ksm.c index 68710e80994a..346ddc9e4c0d 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -945,7 +945,6 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, pmd = mm_find_pmd(mm, addr); if (!pmd) goto out; - BUG_ON(pmd_trans_huge(*pmd)); mmun_start = addr; mmun_end = addr + PAGE_SIZE; diff --git a/mm/migrate.c b/mm/migrate.c index 63f0cd559999..9e0beaa91845 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -120,8 +120,6 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma, pmd = mm_find_pmd(mm, addr); if (!pmd) goto out; - if (pmd_trans_huge(*pmd)) - goto out; ptep = pte_offset_map(pmd, addr); diff --git a/mm/rmap.c b/mm/rmap.c index bf05fc872ae8..b7e94ebbd09e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -569,6 +569,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address) pgd_t *pgd; pud_t *pud; pmd_t *pmd = NULL; + pmd_t pmde; pgd = pgd_offset(mm, address); if (!pgd_present(*pgd)) @@ -579,7 +580,13 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address) goto out; pmd = pmd_offset(pud, address); - if (!pmd_present(*pmd)) + /* + * Some THP functions use the sequence pmdp_clear_flush(), set_pmd_at() + * without holding anon_vma lock for write. So when looking for a + * genuine pmde (in which to find pte), test present and !THP together. + */ + pmde = ACCESS_ONCE(*pmd); + if (!pmd_present(pmde) || pmd_trans_huge(pmde)) pmd = NULL; out: return pmd; @@ -615,9 +622,6 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm, if (!pmd) return NULL; - if (pmd_trans_huge(*pmd)) - return NULL; - pte = pte_offset_map(pmd, address); /* Make a quick check before getting the lock */ if (!sync && !pte_present(*pte)) { -- cgit From f00cdc6df7d7cfcabb5b740911e6788cb0802bdb Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 23 Jun 2014 13:22:06 -0700 Subject: shmem: fix faulting into a hole while it's punched Trinity finds that mmap access to a hole while it's punched from shmem can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE) from completing, until the reader chooses to stop; with the puncher's hold on i_mutex locking out all other writers until it can complete. It appears that the tmpfs fault path is too light in comparison with its hole-punching path, lacking an i_data_sem to obstruct it; but we don't want to slow down the common case. Extend shmem_fallocate()'s existing range notification mechanism, so shmem_fault() can refrain from faulting pages into the hole while it's punched, waiting instead on i_mutex (when safe to sleep; or repeatedly faulting when not). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hugh Dickins Reported-by: Sasha Levin Tested-by: Sasha Levin Cc: Dave Jones Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/shmem.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 52 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/shmem.c b/mm/shmem.c index 91b912b37b8c..8f419cff9e34 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -80,11 +80,12 @@ static struct vfsmount *shm_mnt; #define SHORT_SYMLINK_LEN 128 /* - * shmem_fallocate and shmem_writepage communicate via inode->i_private - * (with i_mutex making sure that it has only one user at a time): - * we would prefer not to enlarge the shmem inode just for that. + * shmem_fallocate communicates with shmem_fault or shmem_writepage via + * inode->i_private (with i_mutex making sure that it has only one user at + * a time): we would prefer not to enlarge the shmem inode just for that. */ struct shmem_falloc { + int mode; /* FALLOC_FL mode currently operating */ pgoff_t start; /* start of range currently being fallocated */ pgoff_t next; /* the next page offset to be fallocated */ pgoff_t nr_falloced; /* how many new pages have been fallocated */ @@ -759,6 +760,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) spin_lock(&inode->i_lock); shmem_falloc = inode->i_private; if (shmem_falloc && + !shmem_falloc->mode && index >= shmem_falloc->start && index < shmem_falloc->next) shmem_falloc->nr_unswapped++; @@ -1233,6 +1235,44 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) int error; int ret = VM_FAULT_LOCKED; + /* + * Trinity finds that probing a hole which tmpfs is punching can + * prevent the hole-punch from ever completing: which in turn + * locks writers out with its hold on i_mutex. So refrain from + * faulting pages into the hole while it's being punched, and + * wait on i_mutex to be released if vmf->flags permits. + */ + if (unlikely(inode->i_private)) { + struct shmem_falloc *shmem_falloc; + + spin_lock(&inode->i_lock); + shmem_falloc = inode->i_private; + if (!shmem_falloc || + shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE || + vmf->pgoff < shmem_falloc->start || + vmf->pgoff >= shmem_falloc->next) + shmem_falloc = NULL; + spin_unlock(&inode->i_lock); + /* + * i_lock has protected us from taking shmem_falloc seriously + * once return from shmem_fallocate() went back up that stack. + * i_lock does not serialize with i_mutex at all, but it does + * not matter if sometimes we wait unnecessarily, or sometimes + * miss out on waiting: we just need to make those cases rare. + */ + if (shmem_falloc) { + if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) && + !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) { + up_read(&vma->vm_mm->mmap_sem); + mutex_lock(&inode->i_mutex); + mutex_unlock(&inode->i_mutex); + return VM_FAULT_RETRY; + } + /* cond_resched? Leave that to GUP or return to user */ + return VM_FAULT_NOPAGE; + } + } + error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret); if (error) return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS); @@ -1729,18 +1769,26 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, mutex_lock(&inode->i_mutex); + shmem_falloc.mode = mode & ~FALLOC_FL_KEEP_SIZE; + if (mode & FALLOC_FL_PUNCH_HOLE) { struct address_space *mapping = file->f_mapping; loff_t unmap_start = round_up(offset, PAGE_SIZE); loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1; + shmem_falloc.start = unmap_start >> PAGE_SHIFT; + shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; + spin_lock(&inode->i_lock); + inode->i_private = &shmem_falloc; + spin_unlock(&inode->i_lock); + if ((u64)unmap_end > (u64)unmap_start) unmap_mapping_range(mapping, unmap_start, 1 + unmap_end - unmap_start, 0); shmem_truncate_range(inode, offset, offset + len - 1); /* No need to unmap again: hole-punching leaves COWed pages */ error = 0; - goto out; + goto undone; } /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */ -- cgit From 03787301420376ae41fbaf4267f4a6253d152ac5 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Mon, 23 Jun 2014 13:22:06 -0700 Subject: slab: fix oops when reading /proc/slab_allocators Commit b1cb0982bdd6 ("change the management method of free objects of the slab") introduced a bug on slab leak detector ('/proc/slab_allocators'). This detector works like as following decription. 1. traverse all objects on all the slabs. 2. determine whether it is active or not. 3. if active, print who allocate this object. but that commit changed the way how to manage free objects, so the logic determining whether it is active or not is also changed. In before, we regard object in cpu caches as inactive one, but, with this commit, we mistakenly regard object in cpu caches as active one. This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who corrupt free memory in the slab. It unmaps page table mapping if object is free and map it if object is active. When slab leak detector check object in cpu caches, it mistakenly think this object active so try to access object memory to retrieve caller of allocation. At this point, page table mapping to this object doesn't exist, so oops occurs. Following is oops message reported from Dave. It blew up when something tried to read /proc/slab_allocators (Just cat it, and you should see the oops below) Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: [snip...] CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131 task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000 RIP: 0010:[] [] handle_slab+0x8a/0x180 RSP: 0018:ffff880076925de0 EFLAGS: 00010002 RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7 RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000 RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000 R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0 FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0 DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602 Call Trace: leaks_show+0xce/0x240 seq_read+0x28e/0x490 proc_reg_read+0x3d/0x80 vfs_read+0x9b/0x160 SyS_read+0x58/0xb0 tracesys+0xd4/0xd9 Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46 RIP handle_slab+0x8a/0x180 To fix the problem, I introduce an object status buffer on each slab. With this, we can track object status precisely, so slab leak detector would not access active object and no kernel oops would occur. Memory overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK which is mainly used for debugging, so memory overhead isn't big problem. Signed-off-by: Joonsoo Kim Reported-by: Dave Jones Reported-by: Tetsuo Handa Reviewed-by: Vladimir Davydov Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slab.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 71 insertions(+), 19 deletions(-) (limited to 'mm') diff --git a/mm/slab.c b/mm/slab.c index 9ca3b87edabc..3070b929a1bf 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -386,6 +386,39 @@ static void **dbg_userword(struct kmem_cache *cachep, void *objp) #endif +#define OBJECT_FREE (0) +#define OBJECT_ACTIVE (1) + +#ifdef CONFIG_DEBUG_SLAB_LEAK + +static void set_obj_status(struct page *page, int idx, int val) +{ + int freelist_size; + char *status; + struct kmem_cache *cachep = page->slab_cache; + + freelist_size = cachep->num * sizeof(freelist_idx_t); + status = (char *)page->freelist + freelist_size; + status[idx] = val; +} + +static inline unsigned int get_obj_status(struct page *page, int idx) +{ + int freelist_size; + char *status; + struct kmem_cache *cachep = page->slab_cache; + + freelist_size = cachep->num * sizeof(freelist_idx_t); + status = (char *)page->freelist + freelist_size; + + return status[idx]; +} + +#else +static inline void set_obj_status(struct page *page, int idx, int val) {} + +#endif + /* * Do not go above this order unless 0 objects fit into the slab or * overridden on the command line. @@ -576,12 +609,30 @@ static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep) return cachep->array[smp_processor_id()]; } +static size_t calculate_freelist_size(int nr_objs, size_t align) +{ + size_t freelist_size; + + freelist_size = nr_objs * sizeof(freelist_idx_t); + if (IS_ENABLED(CONFIG_DEBUG_SLAB_LEAK)) + freelist_size += nr_objs * sizeof(char); + + if (align) + freelist_size = ALIGN(freelist_size, align); + + return freelist_size; +} + static int calculate_nr_objs(size_t slab_size, size_t buffer_size, size_t idx_size, size_t align) { int nr_objs; + size_t remained_size; size_t freelist_size; + int extra_space = 0; + if (IS_ENABLED(CONFIG_DEBUG_SLAB_LEAK)) + extra_space = sizeof(char); /* * Ignore padding for the initial guess. The padding * is at most @align-1 bytes, and @buffer_size is at @@ -590,14 +641,15 @@ static int calculate_nr_objs(size_t slab_size, size_t buffer_size, * into the memory allocation when taking the padding * into account. */ - nr_objs = slab_size / (buffer_size + idx_size); + nr_objs = slab_size / (buffer_size + idx_size + extra_space); /* * This calculated number will be either the right * amount, or one greater than what we want. */ - freelist_size = slab_size - nr_objs * buffer_size; - if (freelist_size < ALIGN(nr_objs * idx_size, align)) + remained_size = slab_size - nr_objs * buffer_size; + freelist_size = calculate_freelist_size(nr_objs, align); + if (remained_size < freelist_size) nr_objs--; return nr_objs; @@ -635,7 +687,7 @@ static void cache_estimate(unsigned long gfporder, size_t buffer_size, } else { nr_objs = calculate_nr_objs(slab_size, buffer_size, sizeof(freelist_idx_t), align); - mgmt_size = ALIGN(nr_objs * sizeof(freelist_idx_t), align); + mgmt_size = calculate_freelist_size(nr_objs, align); } *num = nr_objs; *left_over = slab_size - nr_objs*buffer_size - mgmt_size; @@ -2041,13 +2093,16 @@ static size_t calculate_slab_order(struct kmem_cache *cachep, break; if (flags & CFLGS_OFF_SLAB) { + size_t freelist_size_per_obj = sizeof(freelist_idx_t); /* * Max number of objs-per-slab for caches which * use off-slab slabs. Needed to avoid a possible * looping condition in cache_grow(). */ + if (IS_ENABLED(CONFIG_DEBUG_SLAB_LEAK)) + freelist_size_per_obj += sizeof(char); offslab_limit = size; - offslab_limit /= sizeof(freelist_idx_t); + offslab_limit /= freelist_size_per_obj; if (num > offslab_limit) break; @@ -2294,8 +2349,7 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags) if (!cachep->num) return -E2BIG; - freelist_size = - ALIGN(cachep->num * sizeof(freelist_idx_t), cachep->align); + freelist_size = calculate_freelist_size(cachep->num, cachep->align); /* * If the slab has been placed off-slab, and we have enough space then @@ -2308,7 +2362,7 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags) if (flags & CFLGS_OFF_SLAB) { /* really off slab. No need for manual alignment */ - freelist_size = cachep->num * sizeof(freelist_idx_t); + freelist_size = calculate_freelist_size(cachep->num, 0); #ifdef CONFIG_PAGE_POISONING /* If we're going to use the generic kernel_map_pages() @@ -2612,6 +2666,7 @@ static void cache_init_objs(struct kmem_cache *cachep, if (cachep->ctor) cachep->ctor(objp); #endif + set_obj_status(page, i, OBJECT_FREE); set_free_obj(page, i, i); } } @@ -2820,6 +2875,7 @@ static void *cache_free_debugcheck(struct kmem_cache *cachep, void *objp, BUG_ON(objnr >= cachep->num); BUG_ON(objp != index_to_obj(cachep, page, objnr)); + set_obj_status(page, objnr, OBJECT_FREE); if (cachep->flags & SLAB_POISON) { #ifdef CONFIG_DEBUG_PAGEALLOC if ((cachep->size % PAGE_SIZE)==0 && OFF_SLAB(cachep)) { @@ -2953,6 +3009,8 @@ static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep, static void *cache_alloc_debugcheck_after(struct kmem_cache *cachep, gfp_t flags, void *objp, unsigned long caller) { + struct page *page; + if (!objp) return objp; if (cachep->flags & SLAB_POISON) { @@ -2983,6 +3041,9 @@ static void *cache_alloc_debugcheck_after(struct kmem_cache *cachep, *dbg_redzone1(cachep, objp) = RED_ACTIVE; *dbg_redzone2(cachep, objp) = RED_ACTIVE; } + + page = virt_to_head_page(objp); + set_obj_status(page, obj_to_index(cachep, page, objp), OBJECT_ACTIVE); objp += obj_offset(cachep); if (cachep->ctor && cachep->flags & SLAB_POISON) cachep->ctor(objp); @@ -4219,21 +4280,12 @@ static void handle_slab(unsigned long *n, struct kmem_cache *c, struct page *page) { void *p; - int i, j; + int i; if (n[0] == n[1]) return; for (i = 0, p = page->s_mem; i < c->num; i++, p += c->size) { - bool active = true; - - for (j = page->active; j < c->num; j++) { - /* Skip freed item */ - if (get_free_obj(page, j) == i) { - active = false; - break; - } - } - if (!active) + if (get_obj_status(page, i) != OBJECT_ACTIVE) continue; if (!add_caller(n, (unsigned long)*dbg_userword(c, p))) -- cgit From d05f0cdcbe6388723f1900c549b4850360545201 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 23 Jun 2014 13:22:07 -0700 Subject: mm: fix crashes from mbind() merging vmas In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") introduced vma merging to mbind(), but it should have also changed the convention of passing start vma from queue_pages_range() (formerly check_range()) to new_vma_page(): vma merging may have already freed that structure, resulting in BUG at mm/mempolicy.c:1738 and probably worse crashes. Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") Reported-by: Naoya Horiguchi Tested-by: Naoya Horiguchi Signed-off-by: Hugh Dickins Acked-by: Christoph Lameter Cc: KOSAKI Motohiro Cc: Minchan Kim Cc: [2.6.34+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 46 ++++++++++++++++++++-------------------------- 1 file changed, 20 insertions(+), 26 deletions(-) (limited to 'mm') diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 284974230459..eb58de19f815 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -656,19 +656,18 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma, * @nodes and @flags,) it's isolated and queued to the pagelist which is * passed via @private.) */ -static struct vm_area_struct * +static int queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, const nodemask_t *nodes, unsigned long flags, void *private) { - int err; - struct vm_area_struct *first, *vma, *prev; - + int err = 0; + struct vm_area_struct *vma, *prev; - first = find_vma(mm, start); - if (!first) - return ERR_PTR(-EFAULT); + vma = find_vma(mm, start); + if (!vma) + return -EFAULT; prev = NULL; - for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) { + for (; vma && vma->vm_start < end; vma = vma->vm_next) { unsigned long endvma = vma->vm_end; if (endvma > end) @@ -678,9 +677,9 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, if (!(flags & MPOL_MF_DISCONTIG_OK)) { if (!vma->vm_next && vma->vm_end < end) - return ERR_PTR(-EFAULT); + return -EFAULT; if (prev && prev->vm_end < vma->vm_start) - return ERR_PTR(-EFAULT); + return -EFAULT; } if (flags & MPOL_MF_LAZY) { @@ -694,15 +693,13 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, err = queue_pages_pgd_range(vma, start, endvma, nodes, flags, private); - if (err) { - first = ERR_PTR(err); + if (err) break; - } } next: prev = vma; } - return first; + return err; } /* @@ -1156,16 +1153,17 @@ out: /* * Allocate a new page for page migration based on vma policy. - * Start assuming that page is mapped by vma pointed to by @private. + * Start by assuming the page is mapped by the same vma as contains @start. * Search forward from there, if not. N.B., this assumes that the * list of pages handed to migrate_pages()--which is how we get here-- * is in virtual address order. */ -static struct page *new_vma_page(struct page *page, unsigned long private, int **x) +static struct page *new_page(struct page *page, unsigned long start, int **x) { - struct vm_area_struct *vma = (struct vm_area_struct *)private; + struct vm_area_struct *vma; unsigned long uninitialized_var(address); + vma = find_vma(current->mm, start); while (vma) { address = page_address_in_vma(page, vma); if (address != -EFAULT) @@ -1195,7 +1193,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, return -ENOSYS; } -static struct page *new_vma_page(struct page *page, unsigned long private, int **x) +static struct page *new_page(struct page *page, unsigned long start, int **x) { return NULL; } @@ -1205,7 +1203,6 @@ static long do_mbind(unsigned long start, unsigned long len, unsigned short mode, unsigned short mode_flags, nodemask_t *nmask, unsigned long flags) { - struct vm_area_struct *vma; struct mm_struct *mm = current->mm; struct mempolicy *new; unsigned long end; @@ -1271,11 +1268,9 @@ static long do_mbind(unsigned long start, unsigned long len, if (err) goto mpol_out; - vma = queue_pages_range(mm, start, end, nmask, + err = queue_pages_range(mm, start, end, nmask, flags | MPOL_MF_INVERT, &pagelist); - - err = PTR_ERR(vma); /* maybe ... */ - if (!IS_ERR(vma)) + if (!err) err = mbind_range(mm, start, end, new); if (!err) { @@ -1283,9 +1278,8 @@ static long do_mbind(unsigned long start, unsigned long len, if (!list_empty(&pagelist)) { WARN_ON_ONCE(flags & MPOL_MF_LAZY); - nr_failed = migrate_pages(&pagelist, new_vma_page, - NULL, (unsigned long)vma, - MIGRATE_SYNC, MR_MEMPOLICY_MBIND); + nr_failed = migrate_pages(&pagelist, new_page, NULL, + start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND); if (nr_failed) putback_movable_pages(&pagelist); } -- cgit From 391acf970d21219a2a5446282d3b20eace0c0d7a Mon Sep 17 00:00:00 2001 From: Gu Zheng Date: Wed, 25 Jun 2014 09:57:18 +0800 Subject: cpuset,mempolicy: fix sleeping function called from invalid context When runing with the kernel(3.15-rc7+), the follow bug occurs: [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586 [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python [ 9969.441175] INFO: lockdep is turned off. [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85 [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012 [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18 [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000 [ 9969.974071] Call Trace: [ 9970.003403] [] dump_stack+0x4d/0x66 [ 9970.065074] [] __might_sleep+0xfa/0x130 [ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0 [ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210 [ 9970.272610] [] cpuset_mems_allowed+0x27/0x140 [ 9970.344584] [] ? __mpol_dup+0x63/0x150 [ 9970.409282] [] __mpol_dup+0xe5/0x150 [ 9970.471897] [] ? __mpol_dup+0x63/0x150 [ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40 [ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10 [ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50 [ 9970.759795] [] copy_process.part.23+0x670/0x1d40 [ 9970.834885] [] do_fork+0xd8/0x380 [ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0 [ 9970.969470] [] SyS_clone+0x16/0x20 [ 9971.030011] [] stub_clone+0x69/0x90 [ 9971.091573] [] ? system_call_fastpath+0x16/0x1b The cause is that cpuset_mems_allowed() try to take mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in __mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock protection region to protect the access to cpuset only in current_cpuset_is_being_rebound(). So that we can avoid this bug. This patch is a temporary solution that just addresses the bug mentioned above, can not fix the long-standing issue about cpuset.mems rebinding on fork(): "When the forker's task_struct is duplicated (which includes ->mems_allowed) and it races with an update to cpuset_being_rebound in update_tasks_nodemask() then the task's mems_allowed doesn't get updated. And the child task's mems_allowed can be wrong if the cpuset's nodemask changes before the child has been added to the cgroup's tasklist." Signed-off-by: Gu Zheng Acked-by: Li Zefan Signed-off-by: Tejun Heo Cc: stable --- mm/mempolicy.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'mm') diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 284974230459..9a3783ccff67 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2145,7 +2145,6 @@ struct mempolicy *__mpol_dup(struct mempolicy *old) } else *new = *old; - rcu_read_lock(); if (current_cpuset_is_being_rebound()) { nodemask_t mems = cpuset_mems_allowed(current); if (new->flags & MPOL_F_REBINDING) @@ -2153,7 +2152,6 @@ struct mempolicy *__mpol_dup(struct mempolicy *old) else mpol_rebind_policy(new, &mems, MPOL_REBIND_ONCE); } - rcu_read_unlock(); atomic_set(&new->refcnt, 1); return new; } -- cgit From dc78327c0ea7da5186d8cbc1647bd6088c5c9fa5 Mon Sep 17 00:00:00 2001 From: Michal Nazarewicz Date: Wed, 2 Jul 2014 15:22:35 -0700 Subject: mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE, the following is triggered at early boot: SMP: Total of 8 processors activated. devtmpfs: initialized Unable to handle kernel NULL pointer dereference at virtual address 00000008 pgd = fffffe0000050000 [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407 Internal error: Oops: 96000006 [#1] SMP Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44 task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000 PC is at __list_add+0x10/0xd4 LR is at free_one_page+0x270/0x638 ... Call trace: __list_add+0x10/0xd4 free_one_page+0x26c/0x638 __free_pages_ok.part.52+0x84/0xbc __free_pages+0x74/0xbc init_cma_reserved_pageblock+0xe8/0x104 cma_init_reserved_areas+0x190/0x1e4 do_one_initcall+0xc4/0x154 kernel_init_freeable+0x204/0x2a8 kernel_init+0xc/0xd4 This happens because init_cma_reserved_pageblock() calls __free_one_page() with pageblock_order as page order but it is bigger than MAX_ORDER. This in turn causes accesses past zone->free_list[]. Fix the problem by changing init_cma_reserved_pageblock() such that it splits pageblock into individual MAX_ORDER pages if pageblock is bigger than a MAX_ORDER page. In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all architectures expect for ia64, powerpc and tile at the moment, the “pageblock_order > MAX_ORDER” condition will be optimised out since both sides of the operator are constants. In cases where pageblock size is variable, the performance degradation should not be significant anyway since init_cma_reserved_pageblock() is called only at boot time at most MAX_CMA_AREAS times which by default is eight. Signed-off-by: Michal Nazarewicz Reported-by: Mark Salter Tested-by: Mark Salter Tested-by: Christopher Covington Cc: Mel Gorman Cc: David Rientjes Cc: Marek Szyprowski Cc: Catalin Marinas Cc: [3.5+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 20d17f8266fe..0ea758b898fd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -816,9 +816,21 @@ void __init init_cma_reserved_pageblock(struct page *page) set_page_count(p, 0); } while (++p, --i); - set_page_refcounted(page); set_pageblock_migratetype(page, MIGRATE_CMA); - __free_pages(page, pageblock_order); + + if (pageblock_order >= MAX_ORDER) { + i = pageblock_nr_pages; + p = page; + do { + set_page_refcounted(p); + __free_pages(p, MAX_ORDER - 1); + p += MAX_ORDER_NR_PAGES; + } while (i -= MAX_ORDER_NR_PAGES); + } else { + set_page_refcounted(page); + __free_pages(page, pageblock_order); + } + adjust_managed_page_count(page, pageblock_nr_pages); } #endif -- cgit From 8a5b20aebaa3d0ade5b8381e64d35fb777b7b355 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Wed, 2 Jul 2014 15:22:35 -0700 Subject: slub: fix off by one in number of slab tests min_partial means minimum number of slab cached in node partial list. So, if nr_partial is less than it, we keep newly empty slab on node partial list rather than freeing it. But if nr_partial is equal or greater than it, it means that we have enough partial slabs so should free newly empty slab. Current implementation missed the equal case so if we set min_partial is 0, then, at least one slab could be cached. This is critical problem to kmemcg destroying logic because it doesn't works properly if some slabs is cached. This patch fixes this problem. Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs immediately"). Signed-off-by: Joonsoo Kim Acked-by: Vladimir Davydov Cc: Christoph Lameter Cc: Pekka Enberg Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/slub.c b/mm/slub.c index b2b047327d76..73004808537e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1881,7 +1881,7 @@ redo: new.frozen = 0; - if (!new.inuse && n->nr_partial > s->min_partial) + if (!new.inuse && n->nr_partial >= s->min_partial) m = M_FREE; else if (new.freelist) { m = M_PARTIAL; @@ -1992,7 +1992,7 @@ static void unfreeze_partials(struct kmem_cache *s, new.freelist, new.counters, "unfreezing slab")); - if (unlikely(!new.inuse && n->nr_partial > s->min_partial)) { + if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) { page->next = discard_page; discard_page = page; } else { @@ -2620,7 +2620,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, return; } - if (unlikely(!new.inuse && n->nr_partial > s->min_partial)) + if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) goto slab_empty; /* -- cgit From 496a8e68654a5f42db90c650be305dcb50bfebdb Mon Sep 17 00:00:00 2001 From: Namjae Jeon Date: Wed, 2 Jul 2014 15:22:36 -0700 Subject: msync: fix incorrect fstart calculation Fix a regression caused by 7fc34a62ca44 ("mm/msync.c: sync only the requested range in msync()"). xfstests generic/075 fail occured on ext4 data=journal mode because the intended range was not syncing due to wrong fstart calculation. Signed-off-by: Namjae Jeon Signed-off-by: Ashish Sangwan Reported-by: Eric Whitney Tested-by: Eric Whitney Acked-by: Matthew Wilcox Reviewed-by: Lukas Czerner Tested-by: Lukas Czerner Reviewed-by: Theodore Ts'o Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/msync.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/msync.c b/mm/msync.c index a5c673669ca6..992a1673d488 100644 --- a/mm/msync.c +++ b/mm/msync.c @@ -78,7 +78,8 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags) goto out_unlock; } file = vma->vm_file; - fstart = start + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); + fstart = (start - vma->vm_start) + + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); fend = fstart + (min(end, vma->vm_end) - start) - 1; start = vma->vm_end; if ((flags & MS_SYNC) && file && -- cgit From 0bc1f8b0682caa39f45ce1e0228ebf43acb46111 Mon Sep 17 00:00:00 2001 From: Chen Yucong Date: Wed, 2 Jul 2014 15:22:37 -0700 Subject: hwpoison: fix the handling path of the victimized page frame that belong to non-LRU Until now, the kernel has the same policy to handle victimized page frames that belong to kernel-space(reserved/slab-subsystem) or non-LRU(unknown page state). In other word, the result of handling either of these victimized page frames is (IGNORED | FAILED), and the return value of memory_failure() is -EBUSY. This patch is to avoid that memory_failure() returns very soon due to the "true" value of (!PageLRU(p)), and it also ensures that action_result() can report more precise information("reserved kernel", "kernel slab", and "unknown page state") instead of "non LRU", especially for memory errors which are detected by memory-scrubbing. Andi said: : While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON: : : soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000 : page:ffffea000015b400 count:3 mapcount:2097169 mapping: (null) index:0xffff8800056d7000 : page flags: 0x3ffff800004081(locked|slab|head) : ------------[ cut here ]------------ : kernel BUG at mm/rmap.c:1495! : : I think what happened is that a LRU page turned into a slab page in : parallel with offlining. memory_failure initially tests for this case, : but doesn't retest later after the page has been locked. : : ... : : I ran this patch in a loop over night with some stress plus : the mcelog test suite running in a loop. I cannot guarantee it hit it, : but it should have given it a good beating. : : The kernel survived with no messages, although the mcelog test suite : got killed at some point because it couldn't fork anymore. Probably : some unrelated problem. : : So the patch is ok for me for .16. Signed-off-by: Chen Yucong Acked-by: Naoya Horiguchi Reported-by: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory-failure.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/memory-failure.c b/mm/memory-failure.c index cd8989c1027e..c6399e328931 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -895,7 +895,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn, struct page *hpage = *hpagep; struct page *ppage; - if (PageReserved(p) || PageSlab(p)) + if (PageReserved(p) || PageSlab(p) || !PageLRU(p)) return SWAP_SUCCESS; /* @@ -1159,9 +1159,6 @@ int memory_failure(unsigned long pfn, int trapno, int flags) action_result(pfn, "free buddy, 2nd try", DELAYED); return 0; } - action_result(pfn, "non LRU", IGNORED); - put_page(p); - return -EBUSY; } } @@ -1194,6 +1191,9 @@ int memory_failure(unsigned long pfn, int trapno, int flags) return 0; } + if (!PageHuge(p) && !PageTransTail(p) && !PageLRU(p)) + goto identify_page_state; + /* * For error on the tail page, we should set PG_hwpoison * on the head page to show that the hugepage is hwpoisoned @@ -1243,6 +1243,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags) goto out; } +identify_page_state: res = -EBUSY; /* * The first check uses the current page flags which may not have any -- cgit From 66d2f4d28cd030220e7ea2a628993fcabcb956d1 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 2 Jul 2014 15:22:38 -0700 Subject: shmem: fix init_page_accessed use to stop !PageLRU bug Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU) in isolate_lru_pages() at mm/vmscan.c:1281! Commit 2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible") looks like interrupted work-in-progress. mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's - shmem_write_begin() is clearly wrong to use it after shmem_getpage(), when the page is always visible in radix_tree, and often already on LRU. Revert change to shmem_write_begin(), and use init_page_accessed() or mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp(). SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed() before; but since many other filesystems use [__]page_symlink(), which did and does mark the page accessed, consider this as rectifying an oversight. Signed-off-by: Hugh Dickins Acked-by: Mel Gorman Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Michal Hocko Cc: Dave Hansen Cc: Prabhakar Lad Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/shmem.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/shmem.c b/mm/shmem.c index 8f419cff9e34..1140f49b6ded 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1029,6 +1029,9 @@ repeat: goto failed; } + if (page && sgp == SGP_WRITE) + mark_page_accessed(page); + /* fallocated page? */ if (page && !PageUptodate(page)) { if (sgp != SGP_READ) @@ -1110,6 +1113,9 @@ repeat: shmem_recalc_inode(inode); spin_unlock(&info->lock); + if (sgp == SGP_WRITE) + mark_page_accessed(page); + delete_from_swap_cache(page); set_page_dirty(page); swap_free(swap); @@ -1136,6 +1142,9 @@ repeat: __SetPageSwapBacked(page); __set_page_locked(page); + if (sgp == SGP_WRITE) + init_page_accessed(page); + error = mem_cgroup_charge_file(page, current->mm, gfp & GFP_RECLAIM_MASK); if (error) @@ -1412,13 +1421,9 @@ shmem_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata) { - int ret; struct inode *inode = mapping->host; pgoff_t index = pos >> PAGE_CACHE_SHIFT; - ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); - if (ret == 0 && *pagep) - init_page_accessed(*pagep); - return ret; + return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); } static int -- cgit