From 2be7cfed995e25de1ffaffe14cc065b7ffb528e0 Mon Sep 17 00:00:00 2001 From: Daniel Jordan Date: Wed, 2 Aug 2017 13:31:47 -0700 Subject: mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors Commit 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain errors from follow_hugetlb_page. After such error, __get_user_pages subsequently calls faultin_page on the same VMA and start address that follow_hugetlb_page failed on instead of returning the error immediately as it should. In follow_hugetlb_page, when hugetlb_fault returns a value covered under VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages to 0 as __get_user_pages expects in this case, which causes the following to happen in __get_user_pages: the "while (nr_pages)" check succeeds, we skip the "if (!vma..." check because we got a VMA the last time around, we find no page with follow_page_mask, and we call faultin_page, which calls hugetlb_fault for the second time. This issue also slightly changes how __get_user_pages works. Before, it only returned error if it had made no progress (i = 0). But now, follow_hugetlb_page can clobber "i" with an error code since its new return path doesn't check for progress. So if "i" is nonzero before a failing call to follow_hugetlb_page, that indication of progress is lost and __get_user_pages can return error even if some pages were successfully pinned. To fix this, change follow_hugetlb_page so that it updates nr_pages, allowing __get_user_pages to fail immediately and restoring the "error only if no progress" behavior to __get_user_pages. Tested that __get_user_pages returns when expected on error from hugetlb_fault in follow_hugetlb_page. Fixes: 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified") Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.com Signed-off-by: Daniel Jordan Acked-by: Punit Agrawal Cc: Andrea Arcangeli Cc: "Aneesh Kumar K.V" Cc: Gerald Schaefer Cc: James Morse Cc: "Kirill A. Shutemov" Cc: Michal Hocko Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: zhong jiang Cc: [4.12.x] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index bc48ee783dd9..a1a0ac0ad6f6 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4078,6 +4078,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long vaddr = *position; unsigned long remainder = *nr_pages; struct hstate *h = hstate_vma(vma); + int err = -EFAULT; while (vaddr < vma->vm_end && remainder) { pte_t *pte; @@ -4154,11 +4155,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, } ret = hugetlb_fault(mm, vma, vaddr, fault_flags); if (ret & VM_FAULT_ERROR) { - int err = vm_fault_to_errno(ret, flags); - - if (err) - return err; - + err = vm_fault_to_errno(ret, flags); remainder = 0; break; } @@ -4213,7 +4210,7 @@ same_page: */ *position = vaddr; - return i ? i : -EFAULT; + return i ? i : err; } #ifndef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE -- cgit From 3ea277194daaeaa84ce75180ec7c7a2075027a68 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 2 Aug 2017 13:31:52 -0700 Subject: mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries Nadav Amit identified a theoritical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows: CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush() The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind such as munmap, mremap and madvise. For some operations like mprotect, it's not necessarily a data integrity issue but it is a correctness issue as there is a window where an mprotect that limits access still allows access. For munmap, it's potentially a data integrity issue although the race is massive as an munmap, mmap and return to userspace must all complete between the window when reclaim drops the PTL and flushes the TLB. However, it's theoritically possible so handle this issue by flushing the mm if reclaim is potentially currently batching TLB flushes. Other instances where a flush is required for a present pte should be ok as either the page lock is held preventing parallel reclaim or a page reference count is elevated preventing a parallel free leading to corruption. In the case of page_mkclean there isn't an obvious path that userspace could take advantage of without using the operations that are guarded by this patch. Other users such as gup as a race with reclaim looks just at PTEs. huge page variants should be ok as they don't race with reclaim. mincore only looks at PTEs. userfault also should be ok as if a parallel reclaim takes place, it will either fault the page back in or read some of the data before the flush occurs triggering a fault. Note that a variant of this patch was acked by Andy Lutomirski but this was for the x86 parts on top of his PCID work which didn't make the 4.13 merge window as expected. His ack is dropped from this version and there will be a follow-on patch on top of PCID that will include his ack. [akpm@linux-foundation.org: tweak comments] [akpm@linux-foundation.org: fix spello] Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de Reported-by: Nadav Amit Signed-off-by: Mel Gorman Cc: Andy Lutomirski Cc: [v4.4+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 5 ++++- mm/madvise.c | 1 + mm/memory.c | 1 + mm/mprotect.c | 1 + mm/mremap.c | 1 + mm/rmap.c | 36 ++++++++++++++++++++++++++++++++++++ 6 files changed, 44 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/internal.h b/mm/internal.h index 24d88f084705..4ef49fc55e58 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq; #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH void try_to_unmap_flush(void); void try_to_unmap_flush_dirty(void); +void flush_tlb_batched_pending(struct mm_struct *mm); #else static inline void try_to_unmap_flush(void) { @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void) static inline void try_to_unmap_flush_dirty(void) { } - +static inline void flush_tlb_batched_pending(struct mm_struct *mm) +{ +} #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */ extern const struct trace_print_flags pageflag_names[]; diff --git a/mm/madvise.c b/mm/madvise.c index 9976852f1e1c..47d8d8a25eae 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -320,6 +320,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, tlb_remove_check_page_size_change(tlb, PAGE_SIZE); orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); for (; addr != end; pte++, addr += PAGE_SIZE) { ptent = *pte; diff --git a/mm/memory.c b/mm/memory.c index 0e517be91a89..f65beaad319b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1197,6 +1197,7 @@ again: init_rss_vec(rss); start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); pte = start_pte; + flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { pte_t ptent = *pte; diff --git a/mm/mprotect.c b/mm/mprotect.c index 1a8c9ca83e48..4180ad8cc9c5 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -64,6 +64,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, atomic_read(&vma->vm_mm->mm_users) == 1) target_node = numa_node_id(); + flush_tlb_batched_pending(vma->vm_mm); arch_enter_lazy_mmu_mode(); do { oldpte = *pte; diff --git a/mm/mremap.c b/mm/mremap.c index cd8a1b199ef9..6e3d857458de 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -152,6 +152,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, new_ptl = pte_lockptr(mm, new_pmd); if (new_ptl != old_ptl) spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); + flush_tlb_batched_pending(vma->vm_mm); arch_enter_lazy_mmu_mode(); for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, diff --git a/mm/rmap.c b/mm/rmap.c index ced14f1af6dc..c8993c63eb25 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -604,6 +604,13 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable) arch_tlbbatch_add_mm(&tlb_ubc->arch, mm); tlb_ubc->flush_required = true; + /* + * Ensure compiler does not re-order the setting of tlb_flush_batched + * before the PTE is cleared. + */ + barrier(); + mm->tlb_flush_batched = true; + /* * If the PTE was dirty then it's best to assume it's writable. The * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush() @@ -631,6 +638,35 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) return should_defer; } + +/* + * Reclaim unmaps pages under the PTL but do not flush the TLB prior to + * releasing the PTL if TLB flushes are batched. It's possible for a parallel + * operation such as mprotect or munmap to race between reclaim unmapping + * the page and flushing the page. If this race occurs, it potentially allows + * access to data via a stale TLB entry. Tracking all mm's that have TLB + * batching in flight would be expensive during reclaim so instead track + * whether TLB batching occurred in the past and if so then do a flush here + * if required. This will cost one additional flush per reclaim cycle paid + * by the first operation at risk such as mprotect and mumap. + * + * This must be called under the PTL so that an access to tlb_flush_batched + * that is potentially a "reclaim vs mprotect/munmap/etc" race will synchronise + * via the PTL. + */ +void flush_tlb_batched_pending(struct mm_struct *mm) +{ + if (mm->tlb_flush_batched) { + flush_tlb_mm(mm); + + /* + * Do not allow the compiler to re-order the clearing of + * tlb_flush_batched before the tlb is flushed. + */ + barrier(); + mm->tlb_flush_batched = false; + } +} #else static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable) { -- cgit From b22823719302e88d0e2a6bb06433bd97b175a8d8 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 2 Aug 2017 13:31:55 -0700 Subject: userfaultfd: non-cooperative: notify about unmap of destination during mremap When mremap is called with MREMAP_FIXED it unmaps memory at the destination address without notifying userfaultfd monitor. If the destination were registered with userfaultfd, the monitor has no way to distinguish between the old and new ranges and to properly relate the page faults that would occur in the destination region. Fixes: 897ab3e0c49e ("userfaultfd: non-cooperative: add event for memory unmaps") Link: http://lkml.kernel.org/r/1500276876-3350-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport Acked-by: Pavel Emelyanov Cc: Andrea Arcangeli Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mremap.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/mremap.c b/mm/mremap.c index 6e3d857458de..3f23715d3c69 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -429,6 +429,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr, static unsigned long mremap_to(unsigned long addr, unsigned long old_len, unsigned long new_addr, unsigned long new_len, bool *locked, struct vm_userfaultfd_ctx *uf, + struct list_head *uf_unmap_early, struct list_head *uf_unmap) { struct mm_struct *mm = current->mm; @@ -447,7 +448,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len, if (addr + old_len > new_addr && new_addr + new_len > addr) goto out; - ret = do_munmap(mm, new_addr, new_len, NULL); + ret = do_munmap(mm, new_addr, new_len, uf_unmap_early); if (ret) goto out; @@ -515,6 +516,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, unsigned long charged = 0; bool locked = false; struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX; + LIST_HEAD(uf_unmap_early); LIST_HEAD(uf_unmap); if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE)) @@ -542,7 +544,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, if (flags & MREMAP_FIXED) { ret = mremap_to(addr, old_len, new_addr, new_len, - &locked, &uf, &uf_unmap); + &locked, &uf, &uf_unmap_early, &uf_unmap); goto out; } @@ -622,6 +624,7 @@ out: up_write(¤t->mm->mmap_sem); if (locked && new_len > old_len) mm_populate(new_addr + old_len, new_len - old_len); + userfaultfd_unmap_complete(mm, &uf_unmap_early); mremap_userfaultfd_complete(&uf, addr, new_addr, old_len); userfaultfd_unmap_complete(mm, &uf_unmap); return ret; -- cgit From e7701557bfdd81ff44cab13a80439319a735d8e2 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Wed, 2 Aug 2017 13:31:58 -0700 Subject: kasan: avoid -Wmaybe-uninitialized warning gcc-7 produces this warning: mm/kasan/report.c: In function 'kasan_report': mm/kasan/report.c:351:3: error: 'info.first_bad_addr' may be used uninitialized in this function [-Werror=maybe-uninitialized] print_shadow_for_address(info->first_bad_addr); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ mm/kasan/report.c:360:27: note: 'info.first_bad_addr' was declared here The code seems fine as we only print info.first_bad_addr when there is a shadow, and we always initialize it in that case, but this is relatively hard for gcc to figure out after the latest rework. Adding an intialization to the most likely value together with the other struct members shuts up that warning. Fixes: b235b9808664 ("kasan: unify report headers") Link: https://patchwork.kernel.org/patch/9641417/ Link: http://lkml.kernel.org/r/20170725152739.4176967-1-arnd@arndb.de Signed-off-by: Arnd Bergmann Suggested-by: Alexander Potapenko Suggested-by: Andrey Ryabinin Acked-by: Andrey Ryabinin Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kasan/report.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm') diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 04bb1d3eb9ec..6bcfb01ba038 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -401,6 +401,7 @@ void kasan_report(unsigned long addr, size_t size, disable_trace_on_warning(); info.access_addr = (void *)addr; + info.first_bad_addr = (void *)addr; info.access_size = size; info.is_write = is_write; info.ip = ip; -- cgit From 3189c82056beeaf26733296be21043c091743cd9 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 2 Aug 2017 13:32:03 -0700 Subject: zram: do not free pool->size_class Mike reported kernel goes oops with ltp:zram03 testcase. zram: Added device: zram0 zram0: detected capacity change from 0 to 107374182400 BUG: unable to handle kernel paging request at 0000306d61727a77 IP: zs_map_object+0xb9/0x260 PGD 0 P4D 0 Oops: 0000 [#1] SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: zram(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) loop(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) af_packet(E) br_netfilter(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_powerclamp(E) coretemp(E) cdc_ether(E) kvm_intel(E) usbnet(E) mii(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) iTCO_wdt(E) ghash_clmulni_intel(E) bnx2(E) iTCO_vendor_support(E) pcbc(E) ioatdma(E) ipmi_ssif(E) aesni_intel(E) i5500_temp(E) i2c_i801(E) aes_x86_64(E) lpc_ich(E) shpchp(E) mfd_core(E) crypto_simd(E) i7core_edac(E) dca(E) glue_helper(E) cryptd(E) ipmi_si(E) button(E) acpi_cpufreq(E) ipmi_devintf(E) pcspkr(E) ipmi_msghandler(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) mbcache(E) jbd2(E) sd_mod(E) ata_generic(E) i2c_algo_bit(E) ata_piix(E) drm_kms_helper(E) ahci(E) syscopyarea(E) sysfillrect(E) libahci(E) sysimgblt(E) fb_sys_fops(E) uhci_hcd(E) ehci_pci(E) ttm(E) ehci_hcd(E) libata(E) drm(E) megaraid_sas(E) usbcore(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E) [last unloaded: zram] CPU: 6 PID: 12356 Comm: swapon Tainted: G E 4.13.0.g87b2c3f-default #194 Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698 , BIOS -[D6E150AUS-1.10]- 12/15/2010 task: ffff880158d2c4c0 task.stack: ffffc90001680000 RIP: 0010:zs_map_object+0xb9/0x260 Call Trace: zram_bvec_rw.isra.26+0xe8/0x780 [zram] zram_rw_page+0x6e/0xa0 [zram] bdev_read_page+0x81/0xb0 do_mpage_readpage+0x51a/0x710 mpage_readpages+0x122/0x1a0 blkdev_readpages+0x1d/0x20 __do_page_cache_readahead+0x1b2/0x270 ondemand_readahead+0x180/0x2c0 page_cache_sync_readahead+0x31/0x50 generic_file_read_iter+0x7e7/0xaf0 blkdev_read_iter+0x37/0x40 __vfs_read+0xce/0x140 vfs_read+0x9e/0x150 SyS_read+0x46/0xa0 entry_SYSCALL_64_fastpath+0x1a/0xa5 Code: 81 e6 00 c0 3f 00 81 fe 00 00 16 00 0f 85 9f 01 00 00 0f b7 13 65 ff 05 5e 07 dc 7e 66 c1 ea 02 81 e2 ff 01 00 00 49 8b 54 d4 08 <8b> 4a 48 41 0f af ce 81 e1 ff 0f 00 00 41 89 c9 48 c7 c3 a0 70 RIP: zs_map_object+0xb9/0x260 RSP: ffffc90001683988 CR2: 0000306d61727a77 He bisected the problem is [1]. After commit cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size handling"), zram doesn't use double pointer for pool->size_class any more in zs_create_pool so counter function zs_destroy_pool don't need to free it, either. Otherwise, it does kfree wrong address and then, kernel goes Oops. Link: http://lkml.kernel.org/r/20170725062650.GA12134@bbox Fixes: cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size handling") Signed-off-by: Minchan Kim Reported-by: Mike Galbraith Tested-by: Mike Galbraith Reviewed-by: Sergey Senozhatsky Cc: Jerome Marchand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/zsmalloc.c | 1 - 1 file changed, 1 deletion(-) (limited to 'mm') diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 013eea76685e..308acb9d814b 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -2453,7 +2453,6 @@ void zs_destroy_pool(struct zs_pool *pool) } destroy_cache(pool); - kfree(pool->size_class); kfree(pool->name); kfree(pool); } -- cgit From b0ba2d0faf6520dd0e534a9410114e043e9ab084 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Wed, 2 Aug 2017 13:32:09 -0700 Subject: mm/page_io.c: fix oops during block io poll in swapin path When a thread is OOM-killed during swap_readpage() operation, an oops occurs because end_swap_bio_read() is calling wake_up_process() based on an assumption that the thread which called swap_readpage() is still alive. Out of memory: Kill process 525 (polkitd) score 0 or sacrifice child Killed process 525 (polkitd) total-vm:528128kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB oom_reaper: reaped process 525 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon sg shpchp vmw_vmci parport_pc parport i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi vmwgfx ahci libahci drm_kms_helper ata_piix syscopyarea sysfillrect sysimgblt fb_sys_fops mptspi scsi_transport_spi ttm e1000 mptscsih drm mptbase i2c_core libata serio_raw CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-rc2-next-20170725 #129 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 task: ffffffffb7c16500 task.stack: ffffffffb7c00000 RIP: 0010:__lock_acquire+0x151/0x12f0 Call Trace: lock_acquire+0x59/0x80 _raw_spin_lock_irqsave+0x3b/0x4f try_to_wake_up+0x3b/0x410 wake_up_process+0x10/0x20 end_swap_bio_read+0x6f/0xf0 bio_endio+0x92/0xb0 blk_update_request+0x88/0x270 scsi_end_request+0x32/0x1c0 scsi_io_completion+0x209/0x680 scsi_finish_command+0xd4/0x120 scsi_softirq_done+0x120/0x140 __blk_mq_complete_request_remote+0xe/0x10 flush_smp_call_function_queue+0x51/0x120 generic_smp_call_function_single_interrupt+0xe/0x20 smp_trace_call_function_single_interrupt+0x22/0x30 smp_call_function_single_interrupt+0x9/0x10 call_function_single_interrupt+0xa7/0xb0 RIP: 0010:native_safe_halt+0x6/0x10 default_idle+0xe/0x20 arch_cpu_idle+0xa/0x10 default_idle_call+0x1e/0x30 do_idle+0x187/0x200 cpu_startup_entry+0x6e/0x70 rest_init+0xd0/0xe0 start_kernel+0x456/0x477 x86_64_start_reservations+0x24/0x26 x86_64_start_kernel+0xf7/0x11a secondary_startup_64+0xa5/0xa5 Code: c3 49 81 3f 20 9e 0b b8 41 bc 00 00 00 00 44 0f 45 e2 83 fe 01 0f 87 62 ff ff ff 89 f0 49 8b 44 c7 08 48 85 c0 0f 84 52 ff ff ff ff 80 98 01 00 00 8b 3d 5a 49 c4 01 45 8b b3 18 0c 00 00 85 RIP: __lock_acquire+0x151/0x12f0 RSP: ffffa01f39e03c50 ---[ end trace 6c441db499169b1e ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception in interrupt Fix it by holding a reference to the thread. [akpm@linux-foundation.org: add comment] Fixes: 23955622ff8d231b ("swap: add block io poll in swapin path") Signed-off-by: Tetsuo Handa Reviewed-by: Shaohua Li Cc: Tim Chen Cc: Huang Ying Cc: Jens Axboe Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_io.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'mm') diff --git a/mm/page_io.c b/mm/page_io.c index b6c4ac388209..5f61b54ee1f3 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -22,6 +22,7 @@ #include #include #include +#include #include static struct bio *get_swap_bio(gfp_t gfp_flags, @@ -136,6 +137,7 @@ out: WRITE_ONCE(bio->bi_private, NULL); bio_put(bio); wake_up_process(waiter); + put_task_struct(waiter); } int generic_swapfile_activate(struct swap_info_struct *sis, @@ -378,6 +380,11 @@ int swap_readpage(struct page *page, bool do_poll) goto out; } bdev = bio->bi_bdev; + /* + * Keep this task valid during swap readpage because the oom killer may + * attempt to access it in the page fault retry time check. + */ + get_task_struct(current); bio->bi_private = current; bio_set_op_attrs(bio, REQ_OP_READ, 0); count_vm_event(PSWPIN); -- cgit From 167d0f258fedbfc859ad4105b1ea236818d41bdd Mon Sep 17 00:00:00 2001 From: Heiko Carstens Date: Wed, 2 Aug 2017 13:32:12 -0700 Subject: mm: take memory hotplug lock within numa_zonelist_order_handler() Andre Wild reported the following warning: WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60 Modules linked in: CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10 Hardware name: IBM 2964 N96 702 (z/VM 6.4.0) task: 00000000701d8100 task.stack: 0000000073594000 Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60) ... Call Trace: lockdep_assert_cpus_held+0x42/0x60) stop_machine_cpuslocked+0x62/0xf0 build_all_zonelists+0x92/0x150 numa_zonelist_order_handler+0x102/0x150 proc_sys_call_handler.isra.12+0xda/0x118 proc_sys_write+0x34/0x48 __vfs_write+0x3c/0x178 vfs_write+0xbc/0x1a0 SyS_write+0x66/0xc0 system_call+0xc4/0x2b0 locks held by bash/1205: #0: (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0 #1: (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150 #2: (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150 Last Breaking-Event-Address: lockdep_assert_cpus_held+0x48/0x60 This can be easily triggered with e.g. echo n > /proc/sys/vm/numa_zonelist_order In commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu rwsem") memory hotplug locking was changed to fix a potential deadlock. This also switched the stop_machine() invocation within build_all_zonelists() to stop_machine_cpuslocked() which now expects that online cpus are locked when being called. This assumption is not true if build_all_zonelists() is being called from numa_zonelist_order_handler(). In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done() pair to numa_zonelist_order_handler(). Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com Fixes: 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu rwsem") Signed-off-by: Heiko Carstens Reported-by: Andre Wild Acked-by: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: Thomas Gleixner Cc: Vlastimil Babka Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6d30e914afb6..fc32aa81f359 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4891,9 +4891,11 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write, NUMA_ZONELIST_ORDER_LEN); user_zonelist_order = oldval; } else if (oldval != user_zonelist_order) { + mem_hotplug_begin(); mutex_lock(&zonelists_mutex); build_all_zonelists(NULL, NULL); mutex_unlock(&zonelists_mutex); + mem_hotplug_done(); } } out: -- cgit From d507e2ebd2c7be9138e5cf5c0cb1931c90c42ab1 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 10 Aug 2017 15:23:31 -0700 Subject: mm: fix global NR_SLAB_.*CLAIMABLE counter reads As Tetsuo points out: "Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters") broke "Slab:" field of /proc/meminfo . It shows nearly 0kB" In addition to /proc/meminfo, this problem also affects the slab counters OOM/allocation failure info dumps, can cause early -ENOMEM from overcommit protection, and miscalculate image size requirements during suspend-to-disk. This is because the patch in question switched the slab counters from the zone level to the node level, but forgot to update the global accessor functions to read the aggregate node data instead of the aggregate zone data. Use global_node_page_state() to access the global slab counters. Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters") Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reported-by: Tetsuo Handa Acked-by: Michal Hocko Cc: Josef Bacik Cc: Vladimir Davydov Cc: Stefan Agner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 9 +++++---- mm/util.c | 2 +- 2 files changed, 6 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fc32aa81f359..626a430e32d1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4458,8 +4458,9 @@ long si_mem_available(void) * Part of the reclaimable slab consists of items that are in use, * and cannot be freed. Cap this estimate at the low watermark. */ - available += global_page_state(NR_SLAB_RECLAIMABLE) - - min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low); + available += global_node_page_state(NR_SLAB_RECLAIMABLE) - + min(global_node_page_state(NR_SLAB_RECLAIMABLE) / 2, + wmark_low); if (available < 0) available = 0; @@ -4602,8 +4603,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) global_node_page_state(NR_FILE_DIRTY), global_node_page_state(NR_WRITEBACK), global_node_page_state(NR_UNSTABLE_NFS), - global_page_state(NR_SLAB_RECLAIMABLE), - global_page_state(NR_SLAB_UNRECLAIMABLE), + global_node_page_state(NR_SLAB_RECLAIMABLE), + global_node_page_state(NR_SLAB_UNRECLAIMABLE), global_node_page_state(NR_FILE_MAPPED), global_node_page_state(NR_SHMEM), global_page_state(NR_PAGETABLE), diff --git a/mm/util.c b/mm/util.c index 7b07ec852e01..9ecddf568fe3 100644 --- a/mm/util.c +++ b/mm/util.c @@ -633,7 +633,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin) * which are reclaimable, under pressure. The dentry * cache and most inode caches should fall into this */ - free += global_page_state(NR_SLAB_RECLAIMABLE); + free += global_node_page_state(NR_SLAB_RECLAIMABLE); /* * Leave reserved pages. The pages are not for anonymous pages. -- cgit From 75dddef32514f7aa58930bde6a1263253bc3d4ba Mon Sep 17 00:00:00 2001 From: Jonathan Toppins Date: Thu, 10 Aug 2017 15:23:35 -0700 Subject: mm: ratelimit PFNs busy info message The RDMA subsystem can generate several thousand of these messages per second eventually leading to a kernel crash. Ratelimit these messages to prevent this crash. Doug said: "I've been carrying a version of this for several kernel versions. I don't remember when they started, but we have one (and only one) class of machines: Dell PE R730xd, that generate these errors. When it happens, without a rate limit, we get rcu timeouts and kernel oopses. With the rate limit, we just get a lot of annoying kernel messages but the machine continues on, recovers, and eventually the memory operations all succeed" And: "> Well... why are all these EBUSY's occurring? It sounds inefficient > (at least) but if it is expected, normal and unavoidable then > perhaps we should just remove that message altogether? I don't have an answer to that question. To be honest, I haven't looked real hard. We never had this at all, then it started out of the blue, but only on our Dell 730xd machines (and it hits all of them), but no other classes or brands of machines. And we have our 730xd machines loaded up with different brands and models of cards (for instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines meant it wasn't tied to any particular brand/model of RDMA hardware. To me, it always smelled of a hardware oddity specific to maybe the CPUs or mainboard chipsets in these machines, so given that I'm not an mm expert anyway, I never chased it down. A few other relevant details: it showed up somewhere around 4.8/4.9 or thereabouts. It never happened before, but the prinkt has been there since the 3.18 days, so possibly the test to trigger this message was changed, or something else in the allocator changed such that the situation started happening on these machines? And, like I said, it is specific to our 730xd machines (but they are all identical, so that could mean it's something like their specific ram configuration is causing the allocator to hit this on these machine but not on other machines in the cluster, I don't want to say it's necessarily the model of chipset or CPU, there are other bits of identicalness between these machines)" Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com Signed-off-by: Jonathan Toppins Reviewed-by: Doug Ledford Tested-by: Doug Ledford Cc: Michal Hocko Cc: Vlastimil Babka Cc: Mel Gorman Cc: Hillf Danton Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 626a430e32d1..6d00f746c2fd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7669,7 +7669,7 @@ int alloc_contig_range(unsigned long start, unsigned long end, /* Make sure the range is really isolated. */ if (test_pages_isolated(outer_start, end, false)) { - pr_info("%s: [%lx, %lx) PFNs busy\n", + pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n", __func__, outer_start, end); ret = -EBUSY; goto done; -- cgit From 5af10dfd0afc559bb4b0f7e3e8227a1578333995 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Thu, 10 Aug 2017 15:23:38 -0700 Subject: userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case huge_add_to_page_cache->add_to_page_cache implicitly unlocks the page before returning in case of errors. The error returned was -EEXIST by running UFFDIO_COPY on a non-hole offset of a VM_SHARED hugetlbfs mapping. It was an userland bug that triggered it and the kernel must cope with it returning -EEXIST from ioctl(UFFDIO_COPY) as expected. page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) kernel BUG at mm/filemap.c:964! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 22582 Comm: qemu-system-x86 Not tainted 4.11.11-300.fc26.x86_64 #1 RIP: unlock_page+0x4a/0x50 Call Trace: hugetlb_mcopy_atomic_pte+0xc0/0x320 mcopy_atomic+0x96f/0xbe0 userfaultfd_ioctl+0x218/0xe90 do_vfs_ioctl+0xa5/0x600 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x1a/0xa9 Link: http://lkml.kernel.org/r/20170802165145.22628-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli Tested-by: Maxime Coquelin Reviewed-by: Mike Kravetz Cc: "Dr. David Alan Gilbert" Cc: Mike Rapoport Cc: Alexey Perevalov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a1a0ac0ad6f6..31e207cb399b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4062,9 +4062,9 @@ out: return ret; out_release_unlock: spin_unlock(ptl); -out_release_nounlock: if (vm_shared) unlock_page(page); +out_release_nounlock: put_page(page); goto out; } -- cgit From 16af97dc5a8975371a83d9e30a64038b48f40a2d Mon Sep 17 00:00:00 2001 From: Nadav Amit Date: Thu, 10 Aug 2017 15:23:56 -0700 Subject: mm: migrate: prevent racy access to tlb_flush_pending Patch series "fixes of TLB batching races", v6. It turns out that Linux TLB batching mechanism suffers from various races. Races that are caused due to batching during reclamation were recently handled by Mel and this patch-set deals with others. The more fundamental issue is that concurrent updates of the page-tables allow for TLB flushes to be batched on one core, while another core changes the page-tables. This other core may assume a PTE change does not require a flush based on the updated PTE value, while it is unaware that TLB flushes are still pending. This behavior affects KSM (which may result in memory corruption) and MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED. Memory corruption in KSM is harder to produce in practice, but was observed by hacking the kernel and adding a delay before flushing and replacing the KSM page. Finally, there is also one memory barrier missing, which may affect architectures with weak memory model. This patch (of 7): Setting and clearing mm->tlb_flush_pending can be performed by multiple threads, since mmap_sem may only be acquired for read in task_numa_work(). If this happens, tlb_flush_pending might be cleared while one of the threads still changes PTEs and batches TLB flushes. This can lead to the same race between migration and change_protection_range() that led to the introduction of tlb_flush_pending. The result of this race was data corruption, which means that this patch also addresses a theoretically possible data corruption. An actual data corruption was not observed, yet the race was was confirmed by adding assertion to check tlb_flush_pending is not set by two threads, adding artificial latency in change_protection_range() and using sysctl to reduce kernel.numa_balancing_scan_delay_ms. Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com Fixes: 20841405940e ("mm: fix TLB flush race between migration, and change_protection_range") Signed-off-by: Nadav Amit Acked-by: Mel Gorman Acked-by: Rik van Riel Acked-by: Minchan Kim Cc: Andy Lutomirski Cc: Hugh Dickins Cc: "David S. Miller" Cc: Andrea Arcangeli Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jeff Dike Cc: Martin Schwidefsky Cc: Mel Gorman Cc: Russell King Cc: Sergey Senozhatsky Cc: Tony Luck Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/debug.c | 2 +- mm/mprotect.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/debug.c b/mm/debug.c index db1cd26d8752..d70103bb4731 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -159,7 +159,7 @@ void dump_mm(const struct mm_struct *mm) mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq, #endif #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) - mm->tlb_flush_pending, + atomic_read(&mm->tlb_flush_pending), #endif mm->def_flags, &mm->def_flags ); diff --git a/mm/mprotect.c b/mm/mprotect.c index 4180ad8cc9c5..bd0f409922cb 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -244,7 +244,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma, BUG_ON(addr >= end); pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); - set_tlb_flush_pending(mm); + inc_tlb_flush_pending(mm); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) @@ -256,7 +256,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma, /* Only flush the TLB if we actually modified any entries: */ if (pages) flush_tlb_range(vma, start, end); - clear_tlb_flush_pending(mm); + dec_tlb_flush_pending(mm); return pages; } -- cgit From a9b802500ebbff1544519a2969323b719dac21f0 Mon Sep 17 00:00:00 2001 From: Nadav Amit Date: Thu, 10 Aug 2017 15:24:02 -0700 Subject: Revert "mm: numa: defer TLB flush for THP migration as long as possible" While deferring TLB flushes is a good practice, the reverted patch caused pending TLB flushes to be checked while the page-table lock is not taken. As a result, in architectures with weak memory model (PPC), Linux may miss a memory-barrier, miss the fact TLB flushes are pending, and cause (in theory) a memory corruption. Since the alternative of using smp_mb__after_unlock_lock() was considered a bit open-coded, and the performance impact is expected to be small, the previous patch is reverted. This reverts b0943d61b8fa ("mm: numa: defer TLB flush for THP migration as long as possible"). Link: http://lkml.kernel.org/r/20170802000818.4760-4-namit@vmware.com Signed-off-by: Nadav Amit Suggested-by: Mel Gorman Acked-by: Mel Gorman Acked-by: Rik van Riel Cc: Minchan Kim Cc: Sergey Senozhatsky Cc: Andy Lutomirski Cc: "David S. Miller" Cc: Andrea Arcangeli Cc: Heiko Carstens Cc: Hugh Dickins Cc: Ingo Molnar Cc: Jeff Dike Cc: Martin Schwidefsky Cc: Mel Gorman Cc: Nadav Amit Cc: Russell King Cc: Tony Luck Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 7 +++++++ mm/migrate.c | 6 ------ 2 files changed, 7 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 86975dec0ba1..216114f6ef0b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1495,6 +1495,13 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) goto clear_pmdnuma; } + /* + * The page_table_lock above provides a memory barrier + * with change_protection_range. + */ + if (mm_tlb_flush_pending(vma->vm_mm)) + flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + /* * Migrate the THP to the requested node, returns with page unlocked * and access rights restored. diff --git a/mm/migrate.c b/mm/migrate.c index 627671551873..d68a41da6abb 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1937,12 +1937,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, put_page(new_page); goto out_fail; } - /* - * We are not sure a pending tlb flush here is for a huge page - * mapping or not. Hence use the tlb range variant - */ - if (mm_tlb_flush_pending(mm)) - flush_tlb_range(vma, mmun_start, mmun_end); /* Prepare a page as a migration target */ __SetPageLocked(new_page); -- cgit From 56236a59556cfd3bae7bffb7e5f438b5ef0af880 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 10 Aug 2017 15:24:05 -0700 Subject: mm: refactor TLB gathering API This patch is a preparatory patch for solving race problems caused by TLB batch. For that, we will increase/decrease TLB flush pending count of mm_struct whenever tlb_[gather|finish]_mmu is called. Before making it simple, this patch separates architecture specific part and rename it to arch_tlb_[gather|finish]_mmu and generic part just calls it. It shouldn't change any behavior. Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.com Signed-off-by: Minchan Kim Signed-off-by: Nadav Amit Acked-by: Mel Gorman Cc: Ingo Molnar Cc: Russell King Cc: Tony Luck Cc: Martin Schwidefsky Cc: "David S. Miller" Cc: Heiko Carstens Cc: Yoshinori Sato Cc: Jeff Dike Cc: Andrea Arcangeli Cc: Andy Lutomirski Cc: Hugh Dickins Cc: Mel Gorman Cc: Nadav Amit Cc: Rik van Riel Cc: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 28 +++++++++++++++++++++------- 1 file changed, 21 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index f65beaad319b..34cba5113e06 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -215,12 +215,8 @@ static bool tlb_next_batch(struct mmu_gather *tlb) return true; } -/* tlb_gather_mmu - * Called to initialize an (on-stack) mmu_gather structure for page-table - * tear-down from @mm. The @fullmm argument is used when @mm is without - * users and we're going to destroy the full address space (exit/execve). - */ -void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end) +void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, + unsigned long start, unsigned long end) { tlb->mm = mm; @@ -275,7 +271,8 @@ void tlb_flush_mmu(struct mmu_gather *tlb) * Called at the end of the shootdown operation to free up any resources * that were required. */ -void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) +void arch_tlb_finish_mmu(struct mmu_gather *tlb, + unsigned long start, unsigned long end) { struct mmu_gather_batch *batch, *next; @@ -398,6 +395,23 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table) #endif /* CONFIG_HAVE_RCU_TABLE_FREE */ +/* tlb_gather_mmu + * Called to initialize an (on-stack) mmu_gather structure for page-table + * tear-down from @mm. The @fullmm argument is used when @mm is without + * users and we're going to destroy the full address space (exit/execve). + */ +void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + arch_tlb_gather_mmu(tlb, mm, start, end); +} + +void tlb_finish_mmu(struct mmu_gather *tlb, + unsigned long start, unsigned long end) +{ + arch_tlb_finish_mmu(tlb, start, end); +} + /* * Note: this doesn't free the actual pages themselves. That * has been handled earlier when unmapping all the memory regions. -- cgit From 0a2dd266dd6b7a31503b5bbe63af05961a6b446d Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 10 Aug 2017 15:24:09 -0700 Subject: mm: make tlb_flush_pending global Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING| COMPACTION] but upcoming patches to solve subtle TLB flush batching problem will use it regardless of compaction/NUMA so this patch doesn't remove the dependency. [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement] Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com Signed-off-by: Minchan Kim Signed-off-by: Nadav Amit Acked-by: Mel Gorman Cc: "David S. Miller" Cc: Andrea Arcangeli Cc: Andy Lutomirski Cc: Heiko Carstens Cc: Hugh Dickins Cc: Ingo Molnar Cc: Jeff Dike Cc: Martin Schwidefsky Cc: Mel Gorman Cc: Nadav Amit Cc: Rik van Riel Cc: Russell King Cc: Sergey Senozhatsky Cc: Tony Luck Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/debug.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'mm') diff --git a/mm/debug.c b/mm/debug.c index d70103bb4731..5715448ab0b5 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -124,9 +124,7 @@ void dump_mm(const struct mm_struct *mm) #ifdef CONFIG_NUMA_BALANCING "numa_next_scan %lu numa_scan_offset %lu numa_scan_seq %d\n" #endif -#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) "tlb_flush_pending %d\n" -#endif "def_flags: %#lx(%pGv)\n", mm, mm->mmap, mm->vmacache_seqnum, mm->task_size, @@ -158,9 +156,7 @@ void dump_mm(const struct mm_struct *mm) #ifdef CONFIG_NUMA_BALANCING mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq, #endif -#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) atomic_read(&mm->tlb_flush_pending), -#endif mm->def_flags, &mm->def_flags ); } -- cgit From 99baac21e4585f4258f919502c6e23f1e5edc98c Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 10 Aug 2017 15:24:12 -0700 Subject: mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem Nadav reported parallel MADV_DONTNEED on same range has a stale TLB problem and Mel fixed it[1] and found same problem on MADV_FREE[2]. Quote from Mel Gorman: "The race in question is CPU 0 running madv_free and updating some PTEs while CPU 1 is also running madv_free and looking at the same PTEs. CPU 1 may have writable TLB entries for a page but fail the pte_dirty check (because CPU 0 has updated it already) and potentially fail to flush. Hence, when madv_free on CPU 1 returns, there are still potentially writable TLB entries and the underlying PTE is still present so that a subsequent write does not necessarily propagate the dirty bit to the underlying PTE any more. Reclaim at some unknown time at the future may then see that the PTE is still clean and discard the page even though a write has happened in the meantime. I think this is possible but I could have missed some protection in madv_free that prevents it happening." This patch aims for solving both problems all at once and is ready for other problem with KSM, MADV_FREE and soft-dirty story[3]. TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can catch there are parallel threads going on. In that case, forcefully, flush TLB to prevent for user to access memory via stale TLB entry although it fail to gather page table entry. I confirmed this patch works with [4] test program Nadav gave so this patch supersedes "mm: Always flush VMA ranges affected by zap_page_range v2" in current mmotm. NOTE: This patch modifies arch-specific TLB gathering interface(x86, ia64, s390, sh, um). It seems most of architecture are straightforward but s390 need to be careful because tlb_flush_mmu works only if mm->context.flush_mm is set to non-zero which happens only a pte entry really is cleared by ptep_get_and_clear and friends. However, this problem never changes the pte entries but need to flush to prevent memory access from stale tlb. [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com [4] https://patchwork.kernel.org/patch/9861621/ [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu] Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.com Signed-off-by: Minchan Kim Signed-off-by: Nadav Amit Reported-by: Nadav Amit Reported-by: Mel Gorman Acked-by: Mel Gorman Cc: Ingo Molnar Cc: Russell King Cc: Tony Luck Cc: Martin Schwidefsky Cc: "David S. Miller" Cc: Heiko Carstens Cc: Yoshinori Sato Cc: Jeff Dike Cc: Andrea Arcangeli Cc: Andy Lutomirski Cc: Hugh Dickins Cc: Mel Gorman Cc: Nadav Amit Cc: Rik van Riel Cc: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 34cba5113e06..e158f7ac6730 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -272,10 +272,13 @@ void tlb_flush_mmu(struct mmu_gather *tlb) * that were required. */ void arch_tlb_finish_mmu(struct mmu_gather *tlb, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, bool force) { struct mmu_gather_batch *batch, *next; + if (force) + __tlb_adjust_range(tlb, start, end - start); + tlb_flush_mmu(tlb); /* keep the page table cache within bounds */ @@ -404,12 +407,23 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end) { arch_tlb_gather_mmu(tlb, mm, start, end); + inc_tlb_flush_pending(tlb->mm); } void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) { - arch_tlb_finish_mmu(tlb, start, end); + /* + * If there are parallel threads are doing PTE changes on same range + * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB + * flush by batching, a thread has stable TLB entry can fail to flush + * the TLB by observing pte_none|!pte_dirty, for example so flush TLB + * forcefully if we detect parallel PTE batching threads. + */ + bool force = mm_tlb_flush_nested(tlb->mm); + + arch_tlb_finish_mmu(tlb, start, end, force); + dec_tlb_flush_pending(tlb->mm); } /* -- cgit From b3a81d0841a954a3d3e782059960f53e63b37066 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 10 Aug 2017 15:24:15 -0700 Subject: mm: fix KSM data corruption Nadav reported KSM can corrupt the user data by the TLB batching race[1]. That means data user written can be lost. Quote from Nadav Amit: "For this race we need 4 CPUs: CPU0: Caches a writable and dirty PTE entry, and uses the stale value for write later. CPU1: Runs madvise_free on the range that includes the PTE. It would clear the dirty-bit. It batches TLB flushes. CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We care about the fact that it clears the PTE write-bit, and of course, batches TLB flushes. CPU3: Runs KSM. Our purpose is to pass the following test in write_protect_page(): if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) Since it will avoid TLB flush. And we want to do it while the PTE is stale. Later, and before replacing the page, we would be able to change the page. Note that all the operations the CPU1-3 perform canhappen in parallel since they only acquire mmap_sem for read. We start with two identical pages. Everything below regards the same page/PTE. CPU0 CPU1 CPU2 CPU3 ---- ---- ---- ---- Write the same value on page [cache PTE as dirty in TLB] MADV_FREE pte_mkclean() 4 > clear_refs pte_wrprotect() write_protect_page() [ success, no flush ] pages_indentical() [ ok ] Write to page different value [Ok, using stale PTE] replace_page() Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0 already wrote on the page, but KSM ignored this write, and it got lost" In above scenario, MADV_FREE is fixed by changing TLB batching API including [set|clear]_tlb_flush_pending. Remained thing is soft-dirty part. This patch changes soft-dirty uses TLB batching API instead of flush_tlb_mm and KSM checks pending TLB flush by using mm_tlb_flush_pending so that it will flush TLB to avoid data lost if there are other parallel threads pending TLB flush. [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.com Signed-off-by: Minchan Kim Signed-off-by: Nadav Amit Reported-by: Nadav Amit Tested-by: Nadav Amit Reviewed-by: Andrea Arcangeli Cc: Mel Gorman Cc: Hugh Dickins Cc: "David S. Miller" Cc: Andy Lutomirski Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jeff Dike Cc: Martin Schwidefsky Cc: Mel Gorman Cc: Nadav Amit Cc: Rik van Riel Cc: Russell King Cc: Sergey Senozhatsky Cc: Tony Luck Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/ksm.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/ksm.c b/mm/ksm.c index 4dc92f138786..db20f8436bc3 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1038,7 +1038,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, goto out_unlock; if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || - (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) { + (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) || + mm_tlb_flush_pending(mm)) { pte_t entry; swapped = PageSwapCache(page); -- cgit From af54aed94bf3a1cf0b847bbbf00f9a58e278b338 Mon Sep 17 00:00:00 2001 From: Wei Wang Date: Thu, 10 Aug 2017 15:24:21 -0700 Subject: mm/balloon_compaction.c: don't zero ballooned pages Revert commit bb01b64cfab7 ("mm/balloon_compaction.c: enqueue zero page to balloon device")' Zeroing ballon pages is rather time consuming, especially when a lot of pages are in flight. E.g. 7GB worth of ballooned memory takes 2.8s with __GFP_ZERO while it takes ~491ms without it. The original commit argued that zeroing will help ksmd to merge these pages on the host but this argument is assuming that the host actually marks balloon pages for ksm which is not universally true. So we pay performance penalty for something that even might not be used in the end which is wrong. The host can zero out pages on its own when there is a need. [mhocko@kernel.org: new changelog text] Link: http://lkml.kernel.org/r/1501761557-9758-1-git-send-email-wei.w.wang@intel.com Fixes: bb01b64cfab7 ("mm/balloon_compaction.c: enqueue zero page to balloon device") Signed-off-by: Wei Wang Acked-by: Michael S. Tsirkin Acked-by: Michal Hocko Cc: zhenwei.pi Cc: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/balloon_compaction.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index 9075aa54e955..b06d9fe23a28 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -24,7 +24,7 @@ struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info) { unsigned long flags; struct page *page = alloc_page(balloon_mapping_gfp_mask() | - __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_ZERO); + __GFP_NOMEMALLOC | __GFP_NORETRY); if (!page) return NULL; -- cgit From d041353dc98a6339182cd6f628b4c8f111278cb3 Mon Sep 17 00:00:00 2001 From: Cong Wang Date: Thu, 10 Aug 2017 15:24:24 -0700 Subject: mm: fix list corruptions on shmem shrinklist We saw many list corruption warnings on shmem shrinklist: WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0 list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960 Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1 Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013 Call Trace: dump_stack+0x4d/0x66 __warn+0xcb/0xf0 warn_slowpath_fmt+0x4f/0x60 __list_del_entry+0x9e/0xc0 shmem_unused_huge_shrink+0xfa/0x2e0 shmem_unused_huge_scan+0x20/0x30 super_cache_scan+0x193/0x1a0 shrink_slab.part.41+0x1e3/0x3f0 shrink_slab+0x29/0x30 shrink_node+0xf9/0x2f0 kswapd+0x2d8/0x6c0 kthread+0xd7/0xf0 ret_from_fork+0x22/0x30 WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0 list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8). Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G W 4.9.34-t3.el7.twitter.x86_64 #1 Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013 Call Trace: dump_stack+0x4d/0x66 __warn+0xcb/0xf0 warn_slowpath_fmt+0x4f/0x60 __list_add+0x89/0xb0 shmem_setattr+0x204/0x230 notify_change+0x2ef/0x440 do_truncate+0x5d/0x90 path_openat+0x331/0x1190 do_filp_open+0x7e/0xe0 do_sys_open+0x123/0x200 SyS_open+0x1e/0x20 do_syscall_64+0x61/0x170 entry_SYSCALL64_slow_path+0x25/0x25 The problem is that shmem_unused_huge_shrink() moves entries from the global sbinfo->shrinklist to its local lists and then releases the spinlock. However, a parallel shmem_setattr() could access one of these entries directly and add it back to the global shrinklist if it is removed, with the spinlock held. The logic itself looks solid since an entry could be either in a local list or the global list, otherwise it is removed from one of them by list_del_init(). So probably the race condition is that, one CPU is in the middle of INIT_LIST_HEAD() but the other CPU calls list_empty() which returns true too early then the following list_add_tail() sees a corrupted entry. list_empty_careful() is designed to fix this situation. [akpm@linux-foundation.org: add comments] Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@gmail.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Cong Wang Acked-by: Linus Torvalds Acked-by: Kirill A. Shutemov Cc: Hugh Dickins Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/shmem.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/shmem.c b/mm/shmem.c index b0aa6075d164..6540e5982444 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1022,7 +1022,11 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr) */ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) { spin_lock(&sbinfo->shrinklist_lock); - if (list_empty(&info->shrinklist)) { + /* + * _careful to defend against unlocked access to + * ->shrink_list in shmem_unused_huge_shrink() + */ + if (list_empty_careful(&info->shrinklist)) { list_add_tail(&info->shrinklist, &sbinfo->shrinklist); sbinfo->shrinklist_len++; @@ -1817,7 +1821,11 @@ alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, info, sbinfo, * to shrink under memory pressure. */ spin_lock(&sbinfo->shrinklist_lock); - if (list_empty(&info->shrinklist)) { + /* + * _careful to defend against unlocked access to + * ->shrink_list in shmem_unused_huge_shrink() + */ + if (list_empty_careful(&info->shrinklist)) { list_add_tail(&info->shrinklist, &sbinfo->shrinklist); sbinfo->shrinklist_len++; -- cgit From aac2fea94f7a3df8ad1eeb477eb2643f81fd5393 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Thu, 10 Aug 2017 15:24:27 -0700 Subject: rmap: do not call mmu_notifier_invalidate_page() under ptl MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MMU notifiers can sleep, but in page_mkclean_one() we call mmu_notifier_invalidate_page() under page table lock. Let's instead use mmu_notifier_invalidate_range() outside page_vma_mapped_walk() loop. [jglisse@redhat.com: try_to_unmap_one() do not call mmu_notifier under ptl] Link: http://lkml.kernel.org/r/20170809204333.27485-1-jglisse@redhat.com Link: http://lkml.kernel.org/r/20170804134928.l4klfcnqatni7vsc@black.fi.intel.com Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()") Signed-off-by: Kirill A. Shutemov Signed-off-by: Jérôme Glisse Reported-by: axie Cc: Alex Deucher Cc: "Writer, Tim" Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/rmap.c | 52 ++++++++++++++++++++++++++++++---------------------- 1 file changed, 30 insertions(+), 22 deletions(-) (limited to 'mm') diff --git a/mm/rmap.c b/mm/rmap.c index c8993c63eb25..c1286d47aa1f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -888,10 +888,10 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, .flags = PVMW_SYNC, }; int *cleaned = arg; + bool invalidation_needed = false; while (page_vma_mapped_walk(&pvmw)) { int ret = 0; - address = pvmw.address; if (pvmw.pte) { pte_t entry; pte_t *pte = pvmw.pte; @@ -899,11 +899,11 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, if (!pte_dirty(*pte) && !pte_write(*pte)) continue; - flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, address, pte); + flush_cache_page(vma, pvmw.address, pte_pfn(*pte)); + entry = ptep_clear_flush(vma, pvmw.address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); - set_pte_at(vma->vm_mm, address, pte, entry); + set_pte_at(vma->vm_mm, pvmw.address, pte, entry); ret = 1; } else { #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE @@ -913,11 +913,11 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, if (!pmd_dirty(*pmd) && !pmd_write(*pmd)) continue; - flush_cache_page(vma, address, page_to_pfn(page)); - entry = pmdp_huge_clear_flush(vma, address, pmd); + flush_cache_page(vma, pvmw.address, page_to_pfn(page)); + entry = pmdp_huge_clear_flush(vma, pvmw.address, pmd); entry = pmd_wrprotect(entry); entry = pmd_mkclean(entry); - set_pmd_at(vma->vm_mm, address, pmd, entry); + set_pmd_at(vma->vm_mm, pvmw.address, pmd, entry); ret = 1; #else /* unexpected pmd-mapped page? */ @@ -926,11 +926,16 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, } if (ret) { - mmu_notifier_invalidate_page(vma->vm_mm, address); (*cleaned)++; + invalidation_needed = true; } } + if (invalidation_needed) { + mmu_notifier_invalidate_range(vma->vm_mm, address, + address + (1UL << compound_order(page))); + } + return true; } @@ -1323,7 +1328,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, }; pte_t pteval; struct page *subpage; - bool ret = true; + bool ret = true, invalidation_needed = false; enum ttu_flags flags = (enum ttu_flags)arg; /* munlock has nothing to gain from examining un-locked vmas */ @@ -1363,11 +1368,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(!pvmw.pte, page); subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); - address = pvmw.address; - if (!(flags & TTU_IGNORE_ACCESS)) { - if (ptep_clear_flush_young_notify(vma, address, + if (ptep_clear_flush_young_notify(vma, pvmw.address, pvmw.pte)) { ret = false; page_vma_mapped_walk_done(&pvmw); @@ -1376,7 +1379,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /* Nuke the page table entry. */ - flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); + flush_cache_page(vma, pvmw.address, pte_pfn(*pvmw.pte)); if (should_defer_flush(mm, flags)) { /* * We clear the PTE but do not flush so potentially @@ -1386,11 +1389,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * transition on a cached TLB entry is written through * and traps if the PTE is unmapped. */ - pteval = ptep_get_and_clear(mm, address, pvmw.pte); + pteval = ptep_get_and_clear(mm, pvmw.address, + pvmw.pte); set_tlb_ubc_flush_pending(mm, pte_dirty(pteval)); } else { - pteval = ptep_clear_flush(vma, address, pvmw.pte); + pteval = ptep_clear_flush(vma, pvmw.address, pvmw.pte); } /* Move the dirty bit to the page. Now the pte is gone. */ @@ -1405,12 +1409,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (PageHuge(page)) { int nr = 1 << compound_order(page); hugetlb_count_sub(nr, mm); - set_huge_swap_pte_at(mm, address, + set_huge_swap_pte_at(mm, pvmw.address, pvmw.pte, pteval, vma_mmu_pagesize(vma)); } else { dec_mm_counter(mm, mm_counter(page)); - set_pte_at(mm, address, pvmw.pte, pteval); + set_pte_at(mm, pvmw.address, pvmw.pte, pteval); } } else if (pte_unused(pteval)) { @@ -1434,7 +1438,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, swp_pte = swp_entry_to_pte(entry); if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); - set_pte_at(mm, address, pvmw.pte, swp_pte); + set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(subpage) }; pte_t swp_pte; @@ -1460,7 +1464,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * If the page was redirtied, it cannot be * discarded. Remap the page to page table. */ - set_pte_at(mm, address, pvmw.pte, pteval); + set_pte_at(mm, pvmw.address, pvmw.pte, pteval); SetPageSwapBacked(page); ret = false; page_vma_mapped_walk_done(&pvmw); @@ -1468,7 +1472,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } if (swap_duplicate(entry) < 0) { - set_pte_at(mm, address, pvmw.pte, pteval); + set_pte_at(mm, pvmw.address, pvmw.pte, pteval); ret = false; page_vma_mapped_walk_done(&pvmw); break; @@ -1484,14 +1488,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, swp_pte = swp_entry_to_pte(entry); if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); - set_pte_at(mm, address, pvmw.pte, swp_pte); + set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); } else dec_mm_counter(mm, mm_counter_file(page)); discard: page_remove_rmap(subpage, PageHuge(page)); put_page(page); - mmu_notifier_invalidate_page(mm, address); + invalidation_needed = true; } + + if (invalidation_needed) + mmu_notifier_invalidate_range(mm, address, + address + (1UL << compound_order(page))); return ret; } -- cgit From 739f79fc9db1b38f96b5a5109b247a650fbebf6d Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 18 Aug 2017 15:15:48 -0700 Subject: mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() Jaegeuk and Brad report a NULL pointer crash when writeback ending tries to update the memcg stats: BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0 IP: test_clear_page_writeback+0x12e/0x2c0 [...] RIP: 0010:test_clear_page_writeback+0x12e/0x2c0 Call Trace: end_page_writeback+0x47/0x70 f2fs_write_end_io+0x76/0x180 [f2fs] bio_endio+0x9f/0x120 blk_update_request+0xa8/0x2f0 scsi_end_request+0x39/0x1d0 scsi_io_completion+0x211/0x690 scsi_finish_command+0xd9/0x120 scsi_softirq_done+0x127/0x150 __blk_mq_complete_request_remote+0x13/0x20 flush_smp_call_function_queue+0x56/0x110 generic_smp_call_function_single_interrupt+0x13/0x30 smp_call_function_single_interrupt+0x27/0x40 call_function_single_interrupt+0x89/0x90 RIP: 0010:native_safe_halt+0x6/0x10 (gdb) l *(test_clear_page_writeback+0x12e) 0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619). 614 mod_node_page_state(page_pgdat(page), idx, val); 615 if (mem_cgroup_disabled() || !page->mem_cgroup) 616 return; 617 mod_memcg_state(page->mem_cgroup, idx, val); 618 pn = page->mem_cgroup->nodeinfo[page_to_nid(page)]; 619 this_cpu_add(pn->lruvec_stat->count[idx], val); 620 } 621 622 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, 623 gfp_t gfp_mask, The issue is that writeback doesn't hold a page reference and the page might get freed after PG_writeback is cleared (and the mapping is unlocked) in test_clear_page_writeback(). The stat functions looking up the page's node or zone are safe, as those attributes are static across allocation and free cycles. But page->mem_cgroup is not, and it will get cleared if we race with truncation or migration. It appears this race window has been around for a while, but less likely to trigger when the memcg stats were updated first thing after PG_writeback is cleared. Recent changes reshuffled this code to update the global node stats before the memcg ones, though, stretching the race window out to an extent where people can reproduce the problem. Update test_clear_page_writeback() to look up and pin page->mem_cgroup before clearing PG_writeback, then not use that pointer afterward. It is a partial revert of 62cccb8c8e7a ("mm: simplify lock_page_memcg()") but leaves the pageref-holding callsites that aren't affected alone. Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org Fixes: 62cccb8c8e7a ("mm: simplify lock_page_memcg()") Signed-off-by: Johannes Weiner Reported-by: Jaegeuk Kim Tested-by: Jaegeuk Kim Reported-by: Bradley Bolen Tested-by: Brad Bolen Cc: Vladimir Davydov Cc: Michal Hocko Cc: [4.6+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 43 +++++++++++++++++++++++++++++++------------ mm/page-writeback.c | 15 ++++++++++++--- 2 files changed, 43 insertions(+), 15 deletions(-) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3df3c04d73ab..e09741af816f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1611,9 +1611,13 @@ cleanup: * @page: the page * * This function protects unlocked LRU pages from being moved to - * another cgroup and stabilizes their page->mem_cgroup binding. + * another cgroup. + * + * It ensures lifetime of the returned memcg. Caller is responsible + * for the lifetime of the page; __unlock_page_memcg() is available + * when @page might get freed inside the locked section. */ -void lock_page_memcg(struct page *page) +struct mem_cgroup *lock_page_memcg(struct page *page) { struct mem_cgroup *memcg; unsigned long flags; @@ -1622,18 +1626,24 @@ void lock_page_memcg(struct page *page) * The RCU lock is held throughout the transaction. The fast * path can get away without acquiring the memcg->move_lock * because page moving starts with an RCU grace period. - */ + * + * The RCU lock also protects the memcg from being freed when + * the page state that is going to change is the only thing + * preventing the page itself from being freed. E.g. writeback + * doesn't hold a page reference and relies on PG_writeback to + * keep off truncation, migration and so forth. + */ rcu_read_lock(); if (mem_cgroup_disabled()) - return; + return NULL; again: memcg = page->mem_cgroup; if (unlikely(!memcg)) - return; + return NULL; if (atomic_read(&memcg->moving_account) <= 0) - return; + return memcg; spin_lock_irqsave(&memcg->move_lock, flags); if (memcg != page->mem_cgroup) { @@ -1649,18 +1659,18 @@ again: memcg->move_lock_task = current; memcg->move_lock_flags = flags; - return; + return memcg; } EXPORT_SYMBOL(lock_page_memcg); /** - * unlock_page_memcg - unlock a page->mem_cgroup binding - * @page: the page + * __unlock_page_memcg - unlock and unpin a memcg + * @memcg: the memcg + * + * Unlock and unpin a memcg returned by lock_page_memcg(). */ -void unlock_page_memcg(struct page *page) +void __unlock_page_memcg(struct mem_cgroup *memcg) { - struct mem_cgroup *memcg = page->mem_cgroup; - if (memcg && memcg->move_lock_task == current) { unsigned long flags = memcg->move_lock_flags; @@ -1672,6 +1682,15 @@ void unlock_page_memcg(struct page *page) rcu_read_unlock(); } + +/** + * unlock_page_memcg - unlock a page->mem_cgroup binding + * @page: the page + */ +void unlock_page_memcg(struct page *page) +{ + __unlock_page_memcg(page->mem_cgroup); +} EXPORT_SYMBOL(unlock_page_memcg); /* diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 96e93b214d31..bf050ab025b7 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2724,9 +2724,12 @@ EXPORT_SYMBOL(clear_page_dirty_for_io); int test_clear_page_writeback(struct page *page) { struct address_space *mapping = page_mapping(page); + struct mem_cgroup *memcg; + struct lruvec *lruvec; int ret; - lock_page_memcg(page); + memcg = lock_page_memcg(page); + lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); if (mapping && mapping_use_writeback_tags(mapping)) { struct inode *inode = mapping->host; struct backing_dev_info *bdi = inode_to_bdi(inode); @@ -2754,12 +2757,18 @@ int test_clear_page_writeback(struct page *page) } else { ret = TestClearPageWriteback(page); } + /* + * NOTE: Page might be free now! Writeback doesn't hold a page + * reference on its own, it relies on truncation to wait for + * the clearing of PG_writeback. The below can only access + * page state that is static across allocation cycles. + */ if (ret) { - dec_lruvec_page_state(page, NR_WRITEBACK); + dec_lruvec_state(lruvec, NR_WRITEBACK); dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); inc_node_page_state(page, NR_WRITTEN); } - unlock_page_memcg(page); + __unlock_page_memcg(memcg); return ret; } -- cgit From 3010f876500f9ba921afaeccec30c45ca6584dc8 Mon Sep 17 00:00:00 2001 From: Pavel Tatashin Date: Fri, 18 Aug 2017 15:16:05 -0700 Subject: mm: discard memblock data later There is existing use after free bug when deferred struct pages are enabled: The memblock_add() allocates memory for the memory array if more than 128 entries are needed. See comment in e820__memblock_setup(): * The bootstrap memblock region count maximum is 128 entries * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries * than that - so allow memblock resizing. This memblock memory is freed here: free_low_memory_core_early() We access the freed memblock.memory later in boot when deferred pages are initialized in this path: deferred_init_memmap() for_each_mem_pfn_range() __next_mem_pfn_range() type = &memblock.memory; One possible explanation for why this use-after-free hasn't been hit before is that the limit of INIT_MEMBLOCK_REGIONS has never been exceeded at least on systems where deferred struct pages were enabled. Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128, and verifying in qemu that this code is getting excuted and that the freed pages are sane. Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd") Signed-off-by: Pavel Tatashin Reviewed-by: Steven Sistare Reviewed-by: Daniel Jordan Reviewed-by: Bob Picco Acked-by: Michal Hocko Cc: Mel Gorman Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memblock.c | 38 +++++++++++++++++--------------------- mm/nobootmem.c | 16 ---------------- mm/page_alloc.c | 4 ++++ 3 files changed, 21 insertions(+), 37 deletions(-) (limited to 'mm') diff --git a/mm/memblock.c b/mm/memblock.c index 2cb25fe4452c..bf14aea6ab70 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -285,31 +285,27 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u } #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK - -phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info( - phys_addr_t *addr) -{ - if (memblock.reserved.regions == memblock_reserved_init_regions) - return 0; - - *addr = __pa(memblock.reserved.regions); - - return PAGE_ALIGN(sizeof(struct memblock_region) * - memblock.reserved.max); -} - -phys_addr_t __init_memblock get_allocated_memblock_memory_regions_info( - phys_addr_t *addr) +/** + * Discard memory and reserved arrays if they were allocated + */ +void __init memblock_discard(void) { - if (memblock.memory.regions == memblock_memory_init_regions) - return 0; + phys_addr_t addr, size; - *addr = __pa(memblock.memory.regions); + if (memblock.reserved.regions != memblock_reserved_init_regions) { + addr = __pa(memblock.reserved.regions); + size = PAGE_ALIGN(sizeof(struct memblock_region) * + memblock.reserved.max); + __memblock_free_late(addr, size); + } - return PAGE_ALIGN(sizeof(struct memblock_region) * - memblock.memory.max); + if (memblock.memory.regions == memblock_memory_init_regions) { + addr = __pa(memblock.memory.regions); + size = PAGE_ALIGN(sizeof(struct memblock_region) * + memblock.memory.max); + __memblock_free_late(addr, size); + } } - #endif /** diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 36454d0f96ee..3637809a18d0 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -146,22 +146,6 @@ static unsigned long __init free_low_memory_core_early(void) NULL) count += __free_memory_core(start, end); -#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK - { - phys_addr_t size; - - /* Free memblock.reserved array if it was allocated */ - size = get_allocated_memblock_reserved_regions_info(&start); - if (size) - count += __free_memory_core(start, start + size); - - /* Free memblock.memory array if it was allocated */ - size = get_allocated_memblock_memory_regions_info(&start); - if (size) - count += __free_memory_core(start, start + size); - } -#endif - return count; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6d00f746c2fd..1bad301820c7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1584,6 +1584,10 @@ void __init page_alloc_init_late(void) /* Reinit limits that are based on free pages after the kernel is up */ files_maxfiles_init(); #endif +#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK + /* Discard memblock private memory */ + memblock_discard(); +#endif for_each_populated_zone(zone) set_zone_contiguous(zone); -- cgit From f6ba488073fe8159851fe398cc3c5ee383bb4c7a Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Fri, 18 Aug 2017 15:16:08 -0700 Subject: slub: fix per memcg cache leak on css offline To avoid a possible deadlock, sysfs_slab_remove() schedules an asynchronous work to delete sysfs entries corresponding to the kmem cache. To ensure the cache isn't freed before the work function is called, it takes a reference to the cache kobject. The reference is supposed to be released by the work function. However, the work function (sysfs_slab_remove_workfn()) does nothing in case the cache sysfs entry has already been deleted, leaking the kobject and the corresponding cache. This may happen on a per memcg cache destruction, because sysfs entries of a per memcg cache are deleted on memcg offline if the cache is empty (see __kmemcg_cache_deactivate()). The kmemleak report looks like this: unreferenced object 0xffff9f798a79f540 (size 32): comm "kworker/1:4", pid 15416, jiffies 4307432429 (age 28687.554s) hex dump (first 32 bytes): 6b 6d 61 6c 6c 6f 63 2d 31 36 28 31 35 39 39 3a kmalloc-16(1599: 6e 65 77 72 6f 6f 74 29 00 23 6b c0 ff ff ff ff newroot).#k..... backtrace: kmemleak_alloc+0x4a/0xa0 __kmalloc_track_caller+0x148/0x2c0 kvasprintf+0x66/0xd0 kasprintf+0x49/0x70 memcg_create_kmem_cache+0xe6/0x160 memcg_kmem_cache_create_func+0x20/0x110 process_one_work+0x205/0x5d0 worker_thread+0x4e/0x3a0 kthread+0x109/0x140 ret_from_fork+0x2a/0x40 unreferenced object 0xffff9f79b6136840 (size 416): comm "kworker/1:4", pid 15416, jiffies 4307432429 (age 28687.573s) hex dump (first 32 bytes): 40 fb 80 c2 3e 33 00 00 00 00 00 40 00 00 00 00 @...>3.....@.... 00 00 00 00 00 00 00 00 10 00 00 00 10 00 00 00 ................ backtrace: kmemleak_alloc+0x4a/0xa0 kmem_cache_alloc+0x128/0x280 create_cache+0x3b/0x1e0 memcg_create_kmem_cache+0x118/0x160 memcg_kmem_cache_create_func+0x20/0x110 process_one_work+0x205/0x5d0 worker_thread+0x4e/0x3a0 kthread+0x109/0x140 ret_from_fork+0x2a/0x40 Fix the leak by adding the missing call to kobject_put() to sysfs_slab_remove_workfn(). Link: http://lkml.kernel.org/r/20170812181134.25027-1-vdavydov.dev@gmail.com Fixes: 3b7b314053d02 ("slub: make sysfs file removal asynchronous") Signed-off-by: Vladimir Davydov Reported-by: Andrei Vagin Tested-by: Andrei Vagin Acked-by: Tejun Heo Acked-by: David Rientjes Cc: Michal Hocko Cc: Johannes Weiner Cc: Christoph Lameter Cc: Pekka Enberg Cc: Joonsoo Kim Cc: [4.12.x] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/slub.c b/mm/slub.c index 1d3f9835f4ea..e8b4e31162ca 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -5642,13 +5642,14 @@ static void sysfs_slab_remove_workfn(struct work_struct *work) * A cache is never shut down before deactivation is * complete, so no need to worry about synchronization. */ - return; + goto out; #ifdef CONFIG_MEMCG kset_unregister(s->memcg_kset); #endif kobject_uevent(&s->kobj, KOBJ_REMOVE); kobject_del(&s->kobj); +out: kobject_put(&s->kobj); } -- cgit From 5b53a6ea886700a128b697a6fe8375340dea2c30 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 18 Aug 2017 15:16:12 -0700 Subject: mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS Tetsuo Handa has noticed that MMF_UNSTABLE SIGBUS path in handle_mm_fault causes a lockdep splat Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB a.out (1169) used greatest stack depth: 11664 bytes left DEBUG_LOCKS_WARN_ON(depth <= 0) ------------[ cut here ]------------ WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0 CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 RIP: 0010:lock_release+0x172/0x1e0 Call Trace: up_read+0x1a/0x40 __do_page_fault+0x28e/0x4c0 do_page_fault+0x30/0x80 page_fault+0x28/0x30 The reason is that the page fault path might have dropped the mmap_sem and returned with VM_FAULT_RETRY. MMF_UNSTABLE check however rewrites the error path to VM_FAULT_SIGBUS and we always expect mmap_sem taken in that path. Fix this by taking mmap_sem when VM_FAULT_RETRY is held in the MMF_UNSTABLE path. We cannot simply add VM_FAULT_SIGBUS to the existing error code because all arch specific page fault handlers and g-u-p would have to learn a new error code combination. Link: http://lkml.kernel.org/r/20170807113839.16695-2-mhocko@kernel.org Fixes: 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory") Reported-by: Tetsuo Handa Signed-off-by: Michal Hocko Acked-by: David Rientjes Cc: Andrea Argangeli Cc: "Kirill A. Shutemov" Cc: Oleg Nesterov Cc: Wenwei Tao Cc: [4.9+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index e158f7ac6730..c717b5bcc80e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3910,8 +3910,18 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address, * further. */ if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR) - && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) + && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) { + + /* + * We are going to enforce SIGBUS but the PF path might have + * dropped the mmap_sem already so take it again so that + * we do not break expectations of all arch specific PF paths + * and g-u-p + */ + if (ret & VM_FAULT_RETRY) + down_read(&vma->vm_mm->mmap_sem); ret = VM_FAULT_SIGBUS; + } return ret; } -- cgit From 6b31d5955cb29a51c5baffee382f213d75e98fb8 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 18 Aug 2017 15:16:15 -0700 Subject: mm, oom: fix potential data corruption when oom_reaper races with writer Wenwei Tao has noticed that our current assumption that the oom victim is dying and never doing any visible changes after it dies, and so the oom_reaper can tear it down, is not entirely true. __task_will_free_mem consider a task dying when SIGNAL_GROUP_EXIT is set but do_group_exit sends SIGKILL to all threads _after_ the flag is set. So there is a race window when some threads won't have fatal_signal_pending while the oom_reaper could start unmapping the address space. Moreover some paths might not check for fatal signals before each PF/g-u-p/copy_from_user. We already have a protection for oom_reaper vs. PF races by checking MMF_UNSTABLE. This has been, however, checked only for kernel threads (use_mm users) which can outlive the oom victim. A simple fix would be to extend the current check in handle_mm_fault for all tasks but that wouldn't be sufficient because the current check assumes that a kernel thread would bail out after EFAULT from get_user*/copy_from_user and never re-read the same address which would succeed because the PF path has established page tables already. This seems to be the case for the only existing use_mm user currently (virtio driver) but it is rather fragile in general. This is even more fragile in general for more complex paths such as generic_perform_write which can re-read the same address more times (e.g. iov_iter_copy_from_user_atomic to fail and then iov_iter_fault_in_readable on retry). Therefore we have to implement MMF_UNSTABLE protection in a robust way and never make a potentially corrupted content visible. That requires to hook deeper into the PF path and check for the flag _every time_ before a pte for anonymous memory is established (that means all !VM_SHARED mappings). The corruption can be triggered artificially (http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp) but there doesn't seem to be any real life bug report. The race window should be quite tight to trigger most of the time. Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org Fixes: aac453635549 ("mm, oom: introduce oom reaper") Signed-off-by: Michal Hocko Reported-by: Wenwei Tao Tested-by: Tetsuo Handa Cc: "Kirill A. Shutemov" Cc: Andrea Argangeli Cc: David Rientjes Cc: Oleg Nesterov Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 30 ++++++++++++++++++++++-------- mm/memory.c | 46 ++++++++++++++++++++-------------------------- 2 files changed, 42 insertions(+), 34 deletions(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 216114f6ef0b..90731e3b7e58 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -32,6 +32,7 @@ #include #include #include +#include #include #include @@ -550,6 +551,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page, struct mem_cgroup *memcg; pgtable_t pgtable; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; + int ret = 0; VM_BUG_ON_PAGE(!PageCompound(page), page); @@ -561,9 +563,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page, pgtable = pte_alloc_one(vma->vm_mm, haddr); if (unlikely(!pgtable)) { - mem_cgroup_cancel_charge(page, memcg, true); - put_page(page); - return VM_FAULT_OOM; + ret = VM_FAULT_OOM; + goto release; } clear_huge_page(page, haddr, HPAGE_PMD_NR); @@ -576,13 +577,14 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page, vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - mem_cgroup_cancel_charge(page, memcg, true); - put_page(page); - pte_free(vma->vm_mm, pgtable); + goto unlock_release; } else { pmd_t entry; + ret = check_stable_address_space(vma->vm_mm); + if (ret) + goto unlock_release; + /* Deliver the page fault to userland */ if (userfaultfd_missing(vma)) { int ret; @@ -610,6 +612,15 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page, } return 0; +unlock_release: + spin_unlock(vmf->ptl); +release: + if (pgtable) + pte_free(vma->vm_mm, pgtable); + mem_cgroup_cancel_charge(page, memcg, true); + put_page(page); + return ret; + } /* @@ -688,7 +699,10 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf) ret = 0; set = false; if (pmd_none(*vmf->pmd)) { - if (userfaultfd_missing(vma)) { + ret = check_stable_address_space(vma->vm_mm); + if (ret) { + spin_unlock(vmf->ptl); + } else if (userfaultfd_missing(vma)) { spin_unlock(vmf->ptl); ret = handle_userfault(vmf, VM_UFFD_MISSING); VM_BUG_ON(ret & VM_FAULT_FALLBACK); diff --git a/mm/memory.c b/mm/memory.c index c717b5bcc80e..fe2fba27ded2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -68,6 +68,7 @@ #include #include #include +#include #include #include @@ -2893,6 +2894,7 @@ static int do_anonymous_page(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct mem_cgroup *memcg; struct page *page; + int ret = 0; pte_t entry; /* File mapping without ->vm_ops ? */ @@ -2925,6 +2927,9 @@ static int do_anonymous_page(struct vm_fault *vmf) vmf->address, &vmf->ptl); if (!pte_none(*vmf->pte)) goto unlock; + ret = check_stable_address_space(vma->vm_mm); + if (ret) + goto unlock; /* Deliver the page fault to userland, check inside PT lock */ if (userfaultfd_missing(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -2959,6 +2964,10 @@ static int do_anonymous_page(struct vm_fault *vmf) if (!pte_none(*vmf->pte)) goto release; + ret = check_stable_address_space(vma->vm_mm); + if (ret) + goto release; + /* Deliver the page fault to userland, check inside PT lock */ if (userfaultfd_missing(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -2978,7 +2987,7 @@ setpte: update_mmu_cache(vma, vmf->address, vmf->pte); unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; + return ret; release: mem_cgroup_cancel_charge(page, memcg, false); put_page(page); @@ -3252,7 +3261,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg, int finish_fault(struct vm_fault *vmf) { struct page *page; - int ret; + int ret = 0; /* Did we COW the page? */ if ((vmf->flags & FAULT_FLAG_WRITE) && @@ -3260,7 +3269,15 @@ int finish_fault(struct vm_fault *vmf) page = vmf->cow_page; else page = vmf->page; - ret = alloc_set_pte(vmf, vmf->memcg, page); + + /* + * check even for read faults because we might have lost our CoWed + * page + */ + if (!(vmf->vma->vm_flags & VM_SHARED)) + ret = check_stable_address_space(vmf->vma->vm_mm); + if (!ret) + ret = alloc_set_pte(vmf, vmf->memcg, page); if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); return ret; @@ -3900,29 +3917,6 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address, mem_cgroup_oom_synchronize(false); } - /* - * This mm has been already reaped by the oom reaper and so the - * refault cannot be trusted in general. Anonymous refaults would - * lose data and give a zero page instead e.g. This is especially - * problem for use_mm() because regular tasks will just die and - * the corrupted data will not be visible anywhere while kthread - * will outlive the oom victim and potentially propagate the date - * further. - */ - if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR) - && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) { - - /* - * We are going to enforce SIGBUS but the PF path might have - * dropped the mmap_sem already so take it again so that - * we do not break expectations of all arch specific PF paths - * and g-u-p - */ - if (ret & VM_FAULT_RETRY) - down_read(&vma->vm_mm->mmap_sem); - ret = VM_FAULT_SIGBUS; - } - return ret; } EXPORT_SYMBOL_GPL(handle_mm_fault); -- cgit From da094e42848e3c36feaa3b5271e53983fd45424f Mon Sep 17 00:00:00 2001 From: Prakash Gupta Date: Fri, 18 Aug 2017 15:16:21 -0700 Subject: mm/cma_debug.c: fix stack corruption due to sprintf usage name[] in cma_debugfs_add_one() can only accommodate 16 chars including NULL to store sprintf output. It's common for cma device name to be larger than 15 chars. This can cause stack corrpution. If the gcc stack protector is turned on, this can cause a panic due to stack corruption. Below is one example trace: Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffff8e69a75730 Call trace: dump_backtrace+0x0/0x2c4 show_stack+0x20/0x28 dump_stack+0xb8/0xf4 panic+0x154/0x2b0 print_tainted+0x0/0xc0 cma_debugfs_init+0x274/0x290 do_one_initcall+0x5c/0x168 kernel_init_freeable+0x1c8/0x280 Fix the short sprintf buffer in cma_debugfs_add_one() by using scnprintf() instead of sprintf(). Link: http://lkml.kernel.org/r/1502446217-21840-1-git-send-email-guptap@codeaurora.org Fixes: f318dd083c81 ("cma: Store a name in the cma structure") Signed-off-by: Prakash Gupta Acked-by: Laura Abbott Cc: Greg Kroah-Hartman Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/cma_debug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/cma_debug.c b/mm/cma_debug.c index 595b757bef72..c03ccbc405a0 100644 --- a/mm/cma_debug.c +++ b/mm/cma_debug.c @@ -167,7 +167,7 @@ static void cma_debugfs_add_one(struct cma *cma, int idx) char name[16]; int u32s; - sprintf(name, "cma-%s", cma->name); + scnprintf(name, sizeof(name), "cma-%s", cma->name); tmp = debugfs_create_dir(name, cma_debugfs_root); -- cgit From 73223e4e2e3867ebf033a5a8eb2e5df0158ccc99 Mon Sep 17 00:00:00 2001 From: zhong jiang Date: Fri, 18 Aug 2017 15:16:24 -0700 Subject: mm/mempolicy: fix use after free when calling get_mempolicy I hit a use after free issue when executing trinity and repoduced it with KASAN enabled. The related call trace is as follows. BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766 Read of size 2 by task syz-executor1/798 INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799 __slab_alloc+0x768/0x970 kmem_cache_alloc+0x2e7/0x450 mpol_new.part.2+0x74/0x160 mpol_new+0x66/0x80 SyS_mbind+0x267/0x9f0 system_call_fastpath+0x16/0x1b INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799 __slab_free+0x495/0x8e0 kmem_cache_free+0x2f3/0x4c0 __mpol_put+0x2b/0x40 SyS_mbind+0x383/0x9f0 system_call_fastpath+0x16/0x1b INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080 INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600 Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkk. Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb ........ Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ Memory state around the buggy address: ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc !shared memory policy is not protected against parallel removal by other thread which is normally protected by the mmap_sem. do_get_mempolicy, however, drops the lock midway while we can still access it later. Early premature up_read is a historical artifact from times when put_user was called in this path see https://lwn.net/Articles/124754/ but that is gone since 8bccd85ffbaf ("[PATCH] Implement sys_* do_* layering in the memory policy layer."). but when we have the the current mempolicy ref count model. The issue was introduced accordingly. Fix the issue by removing the premature release. Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang Acked-by: Michal Hocko Cc: Minchan Kim Cc: Vlastimil Babka Cc: David Rientjes Cc: Mel Gorman Cc: [2.6+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'mm') diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d911fa5cb2a7..618ab125228b 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -861,11 +861,6 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, *policy |= (pol->flags & MPOL_MODE_FLAGS); } - if (vma) { - up_read(¤t->mm->mmap_sem); - vma = NULL; - } - err = 0; if (nmask) { if (mpol_store_user_nodemask(pol)) { -- cgit From 704b862f9efd6d4c87a8d0a344dda19bda9c6b69 Mon Sep 17 00:00:00 2001 From: Laura Abbott Date: Fri, 18 Aug 2017 15:16:27 -0700 Subject: mm/vmalloc.c: don't unconditonally use __GFP_HIGHMEM Commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly") added use of __GFP_HIGHMEM for allocations. vmalloc_32 may use GFP_DMA/GFP_DMA32 which does not play nice with __GFP_HIGHMEM and will trigger a BUG in gfp_zone. Only add __GFP_HIGHMEM if we aren't using GFP_DMA/GFP_DMA32. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1482249 Link: http://lkml.kernel.org/r/20170816220705.31374-1-labbott@redhat.com Fixes: 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly") Signed-off-by: Laura Abbott Acked-by: Michal Hocko Cc: Vlastimil Babka Cc: "Kirill A. Shutemov" Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 8698c1c86c4d..a47e3894c775 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1671,7 +1671,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, struct page **pages; unsigned int nr_pages, array_size, i; const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; - const gfp_t alloc_mask = gfp_mask | __GFP_HIGHMEM | __GFP_NOWARN; + const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN; + const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ? + 0 : + __GFP_HIGHMEM; nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; array_size = (nr_pages * sizeof(struct page *)); @@ -1679,7 +1682,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, area->nr_pages = nr_pages; /* Please note that the recursion is strictly bounded. */ if (array_size > PAGE_SIZE) { - pages = __vmalloc_node(array_size, 1, nested_gfp|__GFP_HIGHMEM, + pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask, PAGE_KERNEL, node, area->caller); } else { pages = kmalloc_node(array_size, nested_gfp, node); @@ -1700,9 +1703,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, } if (node == NUMA_NO_NODE) - page = alloc_page(alloc_mask); + page = alloc_page(alloc_mask|highmem_mask); else - page = alloc_pages_node(node, alloc_mask, 0); + page = alloc_pages_node(node, alloc_mask|highmem_mask, 0); if (unlikely(!page)) { /* Successfully allocated i pages, free them in __vunmap() */ @@ -1710,7 +1713,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, goto fail; } area->pages[i] = page; - if (gfpflags_allow_blocking(gfp_mask)) + if (gfpflags_allow_blocking(gfp_mask|highmem_mask)) cond_resched(); } -- cgit From 197e7e521384a23b9e585178f3f11c9fa08274b9 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 20 Aug 2017 13:26:27 -0700 Subject: Sanitize 'move_pages()' permission checks The 'move_paghes()' system call was introduced long long ago with the same permission checks as for sending a signal (except using CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability). That turns out to not be a great choice - while the system call really only moves physical page allocations around (and you need other capabilities to do a lot of it), you can check the return value to map out some the virtual address choices and defeat ASLR of a binary that still shares your uid. So change the access checks to the more common 'ptrace_may_access()' model instead. This tightens the access checks for the uid, and also effectively changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that anybody really _uses_ this legacy system call any more (we hav ebetter NUMA placement models these days), so I expect nobody to notice. Famous last words. Reported-by: Otto Ebeling Acked-by: Eric W. Biederman Cc: Willy Tarreau Cc: stable@kernel.org Signed-off-by: Linus Torvalds --- mm/migrate.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) (limited to 'mm') diff --git a/mm/migrate.c b/mm/migrate.c index d68a41da6abb..e84eeb4e4356 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -41,6 +41,7 @@ #include #include #include +#include #include @@ -1652,7 +1653,6 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages, const int __user *, nodes, int __user *, status, int, flags) { - const struct cred *cred = current_cred(), *tcred; struct task_struct *task; struct mm_struct *mm; int err; @@ -1676,14 +1676,9 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages, /* * Check if this process has the right to modify the specified - * process. The right exists if the process has administrative - * capabilities, superuser privileges or the same - * userid as the target process. + * process. Use the regular "ptrace_may_access()" checks. */ - tcred = __task_cred(task); - if (!uid_eq(cred->euid, tcred->suid) && !uid_eq(cred->euid, tcred->uid) && - !uid_eq(cred->uid, tcred->suid) && !uid_eq(cred->uid, tcred->uid) && - !capable(CAP_SYS_NICE)) { + if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) { rcu_read_unlock(); err = -EPERM; goto out; -- cgit From 556b969a1cfe2686aae149137fa1dfcac0eefe54 Mon Sep 17 00:00:00 2001 From: Chen Yu Date: Fri, 25 Aug 2017 15:55:30 -0700 Subject: PM/hibernate: touch NMI watchdog when creating snapshot There is a problem that when counting the pages for creating the hibernation snapshot will take significant amount of time, especially on system with large memory. Since the counting job is performed with irq disabled, this might lead to NMI lockup. The following warning were found on a system with 1.5TB DRAM: Freezing user space processes ... (elapsed 0.002 seconds) done. OOM killer disabled. PM: Preallocating image memory... NMI watchdog: Watchdog detected hard LOCKUP on cpu 27 CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1 task: ffff9f01971ac000 task.stack: ffffb1a3f325c000 RIP: 0010:memory_bm_find_bit+0xf4/0x100 Call Trace: swsusp_set_page_free+0x2b/0x30 mark_free_pages+0x147/0x1c0 count_data_pages+0x41/0xa0 hibernate_preallocate_memory+0x80/0x450 hibernation_snapshot+0x58/0x410 hibernate+0x17c/0x310 state_store+0xdf/0xf0 kobj_attr_store+0xf/0x20 sysfs_kf_write+0x37/0x40 kernfs_fop_write+0x11c/0x1a0 __vfs_write+0x37/0x170 vfs_write+0xb1/0x1a0 SyS_write+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xa5 ... done (allocated 6590003 pages) PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s) It has taken nearly 20 seconds(2.10GHz CPU) thus the NMI lockup was triggered. In case the timeout of the NMI watch dog has been set to 1 second, a safe interval should be 6590003/20 = 320k pages in theory. However there might also be some platforms running at a lower frequency, so feed the watchdog every 100k pages. [yu.c.chen@intel.com: simplification] Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com [yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus] Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.com Signed-off-by: Chen Yu Reported-by: Jan Filipcewicz Suggested-by: Michal Hocko Reviewed-by: Michal Hocko Acked-by: Rafael J. Wysocki Cc: Mel Gorman Cc: Vlastimil Babka Cc: Len Brown Cc: Dan Williams Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1bad301820c7..7a58eb5757e3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -66,6 +66,7 @@ #include #include #include +#include #include #include @@ -2535,9 +2536,14 @@ void drain_all_pages(struct zone *zone) #ifdef CONFIG_HIBERNATION +/* + * Touch the watchdog for every WD_PAGE_COUNT pages. + */ +#define WD_PAGE_COUNT (128*1024) + void mark_free_pages(struct zone *zone) { - unsigned long pfn, max_zone_pfn; + unsigned long pfn, max_zone_pfn, page_count = WD_PAGE_COUNT; unsigned long flags; unsigned int order, t; struct page *page; @@ -2552,6 +2558,11 @@ void mark_free_pages(struct zone *zone) if (pfn_valid(pfn)) { page = pfn_to_page(pfn); + if (!--page_count) { + touch_nmi_watchdog(); + page_count = WD_PAGE_COUNT; + } + if (page_zone(page) != zone) continue; @@ -2565,8 +2576,13 @@ void mark_free_pages(struct zone *zone) unsigned long i; pfn = page_to_pfn(page); - for (i = 0; i < (1UL << order); i++) + for (i = 0; i < (1UL << order); i++) { + if (!--page_count) { + touch_nmi_watchdog(); + page_count = WD_PAGE_COUNT; + } swsusp_set_page_free(pfn_to_page(pfn + i)); + } } } spin_unlock_irqrestore(&zone->lock, flags); -- cgit From 435c0b87d661da83771c30ed775f7c37eed193fb Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 25 Aug 2017 15:55:33 -0700 Subject: mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled /sys/kernel/mm/transparent_hugepage/shmem_enabled controls if we want to allocate huge pages when allocate pages for private in-kernel shmem mount. Unfortunately, as Dan noticed, I've screwed it up and the only way to make kernel allocate huge page for the mount is to use "force" there. All other values will be effectively ignored. Link: http://lkml.kernel.org/r/20170822144254.66431-1-kirill.shutemov@linux.intel.com Fixes: 5a6e75f8110c ("shmem: prepare huge= mount option and sysfs knob") Signed-off-by: Kirill A. Shutemov Reported-by: Dan Carpenter Cc: stable [4.8+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/shmem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/shmem.c b/mm/shmem.c index 6540e5982444..fbcb3c96a186 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3967,7 +3967,7 @@ int __init shmem_init(void) } #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE - if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY) + if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY) SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; else shmem_huge = 0; /* just in case it was patched */ @@ -4028,7 +4028,7 @@ static ssize_t shmem_enabled_store(struct kobject *kobj, return -EINVAL; shmem_huge = huge; - if (shmem_huge < SHMEM_HUGE_DENY) + if (shmem_huge > SHMEM_HUGE_DENY) SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; return count; } -- cgit From 263630e8d176d87308481ebdcd78ef9426739c6b Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Fri, 25 Aug 2017 15:55:39 -0700 Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE If madvise(..., MADV_FREE) split a transparent hugepage, it called put_page() before unlock_page(). This was wrong because put_page() can free the page, e.g. if a concurrent madvise(..., MADV_DONTNEED) has removed it from the memory mapping. put_page() then rightfully complained about freeing a locked page. Fix this by moving the unlock_page() before put_page(). This bug was found by syzkaller, which encountered the following splat: BUG: Bad page state in process syzkaller412798 pfn:1bd800 page:ffffea0006f60000 count:0 mapcount:0 mapping: (null) index:0x20a00 flags: 0x200000000040019(locked|uptodate|dirty|swapbacked) raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: 0x1(locked) Modules linked in: CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 bad_page+0x230/0x2b0 mm/page_alloc.c:565 free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943 free_pages_check mm/page_alloc.c:952 [inline] free_pages_prepare mm/page_alloc.c:1043 [inline] free_pcp_prepare mm/page_alloc.c:1068 [inline] free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584 __put_single_page mm/swap.c:79 [inline] __put_page+0xfb/0x160 mm/swap.c:113 put_page include/linux/mm.h:814 [inline] madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371 walk_pmd_range mm/pagewalk.c:50 [inline] walk_pud_range mm/pagewalk.c:108 [inline] walk_p4d_range mm/pagewalk.c:134 [inline] walk_pgd_range mm/pagewalk.c:160 [inline] __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249 walk_page_range+0x200/0x470 mm/pagewalk.c:326 madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444 madvise_free_single_vma+0x353/0x580 mm/madvise.c:471 madvise_dontneed_free mm/madvise.c:555 [inline] madvise_vma mm/madvise.c:664 [inline] SYSC_madvise mm/madvise.c:832 [inline] SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760 entry_SYSCALL_64_fastpath+0x1f/0xbe Here is a C reproducer: #define _GNU_SOURCE #include #include #include #define MADV_FREE 8 #define PAGE_SIZE 4096 static void *mapping; static const size_t mapping_size = 0x1000000; static void *madvise_thrproc(void *arg) { madvise(mapping, mapping_size, (long)arg); } int main(void) { pthread_t t[2]; for (;;) { mapping = mmap(NULL, mapping_size, PROT_WRITE, MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); munmap(mapping + mapping_size / 2, PAGE_SIZE); pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED); pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE); pthread_join(t[0], NULL); pthread_join(t[1], NULL); munmap(mapping, mapping_size); } } Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and CONFIG_DEBUG_VM=y are needed. Google Bug Id: 64696096 Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)") Signed-off-by: Eric Biggers Acked-by: David Rientjes Acked-by: Minchan Kim Acked-by: Michal Hocko Cc: Dmitry Vyukov Cc: Hugh Dickins Cc: Andrea Arcangeli Cc: [v4.5+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/madvise.c b/mm/madvise.c index 47d8d8a25eae..23ed525bc2bc 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -368,8 +368,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, pte_offset_map_lock(mm, pmd, addr, &ptl); goto out; } - put_page(page); unlock_page(page); + put_page(page); pte = pte_offset_map_lock(mm, pmd, addr, &ptl); pte--; addr -= PAGE_SIZE; -- cgit From 91b540f98872a206ea1c49e4aa6ea8eed0886644 Mon Sep 17 00:00:00 2001 From: Pavel Tatashin Date: Fri, 25 Aug 2017 15:55:46 -0700 Subject: mm/memblock.c: reversed logic in memblock_discard() In recently introduced memblock_discard() there is a reversed logic bug. Memory is freed of static array instead of dynamically allocated one. Link: http://lkml.kernel.org/r/1503511441-95478-2-git-send-email-pasha.tatashin@oracle.com Fixes: 3010f876500f ("mm: discard memblock data later") Signed-off-by: Pavel Tatashin Reported-by: Woody Suwalski Tested-by: Woody Suwalski Acked-by: Michal Hocko Cc: Vlastimil Babka Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memblock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/memblock.c b/mm/memblock.c index bf14aea6ab70..91205780e6b1 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -299,7 +299,7 @@ void __init memblock_discard(void) __memblock_free_late(addr, size); } - if (memblock.memory.regions == memblock_memory_init_regions) { + if (memblock.memory.regions != memblock_memory_init_regions) { addr = __pa(memblock.memory.regions); size = PAGE_ALIGN(sizeof(struct memblock_region) * memblock.memory.max); -- cgit From 3510ca20ece0150af6b10c77a74ff1b5c198e3e2 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 27 Aug 2017 13:55:12 -0700 Subject: Minor page waitqueue cleanups Tim Chen and Kan Liang have been battling a customer load that shows extremely long page wakeup lists. The cause seems to be constant NUMA migration of a hot page that is shared across a lot of threads, but the actual root cause for the exact behavior has not been found. Tim has a patch that batches the wait list traversal at wakeup time, so that we at least don't get long uninterruptible cases where we traverse and wake up thousands of processes and get nasty latency spikes. That is likely 4.14 material, but we're still discussing the page waitqueue specific parts of it. In the meantime, I've tried to look at making the page wait queues less expensive, and failing miserably. If you have thousands of threads waiting for the same page, it will be painful. We'll need to try to figure out the NUMA balancing issue some day, in addition to avoiding the excessive spinlock hold times. That said, having tried to rewrite the page wait queues, I can at least fix up some of the braindamage in the current situation. In particular: (a) we don't want to continue walking the page wait list if the bit we're waiting for already got set again (which seems to be one of the patterns of the bad load). That makes no progress and just causes pointless cache pollution chasing the pointers. (b) we don't want to put the non-locking waiters always on the front of the queue, and the locking waiters always on the back. Not only is that unfair, it means that we wake up thousands of reading threads that will just end up being blocked by the writer later anyway. Also add a comment about the layout of 'struct wait_page_key' - there is an external user of it in the cachefiles code that means that it has to match the layout of 'struct wait_bit_key' in the two first members. It so happens to match, because 'struct page *' and 'unsigned long *' end up having the same values simply because the page flags are the first member in struct page. Cc: Tim Chen Cc: Kan Liang Cc: Mel Gorman Cc: Christopher Lameter Cc: Andi Kleen Cc: Davidlohr Bueso Cc: Peter Zijlstra Signed-off-by: Linus Torvalds --- mm/filemap.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index a49702445ce0..baba290c276b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -885,6 +885,7 @@ void __init pagecache_init(void) page_writeback_init(); } +/* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */ struct wait_page_key { struct page *page; int bit_nr; @@ -909,8 +910,10 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, if (wait_page->bit_nr != key->bit_nr) return 0; + + /* Stop walking if it's locked */ if (test_bit(key->bit_nr, &key->page->flags)) - return 0; + return -1; return autoremove_wake_function(wait, mode, sync, key); } @@ -964,6 +967,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, int ret = 0; init_wait(wait); + wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0; wait->func = wake_page_function; wait_page.page = page; wait_page.bit_nr = bit_nr; @@ -972,10 +976,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, spin_lock_irq(&q->lock); if (likely(list_empty(&wait->entry))) { - if (lock) - __add_wait_queue_entry_tail_exclusive(q, wait); - else - __add_wait_queue(q, wait); + __add_wait_queue_entry_tail(q, wait); SetPageWaiters(page); } -- cgit From a8b169afbf06a678437632709caac98e16f99263 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 27 Aug 2017 16:25:09 -0700 Subject: Avoid page waitqueue race leaving possible page locker waiting The "lock_page_killable()" function waits for exclusive access to the page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry set. That means that if it gets woken up, other waiters may have been skipped. That, in turn, means that if it sees the page being unlocked, it *must* take that lock and return success, even if a lethal signal is also pending. So instead of checking for lethal signals first, we need to check for them after we've checked the actual bit that we were waiting for. Even if that might then delay the killing of the process. This matches the order of the old "wait_on_bit_lock()" infrastructure that the page locking used to use (and is still used in a few other areas). Note that if we still return an error after having unsuccessfully tried to acquire the page lock, that is ok: that means that some other thread was able to get ahead of us and lock the page, and when that other thread then unlocks the page, the wakeup event will be repeated. So any other pending waiters will now get properly woken up. Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit") Cc: Nick Piggin Cc: Peter Zijlstra Cc: Mel Gorman Cc: Jan Kara Cc: Davidlohr Bueso Cc: Andi Kleen Signed-off-by: Linus Torvalds --- mm/filemap.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index baba290c276b..0b41c8cbeabc 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -986,10 +986,6 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, if (likely(test_bit(bit_nr, &page->flags))) { io_schedule(); - if (unlikely(signal_pending_state(state, current))) { - ret = -EINTR; - break; - } } if (lock) { @@ -999,6 +995,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, if (!test_bit(bit_nr, &page->flags)) break; } + + if (unlikely(signal_pending_state(state, current))) { + ret = -EINTR; + break; + } } finish_wait(q, wait); -- cgit From 9c3a815f471a84811cf8021cf64aae3b8081dfde Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Mon, 28 Aug 2017 16:45:40 -0700 Subject: page waitqueue: always add new entries at the end Commit 3510ca20ece0 ("Minor page waitqueue cleanups") made the page queue code always add new waiters to the back of the queue, which helps upcoming patches to batch the wakeups for some horrid loads where the wait queues grow to thousands of entries. However, I forgot about the nasrt add_page_wait_queue() special case code that is only used by the cachefiles code. That one still continued to add the new wait queue entries at the beginning of the list. Fix it, because any sane batched wakeup will require that we don't suddenly start getting new entries at the beginning of the list that we already handled in a previous batch. [ The current code always does the whole list while holding the lock, so wait queue ordering doesn't matter for correctness, but even then it's better to add later entries at the end from a fairness standpoint ] Signed-off-by: Linus Torvalds --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index 0b41c8cbeabc..65b4b6e7f7bd 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1041,7 +1041,7 @@ void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter) unsigned long flags; spin_lock_irqsave(&q->lock, flags); - __add_wait_queue(q, waiter); + __add_wait_queue_entry_tail(q, waiter); SetPageWaiters(page); spin_unlock_irqrestore(&q->lock, flags); } -- cgit From 785373b4c38719f4af6775845df6be1dfaea120f Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 29 Aug 2017 09:11:06 -0700 Subject: Revert "rmap: do not call mmu_notifier_invalidate_page() under ptl" MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This reverts commit aac2fea94f7a3df8ad1eeb477eb2643f81fd5393. It turns out that that patch was complete and utter garbage, and broke KVM, resulting in odd oopses. Quoting Andrea Arcangeli: "The aforementioned commit has 3 bugs. 1) mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range. So all cases (THP on/off) are broken right now. To fix this is enough to replace mmu_notifier_invalidate_range with mmu_notifier_invalidate_range_start;mmu_notifier_invalidate_range_end. Either that or call multiple mmu_notifier_invalidate_page like before. 2) address + (1UL << compound_order(page) is buggy, it should be PAGE_SIZE << compound_order(page), it's bytes not pages, 2M not 512. 3) The whole invalidate_range thing was an attempt to call a single invalidate while walking multiple 4k ptes that maps the same THP (after a pmd virtual split without physical compound page THP split). It's unclear if the rmap_walk will always provide an address that is 2M aligned as parameter to try_to_unmap_one, in presence of THP. I think it needs also an address &= (PAGE_SIZE << compound_order(page)) - 1 to be safe" In general, we should stop making excuses for horrible MMU notifier users. It's much more important that the core VM is sane and safe, than letting MMU notifiers sleep. So if some MMU notifier is sleeping under a spinlock, we need to fix the notifier, not try to make excuses for that garbage in the core VM. Reported-and-tested-by: Bernhard Held Reported-and-tested-by: Adam Borowski Cc: Andrea Arcangeli Cc: Radim Krčmář Cc: Wanpeng Li Cc: Paolo Bonzini Cc: Takashi Iwai Cc: Nadav Amit Cc: Mike Galbraith Cc: Kirill A. Shutemov Cc: Jérôme Glisse Cc: axie Cc: Andrew Morton Signed-off-by: Linus Torvalds --- mm/rmap.c | 52 ++++++++++++++++++++++------------------------------ 1 file changed, 22 insertions(+), 30 deletions(-) (limited to 'mm') diff --git a/mm/rmap.c b/mm/rmap.c index c1286d47aa1f..c8993c63eb25 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -888,10 +888,10 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, .flags = PVMW_SYNC, }; int *cleaned = arg; - bool invalidation_needed = false; while (page_vma_mapped_walk(&pvmw)) { int ret = 0; + address = pvmw.address; if (pvmw.pte) { pte_t entry; pte_t *pte = pvmw.pte; @@ -899,11 +899,11 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, if (!pte_dirty(*pte) && !pte_write(*pte)) continue; - flush_cache_page(vma, pvmw.address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, pvmw.address, pte); + flush_cache_page(vma, address, pte_pfn(*pte)); + entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); - set_pte_at(vma->vm_mm, pvmw.address, pte, entry); + set_pte_at(vma->vm_mm, address, pte, entry); ret = 1; } else { #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE @@ -913,11 +913,11 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, if (!pmd_dirty(*pmd) && !pmd_write(*pmd)) continue; - flush_cache_page(vma, pvmw.address, page_to_pfn(page)); - entry = pmdp_huge_clear_flush(vma, pvmw.address, pmd); + flush_cache_page(vma, address, page_to_pfn(page)); + entry = pmdp_huge_clear_flush(vma, address, pmd); entry = pmd_wrprotect(entry); entry = pmd_mkclean(entry); - set_pmd_at(vma->vm_mm, pvmw.address, pmd, entry); + set_pmd_at(vma->vm_mm, address, pmd, entry); ret = 1; #else /* unexpected pmd-mapped page? */ @@ -926,16 +926,11 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, } if (ret) { + mmu_notifier_invalidate_page(vma->vm_mm, address); (*cleaned)++; - invalidation_needed = true; } } - if (invalidation_needed) { - mmu_notifier_invalidate_range(vma->vm_mm, address, - address + (1UL << compound_order(page))); - } - return true; } @@ -1328,7 +1323,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, }; pte_t pteval; struct page *subpage; - bool ret = true, invalidation_needed = false; + bool ret = true; enum ttu_flags flags = (enum ttu_flags)arg; /* munlock has nothing to gain from examining un-locked vmas */ @@ -1368,9 +1363,11 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(!pvmw.pte, page); subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); + address = pvmw.address; + if (!(flags & TTU_IGNORE_ACCESS)) { - if (ptep_clear_flush_young_notify(vma, pvmw.address, + if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) { ret = false; page_vma_mapped_walk_done(&pvmw); @@ -1379,7 +1376,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /* Nuke the page table entry. */ - flush_cache_page(vma, pvmw.address, pte_pfn(*pvmw.pte)); + flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); if (should_defer_flush(mm, flags)) { /* * We clear the PTE but do not flush so potentially @@ -1389,12 +1386,11 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * transition on a cached TLB entry is written through * and traps if the PTE is unmapped. */ - pteval = ptep_get_and_clear(mm, pvmw.address, - pvmw.pte); + pteval = ptep_get_and_clear(mm, address, pvmw.pte); set_tlb_ubc_flush_pending(mm, pte_dirty(pteval)); } else { - pteval = ptep_clear_flush(vma, pvmw.address, pvmw.pte); + pteval = ptep_clear_flush(vma, address, pvmw.pte); } /* Move the dirty bit to the page. Now the pte is gone. */ @@ -1409,12 +1405,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (PageHuge(page)) { int nr = 1 << compound_order(page); hugetlb_count_sub(nr, mm); - set_huge_swap_pte_at(mm, pvmw.address, + set_huge_swap_pte_at(mm, address, pvmw.pte, pteval, vma_mmu_pagesize(vma)); } else { dec_mm_counter(mm, mm_counter(page)); - set_pte_at(mm, pvmw.address, pvmw.pte, pteval); + set_pte_at(mm, address, pvmw.pte, pteval); } } else if (pte_unused(pteval)) { @@ -1438,7 +1434,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, swp_pte = swp_entry_to_pte(entry); if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); - set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); + set_pte_at(mm, address, pvmw.pte, swp_pte); } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(subpage) }; pte_t swp_pte; @@ -1464,7 +1460,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * If the page was redirtied, it cannot be * discarded. Remap the page to page table. */ - set_pte_at(mm, pvmw.address, pvmw.pte, pteval); + set_pte_at(mm, address, pvmw.pte, pteval); SetPageSwapBacked(page); ret = false; page_vma_mapped_walk_done(&pvmw); @@ -1472,7 +1468,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } if (swap_duplicate(entry) < 0) { - set_pte_at(mm, pvmw.address, pvmw.pte, pteval); + set_pte_at(mm, address, pvmw.pte, pteval); ret = false; page_vma_mapped_walk_done(&pvmw); break; @@ -1488,18 +1484,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, swp_pte = swp_entry_to_pte(entry); if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); - set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); + set_pte_at(mm, address, pvmw.pte, swp_pte); } else dec_mm_counter(mm, mm_counter_file(page)); discard: page_remove_rmap(subpage, PageHuge(page)); put_page(page); - invalidation_needed = true; + mmu_notifier_invalidate_page(mm, address); } - - if (invalidation_needed) - mmu_notifier_invalidate_range(mm, address, - address + (1UL << compound_order(page))); return ret; } -- cgit From a4d1a885251382250ec315482bdd8ca52dd61e6a Mon Sep 17 00:00:00 2001 From: Jérôme Glisse Date: Thu, 31 Aug 2017 17:17:26 -0400 Subject: dax: update to new mmu_notifier semantic MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range() and make sure it is bracketed by calls to *_invalidate_range_start()/end(). Note that because we can not presume the pmd value or pte value we have to assume the worst and unconditionaly report an invalidation as happening. Signed-off-by: Jérôme Glisse Cc: Dan Williams Cc: Ross Zwisler Cc: Bernhard Held Cc: Adam Borowski Cc: Andrea Arcangeli Cc: Radim Krčmář Cc: Wanpeng Li Cc: Paolo Bonzini Cc: Takashi Iwai Cc: Nadav Amit Cc: Mike Galbraith Cc: Kirill A. Shutemov Cc: axie Cc: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 26 +++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index fe2fba27ded2..56e48e4593cb 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4008,7 +4008,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) #endif /* __PAGETABLE_PMD_FOLDED */ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, - pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) + unsigned long *start, unsigned long *end, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) { pgd_t *pgd; p4d_t *p4d; @@ -4035,17 +4036,29 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, if (!pmdpp) goto out; + if (start && end) { + *start = address & PMD_MASK; + *end = *start + PMD_SIZE; + mmu_notifier_invalidate_range_start(mm, *start, *end); + } *ptlp = pmd_lock(mm, pmd); if (pmd_huge(*pmd)) { *pmdpp = pmd; return 0; } spin_unlock(*ptlp); + if (start && end) + mmu_notifier_invalidate_range_end(mm, *start, *end); } if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) goto out; + if (start && end) { + *start = address & PAGE_MASK; + *end = *start + PAGE_SIZE; + mmu_notifier_invalidate_range_start(mm, *start, *end); + } ptep = pte_offset_map_lock(mm, pmd, address, ptlp); if (!pte_present(*ptep)) goto unlock; @@ -4053,6 +4066,8 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, return 0; unlock: pte_unmap_unlock(ptep, *ptlp); + if (start && end) + mmu_notifier_invalidate_range_end(mm, *start, *end); out: return -EINVAL; } @@ -4064,20 +4079,21 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte_pmd(mm, address, ptepp, NULL, - ptlp))); + !(res = __follow_pte_pmd(mm, address, NULL, NULL, + ptepp, NULL, ptlp))); return res; } int follow_pte_pmd(struct mm_struct *mm, unsigned long address, + unsigned long *start, unsigned long *end, pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) { int res; /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte_pmd(mm, address, ptepp, pmdpp, - ptlp))); + !(res = __follow_pte_pmd(mm, address, start, end, + ptepp, pmdpp, ptlp))); return res; } EXPORT_SYMBOL(follow_pte_pmd); -- cgit From 369ea8242c0fb5239b4ddf0dc568f694bd244de4 Mon Sep 17 00:00:00 2001 From: Jérôme Glisse Date: Thu, 31 Aug 2017 17:17:27 -0400 Subject: mm/rmap: update to new mmu_notifier semantic v2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range() and make sure it is bracketed by calls to *_invalidate_range_start()/end(). Note that because we can not presume the pmd value or pte value we have to assume the worst and unconditionaly report an invalidation as happening. Changed since v2: - try_to_unmap_one() only one call to mmu_notifier_invalidate_range() - compute end with PAGE_SIZE << compound_order(page) - fix PageHuge() case in try_to_unmap_one() Signed-off-by: Jérôme Glisse Reviewed-by: Andrea Arcangeli Cc: Dan Williams Cc: Ross Zwisler Cc: Bernhard Held Cc: Adam Borowski Cc: Radim Krčmář Cc: Wanpeng Li Cc: Paolo Bonzini Cc: Takashi Iwai Cc: Nadav Amit Cc: Mike Galbraith Cc: Kirill A. Shutemov Cc: axie Cc: Andrew Morton Signed-off-by: Linus Torvalds --- mm/rmap.c | 35 ++++++++++++++++++++++++++++++++--- 1 file changed, 32 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/rmap.c b/mm/rmap.c index c8993c63eb25..c570f82e6827 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, .address = address, .flags = PVMW_SYNC, }; + unsigned long start = address, end; int *cleaned = arg; + /* + * We have to assume the worse case ie pmd for invalidation. Note that + * the page can not be free from this function. + */ + end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end); + while (page_vma_mapped_walk(&pvmw)) { + unsigned long cstart, cend; int ret = 0; - address = pvmw.address; + + cstart = address = pvmw.address; if (pvmw.pte) { pte_t entry; pte_t *pte = pvmw.pte; @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(vma->vm_mm, address, pte, entry); + cend = cstart + PAGE_SIZE; ret = 1; } else { #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, entry = pmd_wrprotect(entry); entry = pmd_mkclean(entry); set_pmd_at(vma->vm_mm, address, pmd, entry); + cstart &= PMD_MASK; + cend = cstart + PMD_SIZE; ret = 1; #else /* unexpected pmd-mapped page? */ @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, } if (ret) { - mmu_notifier_invalidate_page(vma->vm_mm, address); + mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); (*cleaned)++; } } + mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); + return true; } @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, pte_t pteval; struct page *subpage; bool ret = true; + unsigned long start = address, end; enum ttu_flags flags = (enum ttu_flags)arg; /* munlock has nothing to gain from examining un-locked vmas */ @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, flags & TTU_MIGRATION, page); } + /* + * We have to assume the worse case ie pmd for invalidation. Note that + * the page can not be free in this function as call of try_to_unmap() + * must hold a reference on the page. + */ + end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end); + while (page_vma_mapped_walk(&pvmw)) { /* * If the page is mlock()d, we cannot swap it out. @@ -1445,6 +1469,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) { WARN_ON_ONCE(1); ret = false; + /* We have to invalidate as we cleared the pte */ page_vma_mapped_walk_done(&pvmw); break; } @@ -1490,8 +1515,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, discard: page_remove_rmap(subpage, PageHuge(page)); put_page(page); - mmu_notifier_invalidate_page(mm, address); + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); } + + mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); + return ret; } -- cgit From 5f32b265400de723ab0db23101a75ac073bdd980 Mon Sep 17 00:00:00 2001 From: Jérôme Glisse Date: Thu, 31 Aug 2017 17:17:38 -0400 Subject: mm/mmu_notifier: kill invalidate_page MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The invalidate_page callback suffered from two pitfalls. First it used to happen after the page table lock was release and thus a new page might have setup before the call to invalidate_page() happened. This is in a weird way fixed by commit c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()") that moved the callback under the page table lock but this also broke several existing users of the mmu_notifier API that assumed they could sleep inside this callback. The second pitfall was invalidate_page() being the only callback not taking a range of address in respect to invalidation but was giving an address and a page. Lots of the callback implementers assumed this could never be THP and thus failed to invalidate the appropriate range for THP. By killing this callback we unify the mmu_notifier callback API to always take a virtual address range as input. Finally this also simplifies the end user life as there is now two clear choices: - invalidate_range_start()/end() callback (which allow you to sleep) - invalidate_range() where you can not sleep but happen right after page table update under page table lock Signed-off-by: Jérôme Glisse Cc: Bernhard Held Cc: Adam Borowski Cc: Andrea Arcangeli Cc: Radim Krčmář Cc: Wanpeng Li Cc: Paolo Bonzini Cc: Takashi Iwai Cc: Nadav Amit Cc: Mike Galbraith Cc: Kirill A. Shutemov Cc: axie Cc: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mmu_notifier.c | 14 -------------- 1 file changed, 14 deletions(-) (limited to 'mm') diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 54ca54562928..314285284e6e 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -174,20 +174,6 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, srcu_read_unlock(&srcu, id); } -void __mmu_notifier_invalidate_page(struct mm_struct *mm, - unsigned long address) -{ - struct mmu_notifier *mn; - int id; - - id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) { - if (mn->ops->invalidate_page) - mn->ops->invalidate_page(mn, mm, address); - } - srcu_read_unlock(&srcu, id); -} - void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, unsigned long start, unsigned long end) { -- cgit From e746bf730a76fe53b82c9e6b6da72d58e9ae3565 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Thu, 31 Aug 2017 16:15:20 -0700 Subject: mm,page_alloc: don't call __node_reclaim() with oom_lock held. We are doing a last second memory allocation attempt before calling out_of_memory(). But since slab shrinker functions might indirectly wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocations via sleeping locks, calling slab shrinker functions from node_reclaim() from get_page_from_freelist() with oom_lock held has possibility of deadlock. Therefore, make sure that last second memory allocation attempt does not call slab shrinker functions. Link: http://lkml.kernel.org/r/1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa Acked-by: Michal Hocko Cc: Mel Gorman Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7a58eb5757e3..1423da8dd16f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3291,10 +3291,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if - * we're still under heavy pressure. + * we're still under heavy pressure. But make sure that this reclaim + * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY + * allocation which will never fail due to oom_lock already held. */ - page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order, - ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac); + page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) & + ~__GFP_DIRECT_RECLAIM, order, + ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac); if (page) goto out; -- cgit From c461ad6a63b37ba74632e90c063d14823c884247 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 31 Aug 2017 16:15:30 -0700 Subject: mm, madvise: ensure poisoned pages are removed from per-cpu lists Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed and bisected it to the commit 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP"). The problem is that a page that was poisoned with madvise() is reused. The commit removed a check that would trigger if DEBUG_VM was enabled but re-enabling the check only fixes the problem as a side-effect by printing a bad_page warning and recovering. The root of the problem is that an madvise() can leave a poisoned page on the per-cpu list. This patch drains all per-cpu lists after pages are poisoned so that they will not be reused. Wendy reports that the test case in question passes with this patch applied. While this could be done in a targeted fashion, it is over-complicated for such a rare operation. Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") Signed-off-by: Mel Gorman Reported-by: Wang, Wendy Tested-by: Wang, Wendy Acked-by: David Rientjes Acked-by: Vlastimil Babka Cc: "Hansen, Dave" Cc: "Luck, Tony" Cc: Naoya Horiguchi Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/madvise.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'mm') diff --git a/mm/madvise.c b/mm/madvise.c index 23ed525bc2bc..4d7d1e5ddba9 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -613,6 +613,7 @@ static int madvise_inject_error(int behavior, unsigned long start, unsigned long end) { struct page *page; + struct zone *zone; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -646,6 +647,11 @@ static int madvise_inject_error(int behavior, if (ret) return ret; } + + /* Ensure that all poisoned pages are removed from per-cpu lists */ + for_each_populated_zone(zone) + drain_all_pages(zone); + return 0; } #endif -- cgit