path: root/mm
Age  Commit message  Author
2021-09-03mm/page_alloc: make alloc_node_mem_map() __init rather than __refMike Rapoport
alloc_node_mem_map() is only called from free_area_init_node(), which is an __init function. Make the actual alloc_node_mem_map() __init as well, and make its stub version static inline. Link: https://lkml.kernel.org/r/20210716064124.31865-1-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/page_alloc.c: fix 'zone_id' may be used uninitialized in this function warningNico Pache
When compiling with -Werror, cc1 reports that 'zone_id' may be used uninitialized in this function. Initialize zone_id to 0. It's safe to assume that if the code reaches this point there is at least one NUMA node with memory, so there is no need for an assertion before init_unavailable_range(). Link: https://lkml.kernel.org/r/20210716210336.1114114-1-npache@redhat.com Fixes: 122e093c1734 ("mm/page_alloc: fix memory map initialization for descending nodes") Signed-off-by: Nico Pache <npache@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03memblock: stop poisoning raw allocationsMike Rapoport
Functions memblock_alloc_exact_nid_raw() and memblock_alloc_try_nid_raw() are intended for early memory allocation without the overhead of zeroing the allocated memory. Since these functions were used to allocate the memory map, they ended up with an added call to page_init_poison() that poisoned the allocated memory when CONFIG_PAGE_POISON was set. Since the memory map is now allocated using a dedicated memmap_alloc() function that takes care of the poisoning, remove page poisoning from the memblock_alloc_*_raw() functions. Link: https://lkml.kernel.org/r/20210714123739.16493-5-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: introduce memmap_alloc() to unify memory map allocationMike Rapoport
There are several places that allocate memory for the memory map: alloc_node_mem_map() for FLATMEM, sparse_buffer_init() and __populate_section_memmap() for SPARSEMEM. The memory allocated in the FLATMEM case is zeroed and it is never poisoned, regardless of the CONFIG_PAGE_POISON setting. The memory allocated in the SPARSEMEM cases is not zeroed and it is implicitly poisoned inside memblock if CONFIG_PAGE_POISON is set. Introduce a memmap_alloc() wrapper for memblock allocators that will be used for both FLATMEM and SPARSEMEM cases and will make memory map zeroing and poisoning consistent for different memory models. Link: https://lkml.kernel.org/r/20210714123739.16493-4-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
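The wrapper idea can be illustrated with a minimal userspace sketch. This is not the kernel implementation: memmap_alloc_sketch(), POISON_BYTE and poison_enabled are hypothetical names, malloc() stands in for the memblock allocators, and a plain flag stands in for the poisoning configuration. It only shows how routing every caller through one function keeps the zero/poison policy in one place.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define POISON_BYTE 0xff /* hypothetical pattern, chosen for illustration */

/* Hypothetical stand-in for a build-time or boot-time poisoning switch. */
bool poison_enabled = true;

/*
 * Single entry point for allocating a "memory map" buffer.
 * The buffer is never zeroed; it is filled with POISON_BYTE only when
 * poisoning is enabled, so every caller sees the same policy.
 */
void *memmap_alloc_sketch(size_t size)
{
	unsigned char *buf = malloc(size); /* raw: contents unspecified */

	if (buf && poison_enabled)
		memset(buf, POISON_BYTE, size);
	return buf;
}
```

Callers that previously relied on zeroed memory would have to initialize every entry explicitly, which is the behaviour the series asks for.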
2021-09-03mm/page_alloc: always initialize memory map for the holesMike Rapoport
Patch series "mm: ensure consistency of memory map poisoning". Currently memory map allocation for the FLATMEM case does not poison the struct pages regardless of the CONFIG_PAGE_POISON setting. This happens because allocation of the memory map for FLATMEM and SPARSEMEM use different memblock functions and those that are used for the SPARSEMEM case (namely memblock_alloc_try_nid_raw() and memblock_alloc_exact_nid_raw()) implicitly poison the allocated memory. Another side effect of this implicit poisoning is that early setup code that uses the same functions to allocate memory burns cycles for the memory poisoning even if it was not intended. These patches introduce a memmap_alloc() wrapper that ensures that the memory map allocation is consistent for different memory models. This patch (of 4): Currently the memory map for the holes is initialized only when the SPARSEMEM memory model is used. Yet, even with FLATMEM there could be holes in the physical memory layout that have memory map entries. For instance, the memory reserved using e820 API on i386 or "reserved-memory" nodes in device tree would not appear in memblock.memory and hence the struct pages for such holes will be skipped during memory map initialization. These struct pages will be zeroed because the memory map for FLATMEM systems is allocated with memblock_alloc_node() that clears the allocated memory. While zeroed struct pages do not cause immediate problems, the correct behaviour is to initialize every page using __init_single_page(). Besides, enabling page poison for the FLATMEM case will trigger PF_POISONED_CHECK() unless the memory map is properly initialized. Make sure init_unavailable_range() is called for both SPARSEMEM and FLATMEM so that struct pages representing memory holes would appear as PG_Reserved with any memory layout.
[rppt@kernel.org: fix microblaze] Link: https://lkml.kernel.org/r/YQWW3RCE4eWBuMu/@kernel.org Link: https://lkml.kernel.org/r/20210714123739.16493-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210714123739.16493-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/kasan: move kasan.fault to mm/kasan/report.cWoody Lin
Move the boot parameter 'kasan.fault' from hw_tags.c to report.c, so it can support all KASAN modes - generic, and both tag-based. Link: https://lkml.kernel.org/r/20210713010536.3161822-1-woodylin@google.com Signed-off-by: Woody Lin <woodylin@google.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmalloc: fix wrong behavior in vreadChen Wandun
Commit f608788cd2d6 ("mm/vmalloc: use rb_tree instead of list for vread() lookups") uses an rb_tree instead of a list to speed up lookups, but __find_vmap_area() tries to find a vmap_area that includes the target address. If the target address is smaller than the leftmost node in vmap_area_root, it returns NULL and vread() reads nothing. This behavior differs from the original semantics. The correct approach is to find the first vmap_area that is bigger than the target address, which is what find_vmap_area_exceed_addr() does. Link: https://lkml.kernel.org/r/20210714015959.3204871-1-chenwandun@huawei.com Fixes: f608788cd2d6 ("mm/vmalloc: use rb_tree instead of list for vread() lookups") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Cc: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
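The difference between the two lookups can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: struct area, find_containing() and find_exceed_addr() are hypothetical names, a sorted array with a linear scan replaces the rb_tree, and "exceeds" is taken to mean the lowest area whose end lies above the address.

```c
#include <stddef.h>

struct area {
	unsigned long start, end; /* half-open range [start, end) */
};

/* Returns the area containing addr, or NULL if addr falls in a gap. */
const struct area *find_containing(const struct area *a, size_t n,
				   unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= a[i].start && addr < a[i].end)
			return &a[i];
	return NULL;
}

/*
 * Returns the lowest area whose end lies above addr, or NULL if there is
 * none. Assumes the array is sorted by start and areas do not overlap.
 */
const struct area *find_exceed_addr(const struct area *a, size_t n,
				    unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (a[i].end > addr)
			return &a[i];
	return NULL;
}
```

With an address below the first area, the containing lookup finds nothing, while the second lookup still returns the first area so that a reader can continue from there.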
2021-09-03mm/vmalloc: remove gfpflags_allow_blocking() checkUladzislau Rezki (Sony)
Get rid of gfpflags_allow_blocking() check from the vmalloc() path as it is supposed to be sleepable anyway. Thus remove it from the alloc_vmap_area() as well as from the vm_area_alloc_pages(). Link: https://lkml.kernel.org/r/20210707182639.31282-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmalloc: use batched page requests in bulk-allocatorUladzislau Rezki (Sony)
With simultaneous vmalloc allocations, for example 1GB allocations with 12 CPUs, my system is able to hit "BUG: soft lockup" on a !CONFIG_PREEMPT kernel. RIP: 0010:__alloc_pages_bulk+0xa9f/0xbb0 Call Trace: __vmalloc_node_range+0x11c/0x2d0 __vmalloc_node+0x4b/0x70 fix_size_alloc_test+0x44/0x60 [test_vmalloc] test_func+0xe7/0x1f0 [test_vmalloc] kthread+0x11a/0x140 ret_from_fork+0x22/0x30 To address this issue, invoke the bulk-allocator repeatedly until all pages are obtained, i.e. do batched page requests, calling cond_resched() in between to allow rescheduling. The batch size is hard-coded to 100 pages per call. Link: https://lkml.kernel.org/r/20210707182639.31282-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
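The batching pattern might look like the following userspace sketch. Everything here is an assumption made for illustration: bulk_alloc() is a hypothetical stand-in for the kernel bulk-allocator, malloc(4096) stands in for a page, and sched_yield() stands in for cond_resched(). Only the batch size of 100 comes from the commit message.

```c
#include <sched.h>
#include <stdlib.h>

#define BATCH 100U /* pages per request, as stated in the commit message */

/* Hypothetical bulk allocator: fills up to nr slots, returns how many. */
static unsigned int bulk_alloc(unsigned int nr, void **slots)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		slots[i] = malloc(4096);
		if (!slots[i])
			break;
	}
	return i;
}

/* Requests pages in batches, yielding the CPU between batches. */
unsigned int alloc_pages_batched(unsigned int nr_pages, void **pages)
{
	unsigned int done = 0;

	while (done < nr_pages) {
		unsigned int want = nr_pages - done;
		unsigned int got;

		if (want > BATCH)
			want = BATCH;
		got = bulk_alloc(want, pages + done);
		done += got;
		sched_yield(); /* userspace stand-in for cond_resched() */
		if (got != want)
			break; /* allocator ran dry; report what we have */
	}
	return done;
}
```

Bounding each request keeps the time spent between scheduling points roughly constant, regardless of the total allocation size.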
2021-09-03mm/sparse: clarify pgdat_to_physMiles Chen
Clarify pgdat_to_phys() by testing if pgdat == &contig_page_data when CONFIG_NUMA=n. We only expect contig_page_data in such case, so we use &contig_page_data directly instead of pgdat. No functional change intended when CONFIG_DEBUG_VM=n. Comment from Mark [1]:
" ... and I reckon it'd be clearer and more robust to define pgdat_to_phys() in the same ifdefs as contig_page_data so that these stay in sync, e.g. have:
| #ifdef CONFIG_NUMA
| #define pgdat_to_phys(x) virt_to_phys(x)
| #else /* CONFIG_NUMA */
|
| extern struct pglist_data contig_page_data;
| ...
| #define pgdat_to_phys(x) __pa_symbol(&contig_page_data)
|
| #endif /* CONFIG_NUMA */
"
[1] https://lore.kernel.org/linux-arm-kernel/20210615131902.GB47121@C02TD0UTHF1T.local/ Link: https://lkml.kernel.org/r/20210723123342.26406-1-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03include/linux/mmzone.h: avoid a warning in sparse memory supportMatthew Wilcox
cppcheck warns that we're possibly losing information by shifting an int. It's a false positive, because we don't allow for a NUMA node ID that large, but if we ever change SECTION_NID_SHIFT, it could become a problem, and in any case this is usually a legitimate warning. Fix it by adding the necessary cast, which makes the compiler generate the right code. Link: https://lkml.kernel.org/r/YOya+aBZFFmC476e@casper.infradead.org Link: https://lkml.kernel.org/r/202107130348.6LsVT9Nc-lkp@intel.com Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
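The class of warning described above can be shown with a small sketch. The shift width NID_SHIFT_SKETCH and the function encode_nid() are hypothetical, and the shift is deliberately wider than an int so that the loss would be visible without the cast; it is not the value used by SECTION_NID_SHIFT. The node ID is assumed to be non-negative.

```c
#include <stdint.h>

#define NID_SHIFT_SKETCH 40 /* hypothetical, deliberately wider than int */

/*
 * Widen the node ID before shifting so the shift is carried out in
 * 64 bits. Shifting the plain int by this amount would be undefined.
 */
uint64_t encode_nid(int nid)
{
	return (uint64_t)nid << NID_SHIFT_SKETCH;
}
```

The cast costs nothing at run time; it only tells the compiler which width to use for the shift.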
2021-09-03mm: sparse: remove __section_nr() functionOhhoon Kwon
As the last users of __section_nr() are gone, let's remove unused function __section_nr(). Link: https://lkml.kernel.org/r/20210707150212.855-4-ohoono.kwon@samsung.com Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: sparse: pass section_nr to section_mark_presentOhhoon Kwon
Patch series "mm: sparse: remove __section_nr() function", v4. This patch (of 3): With CONFIG_SPARSEMEM_EXTREME enabled, __section_nr() which converts mem_section to section_nr could be costly since it iterates all section roots to check if the given mem_section is in its range. Since both callers of section_mark_present already know section_nr, let's also pass section_nr as well as mem_section in order to reduce costly translation. Link: https://lkml.kernel.org/r/20210707150212.855-1-ohoono.kwon@samsung.com Link: https://lkml.kernel.org/r/20210707150212.855-2-ohoono.kwon@samsung.com Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/bootmem_info.c: mark __init on register_page_bootmem_info_sectionMuchun Song
register_page_bootmem_info_section() is only called from __init functions, so mark it __init as well. Link: https://lkml.kernel.org/r/20210817042221.77172-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/mremap: fix memory account on do_munmap() failureChen Wandun
mremap will account the delta between new_len and old_len in vma_to_resize, and then call move_vma when expanding an existing memory mapping. In move_vma(), there are two scenarios when calling do_munmap(): 1. move_page_tables() from old_addr to new_addr succeeds 2. move_page_tables() from old_addr to new_addr fails. In the first scenario, old_len should be accounted if do_munmap() fails, because the delta has already been accounted. In the second scenario, new_addr/new_len are assigned to old_addr/old_len if move_page_tables() fails, so do_munmap() actually tries to unmap new_addr; if do_munmap() fails, new_len should be accounted, because an error code will be returned from move_vma() and the delta will be unaccounted. What's more, because new_len == old_len, accounting old_len is also OK. In summary, accounting old_len is correct if do_munmap() fails. Link: https://lkml.kernel.org/r/20210717101942.120607-1-chenwandun@huawei.com Fixes: 51df7bcb6151 ("mm/mremap: account memory on do_munmap() failure") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Acked-by: Dmitry Safonov <dima@arista.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03remap_file_pages: Use vma_lookup() instead of find_vma()Liam R. Howlett
Using vma_lookup() verifies the start address is contained in the found vma. This results in easier to read code. Link: https://lkml.kernel.org/r/20210817135234.1550204-1-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
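Why the lookup variant reads more easily can be sketched in userspace. The names struct vma_sketch, find_vma_sketch() and vma_lookup_sketch() are hypothetical and a sorted array replaces the kernel's VMA data structure; the sketch only models the documented contract that find_vma() returns the first VMA ending above the address, while vma_lookup() additionally requires the address to lie inside it.

```c
#include <stddef.h>

struct vma_sketch {
	unsigned long vm_start, vm_end; /* half-open range */
};

/* First VMA with addr < vm_end; it may start above addr. */
const struct vma_sketch *find_vma_sketch(const struct vma_sketch *v,
					 size_t n, unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < v[i].vm_end)
			return &v[i];
	return NULL;
}

/* Same search, but only succeeds when addr is inside the VMA found. */
const struct vma_sketch *vma_lookup_sketch(const struct vma_sketch *v,
					   size_t n, unsigned long addr)
{
	const struct vma_sketch *vma = find_vma_sketch(v, n, addr);

	if (vma && addr < vma->vm_start)
		vma = NULL;
	return vma;
}
```

A caller of the second function needs a single NULL check instead of a NULL check followed by a separate start-address comparison.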
2021-09-03mm/pagemap: add mmap_assert_locked() annotations to find_vma*()Luigi Rizzo
find_vma() and variants need protection when used. This patch adds mmap_assert_locked() calls in the functions. To make sure the invariant is satisfied, we also need to add a mmap_read_lock() around the get_user_pages_remote() call in get_arg_page(). The lock is not strictly necessary because the mm has been newly created, but the extra cost is limited because the same mutex was also acquired shortly before in __bprm_mm_init(), so it is hot and uncontended. [penguin-kernel@i-love.sakura.ne.jp: TOMOYO needs the same protection which get_arg_page() needs] Link: https://lkml.kernel.org/r/58bb6bf7-a57e-8a40-e74b-39584b415152@i-love.sakura.ne.jp Link: https://lkml.kernel.org/r/20210731175341.3458608-1-lrizzo@google.com Signed-off-by: Luigi Rizzo <lrizzo@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing codeHuang Ying
Before commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling"), the TLB flushing is done in do_huge_pmd_numa_page() itself via flush_tlb_range(). But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling"), the TLB flushing is done in migrate_pages() as in the following code path anyway. do_huge_pmd_numa_page migrate_misplaced_page migrate_pages So now, the TLB flushing code in do_huge_pmd_numa_page() becomes unnecessary. So the code is deleted in this patch to simplify the code. This is only code cleanup, there's no visible performance difference. The mmu_notifier_invalidate_range() in do_huge_pmd_numa_page() is deleted too. Because migrate_pages() takes care of that too when CPU TLB is flushed. Link: https://lkml.kernel.org/r/20210720065529.716031-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03memcg: make memcg->event_list_lock irqsafeShakeel Butt
The memcg->event_list_lock is usually taken in the normal context but when the userspace closes the corresponding eventfd, eventfd_release through memcg_event_wake takes memcg->event_list_lock with interrupts disabled. This is not an issue on its own but it creates a nested dependency from eventfd_ctx->wqh.lock to memcg->event_list_lock. Independently, for an unrelated eventfd, eventfd_signal() can be called in the irq context, thus making eventfd_ctx->wqh.lock an irq lock. Examples are the FPGA DFL driver, the VHOST VDPA driver and a couple of VFIO drivers. This will force memcg->event_list_lock to be an irqsafe lock as well. One way to break the nested dependency between eventfd_ctx->wqh.lock and memcg->event_list_lock is to add an indirection. However the simplest solution would be to make memcg->event_list_lock irqsafe. This is a cgroup v1 feature that is in maintenance mode and may get deprecated in the near future. So, no need to add more code. BTW this has been discussed previously [1] but there weren't irq users of eventfd_signal() at the time. [1] https://www.spinics.net/lists/cgroups/msg06248.html Link: https://lkml.kernel.org/r/20210830172953.207257-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03memcg: fix up drain_local_stock commentMichal Hocko
Thomas and Vlastimil have noticed that the comment in drain_local_stock doesn't quite make sense. It talks about a synchronization with the memory hotplug but there is no actual memory hotplug involvement here. I meant to talk about cpu hotplug here. Fix that up and hopefully make the comment more helpful by referencing the cpu hotplug callback as well. Link: https://lkml.kernel.org/r/YRDwOhVglJmY7ES5@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm, memcg: save some atomic ops when flush is already trueMiaohe Lin
Add 'else' to save some atomic ops in obj_stock_flush_required() when flush is already true. No functional change intended here. Link: https://lkml.kernel.org/r/20210807082835.61281-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
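The effect of the added 'else' can be shown with a small sketch. The functions flush_required() and second_check(), and the counter expensive_checks, are hypothetical; second_check() stands in for whatever costly test (atomic operations in the commit) is skipped once the answer is already known.

```c
#include <stdbool.h>

int expensive_checks; /* counts calls, for illustration only */

/* Stand-in for a check that costs atomic operations in the kernel. */
static bool second_check(bool value)
{
	expensive_checks++;
	return value;
}

bool flush_required(bool first, bool second)
{
	bool flush = false;

	if (first)
		flush = true;
	else if (second_check(second)) /* skipped when flush is already true */
		flush = true;
	return flush;
}
```

The result is the same with or without the 'else'; only the number of second_check() calls changes.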
2021-09-03mm: memcontrol: set the correct memcg swappiness restrictionBaolin Wang
Commit c843966c556d ("mm: allow swappiness that prefers reclaiming anon over the file workingset") expanded the swappiness range so that swap can be preferred on some systems. We should also change the memcg swappiness restriction to allow memcg to prefer swap. Link: https://lkml.kernel.org/r/d77469b90c45c49953ccbc51e54a1d465bc18f70.1627626255.git.baolin.wang@linux.alibaba.com Fixes: c843966c556d ("mm: allow swappiness that prefers reclaiming anon over the file workingset") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
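A range check of this kind could be sketched as below. The upper bound of 200 is an assumption that is not stated in the commit message above; it is taken from the range that the referenced commit introduced for the global swappiness setting. set_swappiness_sketch() and SWAPPINESS_MAX_SKETCH are hypothetical names.

```c
#include <errno.h>

#define SWAPPINESS_MAX_SKETCH 200 /* assumed bound, see the note above */

/* Stores val in *out if it is within range, otherwise returns -EINVAL. */
int set_swappiness_sketch(unsigned long val, unsigned int *out)
{
	if (val > SWAPPINESS_MAX_SKETCH)
		return -EINVAL;
	*out = (unsigned int)val;
	return 0;
}
```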
2021-09-03memcg: replace in_interrupt() by !in_task() in active_memcg()Vasily Averin
set_active_memcg() uses in_interrupt() check to select proper storage for cgroup: pointer on task struct or per-cpu pointer. It isn't fully correct: obsoleted in_interrupt() includes tasks with disabled BH. It's better to use '!in_task()' instead. Link: https://lkml.org/lkml/2021/7/26/487 Link: https://lkml.kernel.org/r/ed4448b0-4970-616f-7368-ef9dd3cb628d@virtuozzo.com Fixes: 37d5985c003d ("mm: kmem: prepare remote memcg charging infra for interrupt contexts") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03memcg: infrastructure to flush memcg statsShakeel Butt
At the moment memcg stats are read in four contexts: 1. memcg stat user interfaces 2. dirty throttling 3. page fault 4. memory reclaim Currently the kernel flushes the stats for the first two cases. Flushing the stats for the remaining two cases may have a performance impact. Always flushing the memcg stats on the page fault code path may negatively impact the performance of the applications. In addition, flushing in the memory reclaim code path, though treated as slowpath, can become the source of contention for the global lock taken for stat flushing because when system or memcg is under memory pressure, many tasks may enter the reclaim path. This patch uses the following mechanisms to solve these challenges: 1. Periodically flush the stats from root memcg every 2 seconds. This will time limit the out of sync stats. 2. Asynchronously flush the stats after a fixed number of stat updates. In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for 2 seconds. 3. For avoiding thundering herd to flush the stats particularly from the memory reclaim context, introduce memcg local spinlock and let only one flusher be active at a time. This could have been done through cgroup_rstat_lock lock but that lock is used by other subsystem and for userspace reading memcg stats. So, it is better to keep flushers introduced by this patch decoupled from cgroup_rstat_lock. However we would have to use irqsafe version of rstat flush but that is fine as this code path will be flushing for whole tree and do the work for everyone. No one will be waiting for that worker.
[shakeelb@google.com: fix sleep-in-wrong context bug] Link: https://lkml.kernel.org/r/20210716212137.1391164-2-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
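Mechanisms 2 and 3 above can be sketched in userspace with C11 atomics. This is an illustration only: FLUSH_THRESHOLD, stat_updated(), try_flush() and flushes_done are hypothetical, an atomic flag replaces the memcg-local spinlock, and the flush runs synchronously here whereas the patch describes an asynchronous flush. The periodic 2-second flush is not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FLUSH_THRESHOLD 64 /* hypothetical number of updates per flush */

static atomic_int pending_updates;
static atomic_flag flush_in_progress = ATOMIC_FLAG_INIT;
int flushes_done; /* only modified while flush_in_progress is held */

/* Flushes unless another flusher is active; true if this caller flushed. */
bool try_flush(void)
{
	if (atomic_flag_test_and_set(&flush_in_progress))
		return false; /* someone else is flushing for everyone */
	atomic_store(&pending_updates, 0);
	flushes_done++;
	atomic_flag_clear(&flush_in_progress);
	return true;
}

/* Records one stat update and flushes once enough have accumulated. */
void stat_updated(void)
{
	if (atomic_fetch_add(&pending_updates, 1) + 1 >= FLUSH_THRESHOLD)
		try_flush();
}
```

Callers that lose the race simply return, which is what avoids the thundering herd: nobody queues up behind the active flusher.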
2021-09-03memcg: switch lruvec stats to rstatShakeel Butt
The commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat") switched memcg stats to rstat infrastructure but skipped the conversion of the lruvec stats as such stats are read in the performance critical code paths and flushing stats may have impacted the performance of the applications. This patch converts the lruvec stats to rstat and later patches add mechanisms to keep the performance impact to a minimum. The rstat conversion comes with a price, i.e. memory cost. Effectively this patch reverts the savings done by the commit f3344adf38bd ("mm: memcontrol: optimize per-lruvec stats counter memory usage"). However this cost is justified due to the negative impact of inaccurate lruvec stats on many heuristics. One such case is reported in [1]. The memory reclaim code is filled with a plethora of heuristics and many of those heuristics read the lruvec stats. So, inaccurate stats can make such heuristics ineffective. [1] reports the impact of inaccurate lruvec stats on the "cache trim mode" heuristic. Inaccurate lruvec stats can impact the deactivation and aging anon heuristics as well. [1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.com/ Link: https://lkml.kernel.org/r/20210716212137.1391164-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Koutný <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm, memcg: inline swap-related functions to improve disabled memcg configSuren Baghdasaryan
Inline mem_cgroup_try_charge_swap, mem_cgroup_uncharge_swap and cgroup_throttle_swaprate functions to perform mem_cgroup_disabled static key check inline before calling the main body of the function. This minimizes the memcg overhead in the pagefault and exit_mmap paths when memcgs are disabled using cgroup_disable=memory command-line option. This change results in ~1% overhead reduction when running PFT test [1] comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory} configuration on an 8-core ARM64 Android device. [1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite Link: https://lkml.kernel.org/r/20210713010934.299876-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm, memcg: inline mem_cgroup_{charge/uncharge} to improve disabled memcg configSuren Baghdasaryan
Inline mem_cgroup_{charge/uncharge} and mem_cgroup_uncharge_list functions to perform mem_cgroup_disabled static key check inline before calling the main body of the function. This minimizes the memcg overhead in the pagefault and exit_mmap paths when memcgs are disabled using cgroup_disable=memory command-line option. This change results in ~0.4% overhead reduction when running PFT test [1] comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory} configuration on an 8-core ARM64 Android device. [1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite Link: https://lkml.kernel.org/r/20210713010934.299876-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
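The inlining pattern can be sketched in userspace as follows. All names are hypothetical (charge_sketch(), charge_slow_sketch(), memcg_disabled_sketch, slow_path_calls), and a plain bool replaces the kernel's static key, which patches the branch at run time. The sketch only shows the shape: a cheap inline check in front of an out-of-line body.

```c
#include <stdbool.h>

bool memcg_disabled_sketch = true; /* stand-in for the static key */
int slow_path_calls;               /* counts entries into the main body */

/* Out-of-line main body; only reached when the feature is enabled. */
int charge_slow_sketch(long nr_pages)
{
	(void)nr_pages;
	slow_path_calls++;
	return 0;
}

/* Inline wrapper: the disabled case returns without a function call. */
static inline int charge_sketch(long nr_pages)
{
	if (memcg_disabled_sketch)
		return 0;
	return charge_slow_sketch(nr_pages);
}
```

When the feature is disabled, callers pay for one predictable branch instead of a call into the main body.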
2021-09-03mm, memcg: add mem_cgroup_disabled checks in vmpressure and swap-related functionsSuren Baghdasaryan
Add mem_cgroup_disabled check in vmpressure, mem_cgroup_uncharge_swap and cgroup_throttle_swaprate functions. This minimizes the memcg overhead in the pagefault and exit_mmap paths when memcgs are disabled using cgroup_disable=memory command-line option. This change results in ~2.1% overhead reduction when running PFT test [1] comparing {CONFIG_MEMCG=n, CONFIG_MEMCG_SWAP=n} against {CONFIG_MEMCG=y, CONFIG_MEMCG_SWAP=y, cgroup_disable=memory} configuration on an 8-core ARM64 Android device. [1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite Link: https://lkml.kernel.org/r/20210713010934.299876-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03shmem: shmem_writepage() split unlikely i915 THPHugh Dickins
drivers/gpu/drm/i915/gem/i915_gem_shmem.c contains a shmem_writeback() which calls shmem_writepage() from a shrinker: that usually works well enough; but if /sys/kernel/mm/transparent_hugepage/shmem_enabled has been set to "always" (intended to be usable) or "force" (forces huge everywhere for easy testing), shmem_writepage() is surprised to be called with a huge page, and crashes on the VM_BUG_ON_PAGE(PageCompound) (I did not find out where the crash happens when CONFIG_DEBUG_VM is off). LRU page reclaim always splits the shmem huge page first: I'd prefer not to demand that of i915, so check and split compound in shmem_writepage(). Patch history: when first sent last year http://lkml.kernel.org/r/alpine.LSU.2.11.2008301401390.5954@eggly.anvils https://lore.kernel.org/linux-mm/20200919042009.bomzxmrg7%25akpm@linux-foundation.org/ Matthew Wilcox noticed that tail pages were wrongly left clean. This version brackets the split with Set and Clear PageDirty as he suggested: which works very well, even if it falls short of our aspirations. And recently I realized that the crash is not limited to the testing option "force", but affects "always" too: which is more important to fix. Link: https://lkml.kernel.org/r/bac6158c-8b3d-4dca-cffc-4982f58d9794@google.com Fixes: 2d6692e642e7 ("drm/i915: Start writeback from the shrinker") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03huge tmpfs: decide stat.st_blksize by shmem_is_huge()Hugh Dickins
4.18 commit 89fdcd262fd4 ("mm: shmem: make stat.st_blksize return huge page size if THP is on") added is_huge_enabled() to decide st_blksize: if hugeness is to be defined per file, that will need to be replaced by shmem_is_huge(). This does give a different answer (No) for small files on a "huge=within_size" mount: but that can be considered a minor bugfix. And a different answer (No) for default files on a "huge=advise" mount: I'm reluctant to complicate it, just to reproduce the same debatable answer as before. Link: https://lkml.kernel.org/r/af7fb3f9-4415-9e8e-fdac-b1a5253ad21@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03huge tmpfs: shmem_is_huge(vma, inode, index)Hugh Dickins
Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so that a consistent set of checks can be applied, even when the inode is accessed through read/write syscalls (with NULL vma) instead of mmaps (the index argument is seldom of interest, but required by mount option "huge=within_size"). Clean up and rearrange the checks a little. This then replaces the checks which shmem_fault() and shmem_getpage_gfp() were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes. Replace a couple of 0s by explicit SHMEM_HUGE_NEVERs; and replace the obscure !shmem_mapping() symlink check by explicit S_ISLNK() - nothing else needs that symlink check, so leave it there in shmem_getpage_gfp(). Link: https://lkml.kernel.org/r/23a77889-2ddc-b030-75cd-44ca27fd4d1@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
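As a rough illustration of why the index argument matters for "huge=within_size", here is a userspace model of such a check. It is a sketch under assumptions, not the kernel function: 4 KiB pages and 512-page huge extents are assumed, all model_ and MODEL_ names are hypothetical, and letting a within_size file that does not fit fall back to the advise test is a choice of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12UL                       /* assumed 4 KiB pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)
#define MODEL_HPAGE_NR   512UL                      /* assumed pages per huge page */

enum model_huge {
	MODEL_HUGE_NEVER,
	MODEL_HUGE_ALWAYS,
	MODEL_HUGE_WITHIN_SIZE,
	MODEL_HUGE_ADVISE,
};

struct model_vma { bool advised_huge; };            /* madvise-style hint */
struct model_inode { enum model_huge huge; unsigned long i_size; };

static unsigned long model_round_up(unsigned long x, unsigned long to)
{
	assert(to != 0);
	return (x + to - 1) / to * to;
}

/* vma may be NULL, as when the file is reached through read/write. */
static bool model_is_huge(const struct model_vma *vma,
			  const struct model_inode *inode, unsigned long index)
{
	assert(inode != NULL);
	if (inode->huge == MODEL_HUGE_NEVER)
		return false;
	if (inode->huge == MODEL_HUGE_ALWAYS)
		return true;
	if (inode->huge == MODEL_HUGE_WITHIN_SIZE) {
		/* End (in pages) of the huge extent that contains index. */
		unsigned long end = model_round_up(index + 1, MODEL_HPAGE_NR);
		unsigned long pages =
			model_round_up(inode->i_size, MODEL_PAGE_SIZE) >> MODEL_PAGE_SHIFT;

		if (pages >= end)
			return true;
	}
	/* advise, or a within_size extent that does not fit: need a hint. */
	return vma != NULL && vma->advised_huge;
}
```

In this model a 4 KiB file on a within_size mount answers "no" for index 0 with a NULL vma, which matches the behaviour the neighbouring st_blksize commit describes for small files.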
2021-09-03huge tmpfs: SGP_NOALLOC to stop collapse_file() on raceHugh Dickins
khugepaged's collapse_file() currently uses SGP_NOHUGE to tell shmem_getpage() not to try allocating a huge page, in the very unlikely event that a racing hole-punch removes the swapped or fallocated page as soon as i_pages lock is dropped. We want to consolidate shmem's huge decisions, removing SGP_HUGE and SGP_NOHUGE; but cannot quite persuade ourselves that it's okay to regress the protection in this case - Yang Shi points out that the huge page would remain indefinitely, charged to root instead of the intended memcg. collapse_file() should not even allocate a small page in this case: why proceed if someone is punching a hole? SGP_READ is almost the right flag here, except that it optimizes away from a fallocated page, with NULL to tell caller to fill with zeroes (like a hole); whereas collapse_file()'s sequence relies on using a cache page. Add SGP_NOALLOC just for this. There are too many consecutive "if (page"s there in shmem_getpage_gfp(): group it better; and fix the outdated "bring it back from swap" comment. Link: https://lkml.kernel.org/r/1355343b-acf-4653-ef79-6aee40214ac5@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03huge tmpfs: move shmem_huge_enabled() upwardsHugh Dickins
shmem_huge_enabled() is about to be enhanced into shmem_is_huge(), so that it can be used more widely throughout: before making functional changes, shift it to its final position (to avoid forward declaration). Link: https://lkml.kernel.org/r/16fec7b7-5c84-415a-8586-69d8bf6a6685@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03huge tmpfs: revert shmem's use of transhuge_vma_enabled()Hugh Dickins
5.14 commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled()") added transhuge_vma_enabled() as a wrapper for two very different checks (one check is whether the app has marked its address range not to use THPs, the other check is whether the app is running in a hierarchy that has been marked never to use THPs). shmem_huge_enabled() prefers to show those two checks explicitly, as before. Link: https://lkml.kernel.org/r/45e5338-18d-c6f9-c17e-34f510bc1728@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03huge tmpfs: remove shrinklist addition from shmem_setattr()Hugh Dickins
There's a block of code in shmem_setattr() to add the inode to shmem_unused_huge_shrink()'s shrinklist when lowering i_size: it dates from before 5.7 changed truncation to do split_huge_page() for itself, and should have been removed at that time. I am over-stating that: split_huge_page() can fail (notably if there's an extra reference to the page at that time), so there might be value in retrying. But there were already retries as truncation worked through the tails, and this addition risks repeating unsuccessful retries indefinitely: I'd rather remove it now, and work on reducing the chance of split_huge_page() failures separately, if we need to. Link: https://lkml.kernel.org/r/b73b3492-8822-18f9-83e2-938528cdde94@google.com Fixes: 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page() when punching hole") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZEHugh Dickins
A successful shmem_fallocate() guarantees that the extent has been reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used. But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to split huge pages and free their excess beyond i_size; and by other uses of split_huge_page() near i_size. It's sad to add a shmem inode field just for this, but I did not find a better way to keep the guarantee. A flag to say KEEP_SIZE has been used would be cheaper, but I'm averse to unclearable flags. The fallocend field is not perfect either (many disjoint ranges might be fallocated), but good enough; and gains another use later on. Link: https://lkml.kernel.org/r/ca9a146-3a59-6cd3-7f28-e9a044bb1052@google.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
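One way to picture the role of a fallocend-style field is as a second lower bound on what may be trimmed. The sketch below is a userspace model under assumptions: the struct layout, the 512-page huge extent, and all model_ names are hypothetical, and the real shrinker logic is more involved. It only shows that pages reserved past i_size stop counting as excess once the effective end is the larger of i_size and the fallocated end.

```c
#include <assert.h>

#define MODEL_HPAGE_NR 512UL  /* assumed pages per huge page */

struct model_shmem_inode {
	unsigned long i_size_pages; /* pages covered by i_size, rounded up */
	unsigned long fallocend;    /* page index just past the fallocated extent, 0 if none */
};

/* Effective end: whichever of i_size and the fallocated end is larger. */
static unsigned long model_effective_end(const struct model_shmem_inode *info)
{
	return info->fallocend > info->i_size_pages ?
	       info->fallocend : info->i_size_pages;
}

/*
 * How many tail pages of the huge extent starting at huge_start could be
 * freed by splitting without breaking the fallocate reservation.
 */
static unsigned long model_freeable_tail(const struct model_shmem_inode *info,
					 unsigned long huge_start)
{
	unsigned long huge_end = huge_start + MODEL_HPAGE_NR;
	unsigned long keep = model_effective_end(info);

	assert(huge_start % MODEL_HPAGE_NR == 0);
	if (keep >= huge_end)
		return 0;               /* whole extent must stay */
	if (keep < huge_start)
		keep = huge_start;      /* whole extent is excess */
	return huge_end - keep;
}
```

A single end offset cannot describe many disjoint fallocated ranges, which is the imperfection the commit message acknowledges.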
2021-09-03huge tmpfs: fix fallocate(vanilla) advance over huge pagesHugh Dickins
Patch series "huge tmpfs: shmem_is_huge() fixes and cleanups". A series of huge tmpfs fixes and cleanups. This patch (of 9): shmem_fallocate() goes to a lot of trouble to leave its newly allocated pages !Uptodate, partly to identify and undo them on failure, partly to leave the overhead of clearing them until later. But the huge page case did not skip to the end of the extent, walked through the tail pages one by one, and appeared to work just fine: but in doing so, cleared and Uptodated the huge page, so there was no way to undo it on failure. And by setting Uptodate too soon, it messed up both its nr_falloced and nr_unswapped counts, so that the intended "time to give up" heuristic did not work at all. Now advance immediately to the end of the huge extent, with a comment on why this is more than just an optimization. But although this speeds up huge tmpfs fallocation, it does leave the clearing until first use, and some users may have come to appreciate slow fallocate but fast first use: if they complain, then we can consider adding a pass to clear at the end. Link: https://lkml.kernel.org/r/da632211-8e3e-6b1-aee-ab24734429a0@google.com Link: https://lkml.kernel.org/r/16201bd2-70e-37e2-e89b-5f929430da@google.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
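The difference between walking tail pages one by one and skipping to the end of the huge extent can be sketched with a simple counting model. This is an illustration only: the 512-page extent size and the function model_falloc_visits() are assumptions, and the real loop does much more per iteration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_HPAGE_NR 512UL  /* assumed pages per huge page */

/*
 * Count how many times a fallocate-style loop touches a page while
 * covering [start, end). When the page found is huge, jump to the last
 * index of its extent so the loop increment moves past the whole extent.
 */
static unsigned long model_falloc_visits(unsigned long start,
					 unsigned long end, bool huge)
{
	unsigned long visits = 0;

	assert(start <= end);
	for (unsigned long index = start; index < end; index++) {
		visits++; /* one page lookup or allocation per visit */
		if (huge)
			index = index / MODEL_HPAGE_NR * MODEL_HPAGE_NR
				+ MODEL_HPAGE_NR - 1;
	}
	return visits;
}
```

In this model a 1024-page range backed by huge pages is covered in two visits instead of 1024, so each huge page is touched once and is not revisited through its tail pages.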
2021-09-03shmem: include header file to declare swap_infoMiaohe Lin
It's bad to declare swap_info[] extern in a .c file. Include the corresponding header file instead. Link: https://lkml.kernel.org/r/20210812120350.49801-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03shmem: remove unneeded function forward declarationMiaohe Lin
The forward declarations for shmem_should_replace_page() and shmem_replace_page() are unnecessary. Remove them. Link: https://lkml.kernel.org/r/20210812120350.49801-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03shmem: remove unneeded header fileMiaohe Lin
mfill_atomic_install_pte() is introduced to install pte and update mmu cache since commit bf6ebd97aba0 ("userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte()"). So we should remove tlbflush.h as update_mmu_cache() is not called here now. Link: https://lkml.kernel.org/r/20210812120350.49801-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03shmem: remove unneeded variable retMiaohe Lin
Patch series "Cleanups for shmem". This series contains cleanups to remove unneeded variable, header file, function forward declaration and so on. More details can be found in the respective changelogs. This patch (of 4): The local variable ret is always equal to -ENOMEM and never touched. So remove it and return -ENOMEM directly to simplify the code. Link: https://lkml.kernel.org/r/20210812120350.49801-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210812120350.49801-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03shmem: use raw_spinlock_t for ->stat_lockSebastian Andrzej Siewior
Each CPU has SHMEM_INO_BATCH inodes available in `->ino_batch' which is per-CPU. Access here is serialized by disabling preemption. If the pool is empty, it gets reloaded from `->next_ino'. Access here is serialized by ->stat_lock which is a spinlock_t and cannot be acquired with preemption disabled. One way around it would be to make the per-CPU ino_batch a struct containing the inode number and a local_lock_t. Another solution is to promote ->stat_lock to a raw_spinlock_t. The critical sections are short. The mpol_put() must be moved outside of the critical section to avoid invoking the destructor with preemption disabled. Link: https://lkml.kernel.org/r/20210806142916.jdwkb5bx62q5fwfo@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: delete unused get_kernel_page()John Hubbard
get_kernel_page() was added in 2012 by [1]. It was used for a while for NFS, but then in 2014, a refactoring [2] removed all callers, and it has apparently not been used since. Remove get_kernel_page() because it has no callers. [1] commit 18022c5d8627 ("mm: add get_kernel_page[s] for pinning of kernel addresses for I/O") [2] commit 91f79c43d1b5 ("new helper: iov_iter_get_pages_alloc()") Link: https://lkml.kernel.org/r/20210729221847.1165665-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03fs, mm: fix race in unlinking swapfileHugh Dickins
We had a recurring situation in which admin procedures setting up swapfiles would race with test preparation clearing away swapfiles; and just occasionally that got stuck on a swapfile "(deleted)" which could never be swapped off. That is not supposed to be possible. 2.6.28 commit f9454548e17c ("don't unlink an active swapfile") admitted that it was leaving a race window open: now close it. may_delete() makes the IS_SWAPFILE check (amongst many others) before inode_lock has been taken on target: now repeat just that simple check in vfs_unlink() and vfs_rename(), after taking inode_lock. Which goes most of the way to fixing the race, but swapon() must also check after it acquires inode_lock, that the file just opened has not already been unlinked. Link: https://lkml.kernel.org/r/e17b91ad-a578-9a15-5e3-4989e0f999b5@google.com Fixes: f9454548e17c ("don't unlink an active swapfile") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
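The fix follows a common "check again after taking the lock" pattern, which can be sketched in userspace with a mutex standing in for inode_lock. Everything here is a model: struct model_swap_inode, model_unlink() and model_swapon() are hypothetical names, and -1 is a placeholder for whatever error the real paths return.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct model_swap_inode {
	pthread_mutex_t lock;   /* stands in for inode_lock */
	bool is_swapfile;
	bool unlinked;
};

/* Unlink side: early unlocked check, then the same check under the lock. */
static int model_unlink(struct model_swap_inode *inode)
{
	assert(inode != NULL);
	if (inode->is_swapfile)         /* early check, may be stale */
		return -1;

	pthread_mutex_lock(&inode->lock);
	if (inode->is_swapfile) {       /* repeated once the lock is held */
		pthread_mutex_unlock(&inode->lock);
		return -1;
	}
	inode->unlinked = true;
	pthread_mutex_unlock(&inode->lock);
	return 0;
}

/* Swapon side: refuse a file that was unlinked before we got the lock. */
static int model_swapon(struct model_swap_inode *inode)
{
	int ret = 0;

	assert(inode != NULL);
	pthread_mutex_lock(&inode->lock);
	if (inode->unlinked)
		ret = -1;
	else
		inode->is_swapfile = true;
	pthread_mutex_unlock(&inode->lock);
	return ret;
}
```

Because both sides make their decision while holding the same lock, whichever runs first wins and the other sees its result, so the model can never end up with an active swapfile that is also unlinked.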
2021-09-03mm/gup: remove try_get_page(), call try_get_compound_head() directlyJohn Hubbard
try_get_page() is very similar to try_get_compound_head(), and in fact try_get_page() has fallen a little behind in terms of maintenance: try_get_compound_head() handles speculative page references more thoroughly. There are only two try_get_page() callsites, so just call try_get_compound_head() directly from those, and remove try_get_page() entirely. Also, seeing as how this changes try_get_compound_head() into a non-static function, provide some kerneldoc documentation for it. Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/gup: small refactoring: simplify try_grab_page()John Hubbard
try_grab_page() does the same thing as try_grab_compound_head(..., refs=1, ...), just with a different API. So there is a lot of code duplication there. Change try_grab_page() to call try_grab_compound_head(), while keeping the API contract identical for callers. Also, now that try_grab_compound_head() always has a caller, remove the __maybe_unused annotation. Link: https://lkml.kernel.org/r/20210813044133.1536842-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/gup: documentation corrections for gup/pupJohn Hubbard
Patch series "A few gup refactorings and documentation updates", v3. While reviewing some of the other things going on around gup.c, I noticed that the documentation was wrong for a few of the routines that I wrote. And then I noticed that there was some significant code duplication too. So this fixes those issues. This is not entirely risk-free, but after looking closely at this, I think it's actually a useful improvement, getting rid of the code duplication here. This patch (of 3): The documentation for try_grab_compound_head() and try_grab_page() has fallen a little out of date. Update and clarify a few points. Also make it kerneldoc-correct, by adding @args documentation. Link: https://lkml.kernel.org/r/20210813044133.1536842-1-jhubbard@nvidia.com Link: https://lkml.kernel.org/r/20210813044133.1536842-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: gup: use helper PAGE_ALIGNED in populate_vma_page_range()Miaohe Lin
Use helper PAGE_ALIGNED to check if address is aligned to PAGE_SIZE. Minor readability improvement. Link: https://lkml.kernel.org/r/20210807093620.21347-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
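For readers unfamiliar with the helper, the following userspace sketch shows what an alignment macro of this kind computes and that it agrees with an open-coded mask test. The MODEL_ macros and the 4 KiB page size are assumptions made for the example; the kernel's own definitions live in its headers and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumed page size */
#define MODEL_IS_ALIGNED(x, a)   (((x) & ((a) - 1)) == 0)
#define MODEL_PAGE_ALIGNED(addr) \
	MODEL_IS_ALIGNED((unsigned long)(addr), MODEL_PAGE_SIZE)

/* Open-coded form: test the low bits of the address directly. */
static bool model_open_coded_aligned(unsigned long addr)
{
	return (addr & (MODEL_PAGE_SIZE - 1)) == 0;
}

/* Same test through the named helper, which states the intent. */
static bool model_helper_aligned(unsigned long addr)
{
	/* The mask trick is only valid for power-of-two sizes. */
	assert((MODEL_PAGE_SIZE & (MODEL_PAGE_SIZE - 1)) == 0);
	return MODEL_PAGE_ALIGNED(addr);
}
```

Both functions return the same result for every address; the helper form simply reads as a statement about alignment rather than as bit arithmetic.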
2021-09-03mm: gup: fix potential pgmap refcnt leak in __gup_device_huge()Miaohe Lin
When try_grab_page() fails, put_dev_pagemap() is not called, so the pgmap refcount leaks in this case. Also remove the check of pgmap against NULL, as put_dev_pagemap() already performs it. [akpm@linux-foundation.org: simplify, cleanup] [akpm@linux-foundation.org: fix return value] Link: https://lkml.kernel.org/r/20210807093620.21347-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages") Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
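The leak and its fix come down to routing every exit through one release point. The sketch below models that in userspace with a plain counter; struct model_pgmap, the model_ functions, and the fail_at parameter are hypothetical and exist only to simulate a grab failure on a chosen iteration.

```c
#include <assert.h>
#include <stddef.h>

struct model_pgmap { int refcount; };

static struct model_pgmap *model_get_pgmap(struct model_pgmap *p)
{
	p->refcount++;
	return p;
}

/* NULL-safe release, so callers do not need their own NULL check. */
static void model_put_pgmap(struct model_pgmap *p)
{
	if (!p)
		return;
	assert(p->refcount > 0);
	p->refcount--;
}

/*
 * Walk nr pages. fail_at simulates a failed grab on that iteration
 * (use -1 for "never fail"). Returns 1 on success, 0 on failure.
 */
static int model_gup_device(struct model_pgmap *map, int nr, int fail_at)
{
	struct model_pgmap *pgmap = NULL;
	int ret = 1;

	for (int i = 0; i < nr; i++) {
		if (!pgmap)
			pgmap = model_get_pgmap(map);
		if (i == fail_at) {
			/* A direct return here would leak the reference. */
			ret = 0;
			break;
		}
	}
	model_put_pgmap(pgmap); /* single exit covers success and failure */
	return ret;
}
```

After any call the model's refcount is back to its starting value, including the nr == 0 case where no reference was ever taken and the release sees NULL.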
2021-09-03mm: gup: remove useless BUG_ON in __get_user_pages()Miaohe Lin
Indeed, this BUG_ON couldn't catch anything useful. We are sure ret == 0 here because we would already bail out if ret != 0 and ret is untouched till here. Link: https://lkml.kernel.org/r/20210807093620.21347-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>