path: root/mm
Age  Commit message  Author
2021-02-05  mm: hugetlb: fix a race between freeing and dissolving the page  (Muchun Song)
There is a race condition between __free_huge_page() and dissolve_free_huge_page():

    CPU0:                             CPU1:

                                      // page_count(page) == 1
                                      put_page(page)
                                        __free_huge_page(page)
    dissolve_free_huge_page(page)
      spin_lock(&hugetlb_lock)
      // PageHuge(page) && !page_count(page)
      update_and_free_page(page)
      // page is freed to the buddy
      spin_unlock(&hugetlb_lock)
                                          spin_lock(&hugetlb_lock)
                                          clear_page_huge_active(page)
                                          enqueue_huge_page(page)
                                          // It is wrong, the page is already freed
                                          spin_unlock(&hugetlb_lock)

The race window is between put_page() and dissolve_free_huge_page(); as a result, __free_huge_page() would corrupt page(s) already freed to the buddy allocator. We should make sure that the page is already on the free list when it is dissolved. Link: https://lkml.kernel.org/r/20210115124942.46403-4-songmuchun@bytedance.com Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
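A minimal sketch of the direction such a fix takes: dissolve_free_huge_page() backs off until the freeing side has actually put the page on the hugetlb free list. The flag name and the retry loop below are assumptions for illustration, not a quote of the final patch.

    /* Sketch: only dissolve a free huge page once it really sits on the
     * hugetlb free list; otherwise __free_huge_page() is still running. */
    static int dissolve_free_huge_page_sketch(struct page *page)
    {
            int rc = -EBUSY;
    retry:
            spin_lock(&hugetlb_lock);
            if (PageHuge(page) && !page_count(page)) {
                    struct page *head = compound_head(page);

                    if (!HPageFreed(head)) {            /* assumed flag name */
                            /* not enqueued yet: drop the lock and retry */
                            spin_unlock(&hugetlb_lock);
                            cond_resched();
                            goto retry;
                    }
                    /* safe: take it off the free list and free it for real */
                    rc = 0;
            }
            spin_unlock(&hugetlb_lock);
            return rc;
    }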
2021-02-05  mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page  (Muchun Song)
If a new hugetlb page is allocated during fallocate it will not be marked as active (set_page_huge_active), which will result in a later isolate_huge_page() failure when the page migration code would like to move that page. Such a failure would be unexpected and wrong. Only export set_page_huge_active and leave clear_page_huge_active static, because there are no external users. Link: https://lkml.kernel.org/r/20210115124942.46403-3-songmuchun@bytedance.com Fixes: 70c3547e36f5 ("hugetlbfs: add hugetlbfs_fallocate()") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-29  tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu()  (Will Deacon)
The 'start' and 'end' arguments to tlb_gather_mmu() are no longer needed now that there is a separate function for 'fullmm' flushing. Remove the unused arguments and update all callers. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com
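The shape of the change at call sites, as a sketch based on the description above (caller context is illustrative):

    /* Before: every caller passed a range, even though it was only
     * used to detect the 0..-1 "flush everything" case. */
    tlb_gather_mmu(&tlb, mm, start, end);

    /* After: the unused range arguments are gone; full-mm teardown now
     * goes through the dedicated tlb_gather_mmu_fullmm() helper. */
    tlb_gather_mmu(&tlb, mm);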
2021-01-29  tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm()  (Will Deacon)
Passing the range '0, -1' to tlb_gather_mmu() sets the 'fullmm' flag, which indicates that the mm_struct being operated on is going away. In this case, some architectures (such as arm64) can elide TLB invalidation by ensuring that the TLB tag (ASID) associated with this mm is not immediately reclaimed. Although this behaviour is documented in asm-generic/tlb.h, it's subtle and easily missed. Introduce tlb_gather_mmu_fullmm() to make it clearer that this is for the entire mm and WARN() if tlb_gather_mmu() is called with the 'fullmm' address range. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20210127235347.1402-4-will@kernel.org
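A sketch of how an mm-teardown path would use the new helper instead of the magic '0, -1' range (the surrounding calls are illustrative, not taken from a specific caller):

    /* The mm is going away: let the architecture elide per-page TLB
     * invalidation, e.g. by not immediately reusing the ASID. */
    tlb_gather_mmu_fullmm(&tlb, mm);
    free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
    tlb_finish_mmu(&tlb);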
2021-01-29  tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu()  (Will Deacon)
Since commit 7a30df49f63a ("mm: mmu_gather: remove __tlb_reset_range() for force flush"), the 'start' and 'end' arguments to tlb_finish_mmu() are no longer used, since we flush the whole mm in case of a nested invalidation. Remove the unused arguments and update all callers. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org
2021-01-28  Revert "mm/slub: fix a memory leak in sysfs_slab_add()"  (Wang Hai)
This reverts commit dde3c6b72a16c2db826f54b2d49bdea26c3534a2. syzbot reported a double-free bug, which can be triggered as follows:

  - mm/slab_common.c: create_cache(): if __kmem_cache_create() fails, it does:
        out_free_cache:
                kmem_cache_free(kmem_cache, s);
  - but __kmem_cache_create() - at least for slub - will have done
        sysfs_slab_add(s)
          -> sysfs_create_group() .. fails ..
          -> kobject_del(&s->kobj); .. which frees s ...

We can't remove the kmem_cache_free() in create_cache(), because other error cases of __kmem_cache_create() do not free this. So, revert commit dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") to fix this. Reported-by: syzbot+d0bd96b4696c1ef67991@syzkaller.appspotmail.com Fixes: dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Wang Hai <wanghai38@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-28  mm/nommu: Fix return type of filemap_map_pages()  (Geert Uytterhoeven)
If CONFIG_MMU is not set (e.g. m68k/m5272c3_defconfig):

    mm/nommu.c:1671:6: error: conflicting types for ‘filemap_map_pages’
     1671 | void filemap_map_pages(struct vm_fault *vmf,
          |      ^~~~~~~~~~~~~~~~~
    In file included from mm/nommu.c:20:
    ./include/linux/mm.h:2578:19: note: previous declaration of ‘filemap_map_pages’ was here
     2578 | extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
          |                   ^~~~~~~~~~~~~~~~~

The signature of filemap_map_pages() was changed, but the nommu implementation wasn't updated. Reported-by: noreply@ellerman.id.au Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20210128100626.2257638-1-geert@linux-m68k.org Signed-off-by: Will Deacon <will@kernel.org>
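A sketch of the corresponding nommu stub after the fix; the pgoff parameters follow the new prototype quoted above, and the body shown here is an assumption:

    /* mm/nommu.c: make the !CONFIG_MMU stub match the MMU prototype. */
    vm_fault_t filemap_map_pages(struct vm_fault *vmf,
                                 pgoff_t start_pgoff, pgoff_t end_pgoff)
    {
            BUG();
            return 0;
    }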
2021-01-27  mm: only make map_swap_entry available for CONFIG_HIBERNATION  (Jens Axboe)
Current tree spews this on compile:

    mm/swapfile.c:2290:17: warning: ‘map_swap_entry’ defined but not used [-Wunused-function]
     2290 | static sector_t map_swap_entry(swp_entry_t entry, struct block_device **bdev)
          |                 ^~~~~~~~~~~~~~

if !CONFIG_HIBERNATION, as we don't use the function unless we have that config option set. Fixes: 48d15436fde6 ("mm: remove get_swap_bio") Signed-off-by: Jens Axboe <axboe@kernel.dk>
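The message implies a guard along these lines; a sketch with the function body elided (the exact placement of the #ifdef is an assumption):

    /* mm/swapfile.c: only build the helper when hibernation can call it. */
    #ifdef CONFIG_HIBERNATION
    static sector_t map_swap_entry(swp_entry_t entry, struct block_device **bdev)
    {
            /* ... unchanged body: map a swap entry to a device sector ... */
    }
    #endif /* CONFIG_HIBERNATION */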
2021-01-27  mm: remove get_swap_bio  (Christoph Hellwig)
Reuse the block_device and sector from the swap_info structure, just as the SWP_SYNCHRONOUS path already does. Also remove the checks for NULL returns from bio_alloc, as that can't happen for sleeping allocations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-26  Revert "mm: fix initialization of struct page for holes in memory layout"  (Linus Torvalds)
This reverts commit d3921cb8be29ce5668c64e23ffdaeec5f8c69399. Chris Wilson reports that it causes boot problems: "We have half a dozen or so different machines in CI that are silently failing to boot, that we believe is bisected to this patch" and the CI team confirmed that a revert fixed the issues. The cause is unknown for now, so let's revert it. Link: https://lore.kernel.org/lkml/161160687463.28991.354987542182281928@build.alporthouse.com/ Reported-and-tested-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24  block: store a block_device pointer in struct bio  (Christoph Hellwig)
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows to directly look up all information related to partition remapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24  mm/highmem: prepare for overriding set_pte_at()  (Thomas Gleixner)
The generic kmap_local() map function uses set_pte_at(), but MIPS requires set_pte() and PowerPC wants __set_pte_at(). Provide arch_kmap_local_set_pte() and default it to set_pte_at(). Link: https://lkml.kernel.org/r/20210112170411.056306194@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Cercueil <paul@crapouillou.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
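The default described above would look roughly like this; a sketch of the generic fallback (the exact macro arguments are an assumption):

    /* Let architectures override the primitive used by kmap_local()
     * to establish the temporary mapping; default to set_pte_at(). */
    #ifndef arch_kmap_local_set_pte
    #define arch_kmap_local_set_pte(mm, vaddr, ptep, ptev)  \
            set_pte_at(mm, vaddr, ptep, ptev)
    #endif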
2021-01-24  mm: fix page reference leak in soft_offline_page()  (Dan Williams)
The conversion to move pfn_to_online_page() internal to soft_offline_page() missed that the get_user_pages() reference taken by the madvise() path needs to be dropped when pfn_to_online_page() fails. Note the direct sysfs-path to soft_offline_page() does not perform a get_user_pages() lookup. When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to a leaked reference. Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
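Conceptually the fix has to drop the madvise-side page reference on the early-exit path; a minimal sketch (the flag check and error code mirror the description above, but the exact code is an assumption):

    /* soft_offline_page(), sketch: the pfn is valid but not online
     * (e.g. a dax pfn), so drop the reference taken by the madvise()
     * path before bailing out, or the dax device can never shut down. */
    page = pfn_to_online_page(pfn);
    if (!page) {
            if (flags & MF_COUNT_INCREASED)
                    put_page(pfn_to_page(pfn));
            return -EIO;
    }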
2021-01-24  kasan, mm: fix resetting page_alloc tags for HW_TAGS  (Andrey Konovalov)
A previous commit added resetting KASAN page tags to kernel_init_free_pages() to avoid false positives due to accesses to metadata with the hardware tag-based mode. That commit did reset page tags before the metadata access, but didn't restore them after. As a result, KASAN fails to detect bad accesses to page_alloc allocations on some configurations. Fix this by recovering the tag after the metadata access. Link: https://lkml.kernel.org/r/02b5bcd692e912c27d484030f666b350ad7e4ae4.1611074450.git.andreyknvl@google.com Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
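A sketch of the save/restore pattern the fix describes, written against kernel_init_free_pages() as named above; treat it as illustrative rather than a quote of the patch:

    /* Save each page's KASAN tag, reset it around the metadata access
     * (the page clearing), then put the original tag back. */
    static void kernel_init_free_pages(struct page *page, int numpages)
    {
            int i;

            for (i = 0; i < numpages; i++) {
                    u8 tag = page_kasan_tag(page + i);

                    page_kasan_tag_reset(page + i);
                    clear_highpage(page + i);
                    page_kasan_tag_set(page + i, tag);      /* restore */
            }
    }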
2021-01-24  kasan, mm: fix conflicts with init_on_alloc/free  (Andrey Konovalov)
A few places where SLUB accesses object's data or metadata were missed in a previous patch. This leads to false positives with hardware tag-based KASAN when bulk allocations are used with init_on_alloc/free. Fix the false-positives by resetting pointer tags during these accesses. (The kasan_reset_tag call is removed from slab_alloc_node, as it's added into maybe_wipe_obj_freeptr.) Link: https://linux-review.googlesource.com/id/I50dd32838a666e173fe06c3c5c766f2c36aae901 Link: https://lkml.kernel.org/r/093428b5d2ca8b507f4a79f92f9929b35f7fada7.1610731872.git.andreyknvl@google.com Fixes: aa1ef4d7b3f67 ("kasan, mm: reset tags when accessing metadata") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24  kasan: fix HW_TAGS boot parameters  (Andrey Konovalov)
The initially proposed KASAN command line parameters are redundant. This change drops the complex "kasan.mode=off/prod/full" parameter and adds a simpler kill switch "kasan=off/on" instead. The new parameter, together with the already existing ones, provides a cleaner way to express the same set of features. The full set of parameters with this change:

    kasan=off/on             - whether KASAN is enabled
    kasan.fault=report/panic - whether to only print a report or also panic
    kasan.stacktrace=off/on  - whether to collect alloc/free stack traces

Default values:

    kasan=on
    kasan.fault=report
    kasan.stacktrace=on  (if CONFIG_DEBUG_KERNEL=y)
    kasan.stacktrace=off (otherwise)

Link: https://linux-review.googlesource.com/id/Ib3694ed90b1e8ccac6cf77dfd301847af4aba7b8 Link: https://lkml.kernel.org/r/4e9c4a4bdcadc168317deb2419144582a9be6e61.1610736745.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24  kasan: fix incorrect arguments passing in kasan_add_zero_shadow  (Lecopzer Chen)
kasan_remove_zero_shadow() shall be called with the original virtual address and size (start, size), not with the shadow address. Link: https://lkml.kernel.org/r/20210103063847.5963-1-lecopzer@gmail.com Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN") Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24  kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow  (Lecopzer Chen)
During testing of kasan_populate_early_shadow and kasan_remove_zero_shadow, if the shadow start and end address in kasan_remove_zero_shadow() are not aligned to PMD_SIZE, the remaining unaligned PTEs won't be removed. In the test case for kasan_remove_zero_shadow():

    shadow_start: 0xffffffb802000000, shadow end: 0xffffffbfbe000000

    3-level page table:
      PUD_SIZE:  0x40000000
      PMD_SIZE:  0x200000
      PAGE_SIZE: 4K

0xffffffbf80000000 ~ 0xffffffbfbdf80000 will not be removed, because in kasan_remove_pud_table() kasan_pmd_table(*pud) is true but the next address is 0xffffffbfbdf80000, which is not aligned to PUD_SIZE. In the correct condition this should fall back to the next level, kasan_remove_pmd_table(), but the condition flow always continues and skips the unaligned part. Fix this by correcting the condition when neither next nor addr is aligned. Link: https://lkml.kernel.org/r/20210103135621.83129-1-lecopzer@gmail.com Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN") Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: YJ Chiang <yj.chiang@mediatek.com> Cc: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24  mm: fix numa stats for thp migration  (Shakeel Butt)
Currently the kernel does not correctly update the numa stats for NR_FILE_PAGES and NR_SHMEM on THP migration. Fix that. For NR_FILE_DIRTY and NR_ZONE_WRITE_PENDING there is no need to handle THP migration at the moment, since the kernel still does not have write support for file THP, but to be more future-proof this patch adds THP support for those stats as well. Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com Fixes: e71769ae52609 ("mm: enable thp migration for shmem thp") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
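The underlying pattern is to move the counters by the compound page's page count instead of by one; a sketch in the style of migrate_page_move_mapping() (variable names here are assumptions for illustration):

    /* Account the whole THP, not a single page, using the memcg-aware
     * lruvec counters so per-node and per-memcg numa stats stay in sync. */
    int nr = thp_nr_pages(page);

    __mod_lruvec_state(old_lruvec, NR_FILE_PAGES, -nr);
    __mod_lruvec_state(new_lruvec, NR_FILE_PAGES, nr);
    if (PageSwapBacked(page) && !PageSwapCache(page)) {
            __mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
            __mod_lruvec_state(new_lruvec, NR_SHMEM, nr);
    }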
2021-01-24  mm: memcg: fix memcg file_dirty numa stat  (Shakeel Butt)
The kernel updates the per-node NR_FILE_DIRTY stats on page migration but not the corresponding memcg numa stats. That was not an issue until commit 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2") recently exposed numa stats for the memcg. So fix the file_dirty per-memcg numa stat. Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com Fixes: 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24  mm: memcg/slab: optimize objcg stock draining  (Roman Gushchin)
Imran Khan reported a 16% regression in hackbench results caused by commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages"). The regression is noticeable in the case of a consecutive allocation of several relatively large slab objects, e.g. skb's. As soon as the amount of stocked bytes exceeds PAGE_SIZE, drain_obj_stock() and __memcg_kmem_uncharge() are called, and it leads to a number of atomic operations in page_counter_uncharge(). The corresponding call graph is below (provided by Imran Khan):

    |__alloc_skb
    |    |
    |    |__kmalloc_reserve.isra.61
    |    |    |
    |    |    |__kmalloc_node_track_caller
    |    |    |    |
    |    |    |    |slab_pre_alloc_hook.constprop.88
    |    |    |     obj_cgroup_charge
    |    |    |    |    |
    |    |    |    |    |__memcg_kmem_charge
    |    |    |    |    |    |
    |    |    |    |    |    |page_counter_try_charge
    |    |    |    |    |
    |    |    |    |    |refill_obj_stock
    |    |    |    |    |    |
    |    |    |    |    |    |drain_obj_stock.isra.68
    |    |    |    |    |    |    |
    |    |    |    |    |    |    |__memcg_kmem_uncharge
    |    |    |    |    |    |    |    |
    |    |    |    |    |    |    |    |page_counter_uncharge
    |    |    |    |    |    |    |    |    |
    |    |    |    |    |    |    |    |    |page_counter_cancel
    |    |    |    |
    |    |    |    |__slab_alloc
    |    |    |    |    |
    |    |    |    |    |___slab_alloc
    |    |    |    |
    |    |    |    |slab_post_alloc_hook

Instead of directly uncharging the accounted kernel memory, it's possible to refill the generic page-sized per-cpu stock instead. It's a much faster operation, especially on a default hierarchy. As a bonus, __memcg_kmem_uncharge_page() will also get faster, so the freeing of page-sized kernel allocations (e.g. large kmallocs) will become faster. A similar change has been done earlier for the socket memory by commit 475d0487a2ad ("mm: memcontrol: use per-cpu stocks for socket memory uncharging"). Link: https://lkml.kernel.org/r/20210106042239.2860107-1-guro@fb.com Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages") Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Imran Khan <imran.f.khan@oracle.com> Tested-by: Imran Khan <imran.f.khan@oracle.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24  mm: fix initialization of struct page for holes in memory layout  (Mike Rapoport)
There could be struct pages that are not backed by actual physical memory. This can happen when the actual memory bank is not a multiple of SECTION_SIZE or when an architecture does not register memory holes reserved by the firmware as memblock.memory. Such pages are currently initialized using the init_unavailable_mem() function, which iterates through PFNs in holes in memblock.memory and, if there is a struct page corresponding to a PFN, sets the fields of this page to default values and marks the page as Reserved. init_unavailable_mem() does not take into account the zone and node the page belongs to and sets both zone and node links in struct page to zero. On a system that has firmware-reserved holes in a zone above ZONE_DMA, for instance in a configuration below:

    # grep -A1 E820 /proc/iomem
    7a17b000-7a216fff : Unknown E820 type
    7a217000-7bffffff : System RAM

the unset zone link in struct page will trigger

    VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

because there are pages in both ZONE_DMA32 and ZONE_DMA (with an unset zone link in struct page) in the same pageblock. Update init_unavailable_mem() to use zone constraints defined by an architecture to properly set up the zone link, and use the node ID of the adjacent range in memblock.memory to set the node link. Link: https://lkml.kernel.org/r/20210111194017.22696-3-rppt@kernel.org Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24  fs: make helpers idmap mount aware  (Christian Brauner)
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested, we simply extend the existing inode methods instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesn't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24  stat: handle idmapped mounts  (Christian Brauner)
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24  acl: handle idmapped mounts  (Christian Brauner)
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions, such as the ACL_USER and ACL_GROUP type, according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts, whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition, the filesystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode; let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems; to handle this case we extend the ->set() method to take an additional user namespace argument so the mount's user namespace can be passed down. Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24  attr: handle idmapped mounts  (Christian Brauner)
When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24  inode: make init and permission helpers idmapped mount aware  (Christian Brauner)
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount, map it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
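After this series the two helpers gain a user namespace argument; a sketch of the resulting call shape (signatures as written here are my reading of the description, with init_user_ns preserving the old behavior):

    /* Idmapped-mount-aware callers pass the mount's user namespace;
     * passing &init_user_ns keeps the historical behavior. */
    if (!inode_owner_or_capable(mnt_userns, inode))
            return -EPERM;

    /* New inodes map the caller's fsuid/fsgid through the same namespace. */
    inode_init_owner(mnt_userns, inode, dir, mode);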
2021-01-24  fs: add file and path permissions helpers  (Christian Brauner)
Add two simple helpers to check permissions on a file and path respectively and convert over some callers. It simplifies quite a few codepaths and also reduces the churn in later patches quite a bit. Christoph also correctly points out that this makes codepaths (e.g. ioctls) way easier to follow that would otherwise have to do more complex argument passing than necessary. Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
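The helpers referred to above boil down to thin wrappers; a sketch of how a caller would use them (names follow the commit series, but treat the exact signatures as an assumption):

    /* Check MAY_* permissions against a struct file or a struct path
     * instead of open-coding inode_permission() on file_inode(). */
    int err = file_permission(file, MAY_READ);
    if (!err)
            err = path_permission(&path, MAY_WRITE);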
2021-01-22  mm: Make mem_dump_obj() vmalloc() dumps include start and length  (Paul E. McKenney)
This commit adds the starting address and number of pages to the vmalloc() information dumped by way of vmalloc_dump_obj(). Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <linux-mm@kvack.org> Reported-by: Andrii Nakryiko <andrii@kernel.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-22  mm: Make mem_dump_obj() handle vmalloc() memory  (Paul E. McKenney)
This commit adds vmalloc() support to mem_dump_obj(). Note that the vmalloc_dump_obj() function combines the checking and dumping, in contrast with the split between kmem_valid_obj() and kmem_dump_obj(). The reason for the difference is that the checking in the vmalloc() case involves acquiring a global lock, and redundant acquisitions of global locks should be avoided, even on not-so-fast paths. Note that this change causes on-stack variables to be reported as vmalloc() storage from kernel_clone() or similar, depending on the degree of inlining that your compiler does. This is likely more helpful than the earlier "non-paged (local) memory". Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <linux-mm@kvack.org> Reported-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-22  mm: Make mem_dump_obj() handle NULL and zero-sized pointers  (Paul E. McKenney)
This commit makes mem_dump_obj() call out NULL and zero-sized pointers specially instead of classifying them as non-paged memory. Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <linux-mm@kvack.org> Reported-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-22  mm: Add mem_dump_obj() to print source of memory block  (Paul E. McKenney)
There are kernel facilities such as per-CPU reference counts that give error messages in generic handlers or callbacks, whose messages are unenlightening. In the case of per-CPU reference-count underflow, this is not a problem when creating a new use of this facility because in that case the bug is almost certainly in the code implementing that new use. However, trouble arises when deploying across many systems, which might exercise corner cases that were not seen during development and testing. Here, it would be really nice to get some kind of hint as to which of several uses the underflow was caused by. This commit therefore exposes a mem_dump_obj() function that takes a pointer to memory (which must still be allocated if it has been dynamically allocated) and prints available information on where that memory came from. This pointer can reference the middle of the block as well as the beginning of the block, as needed by things like RCU callback functions and timer handlers that might not know where the beginning of the memory block is. These functions and handlers can use mem_dump_obj() to print out better hints as to where the problem might lie. The information printed can depend on kernel configuration. For example, the allocation return address can be printed only for slab and slub, and even then only when the necessary debug has been enabled. For slab, build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space to the next power of two or use the SLAB_STORE_USER when creating the kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create() if more focused use is desired. Also for slub, use CONFIG_STACKTRACE to enable printing of the allocation-time stack trace. Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <linux-mm@kvack.org> Reported-by: Andrii Nakryiko <andrii@kernel.org> [ paulmck: Convert to printing and change names per Joonsoo Kim. ] [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ] [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ] [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ] [ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ] [ paulmck: Explicitly check for small pointers per Naresh Kamboju. ] Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
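A hedged usage sketch of the new helper, e.g. from a callback that only has a pointer into the middle of an allocation (the surrounding context is invented for illustration):

    /* Print whatever provenance is available for the memory backing @ptr:
     * slab/vmalloc origin and, with the debug options described above,
     * the allocation-time return address or stack trace. */
    static void report_bad_ref(void *ptr)
    {
            pr_err("per-CPU ref underflow, object hint follows:\n");
            mem_dump_obj(ptr);
    }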
2021-01-21  mm: Use static initialisers for immutable fields of 'struct vm_fault'  (Will Deacon)
In preparation for const-ifying the anonymous struct field of 'struct vm_fault', ensure that it is initialised using designated initialisers. Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Will Deacon <will@kernel.org>
2021-01-21  mm: Avoid modifying vmf.address in __collapse_huge_page_swapin()  (Will Deacon)
In preparation for const-ifying the anonymous struct field of 'struct vm_fault', rework __collapse_huge_page_swapin() to avoid continuously updating vmf.address and instead populate a new 'struct vm_fault' on the stack for each page being processed. Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org>
2021-01-21  mm: Pass 'address' to map to do_set_pte() and drop FAULT_FLAG_PREFAULT  (Will Deacon)
Rather than modifying the 'address' field of the 'struct vm_fault' passed to do_set_pte(), leave that to identify the real faulting address and pass in the virtual address to be mapped by the new pte as a separate argument. This makes FAULT_FLAG_PREFAULT redundant, as a prefault entry can be identified simply by comparing the new address parameter with the faulting address, so remove the redundant flag at the same time. Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org>
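A sketch of the resulting calling convention (the prefault test shown is inferred from the description; the argument order is an assumption):

    /* do_set_pte() now receives the address actually being mapped; a
     * prefault is simply "addr != vmf->address", so the
     * FAULT_FLAG_PREFAULT bit is no longer needed. */
    void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
    {
            bool prefault = vmf->address != addr;

            /* ... build the pte; mark it old if prefault and the
             * architecture asked for old prefaulted entries ... */
    }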
2021-01-21  mm/memblock: Fix typo in comment of memblock_phys_alloc_try_nid()  (Levi Yun)
The comment of memblock_phys_alloc_try_nid() spells NUMA as MUMA. Correct this typo. Signed-off-by: Levi Yun <ppbuk5246@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
2021-01-20  mm: Allow architectures to request 'old' entries when prefaulting  (Will Deacon)
Commit 5c0a85fad949 ("mm: make faultaround produce old ptes") changed the "faultaround" behaviour to initialise prefaulted PTEs as 'old', since this avoids vmscan wrongly assuming that they are hot, despite having never been explicitly accessed by userspace. The change has been shown to benefit numerous arm64 micro-architectures (with hardware access flag) running Android, where both application launch latency and direct reclaim time are significantly reduced (by 10%+ and ~80% respectively). Unfortunately, commit 315d09bf30c2 ("Revert "mm: make faultaround produce old ptes"") reverted the change due to it being identified as the cause of a ~6% regression in unixbench on x86. Experiments on a variety of recent arm64 micro-architectures indicate that unixbench is not affected by the original commit, which appears to yield a 0-1% performance improvement. Since one size does not fit all for the initial state of prefaulted PTEs, introduce arch_wants_old_prefaulted_pte(), which allows an architecture to opt-in to 'old' prefaulted PTEs at runtime based on whatever criteria it may have. Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Will Deacon <will@kernel.org>
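The opt-in hook described above needs a generic default that keeps today's behaviour; a sketch (an assumption based on the description, which arm64 would then override):

    /* Generic default: prefaulted PTEs stay 'young'. Architectures with
     * a hardware access flag (e.g. arm64) can override this to return
     * true at runtime and get 'old' prefaulted PTEs instead. */
    #ifndef arch_wants_old_prefaulted_pte
    static inline bool arch_wants_old_prefaulted_pte(void)
    {
            return false;
    }
    #endif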
2021-01-20  mm: Cleanup faultaround and finish_fault() codepaths  (Kirill A. Shutemov)
alloc_set_pte() has two users with different requirements: in the faultaround code it is called from an atomic context and the PTE page table has to be preallocated, while finish_fault() can sleep and allocate the page table as needed. The PTL locking rules are also strange, hard to follow and overkill for finish_fault(). Let's untangle the mess. alloc_set_pte() is gone now. All locking is explicit. The price is some code duplication to handle huge pages in the faultaround path, but it should be fine, giving an overall improvement in readability. Link: https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [will: s/from from/from/ in comment; spotted by willy] Signed-off-by: Will Deacon <will@kernel.org>
2021-01-17  mm: don't put pinned pages into the swap cache  (Linus Torvalds)
So technically there is nothing wrong with adding a pinned page to the swap cache, but the pinning obviously means that the page can't actually be free'd right now anyway, so it's a bit pointless. However, the real problem is not with it being a bit pointless: the real issue is that after we've added it to the swap cache, we'll try to unmap the page. That will succeed, because the code in mm/rmap.c doesn't know or care about pinned pages. Even the unmapping isn't fatal per se, since the page will stay around in memory due to the pinning, and we do hold the connection to it using the swap cache. But when we then touch it next and take a page fault, the logic in do_swap_page() will map it back into the process as a possibly read-only page, and we'll then break the page association on the next COW fault. Honestly, this issue could have been fixed in any of those other places: (a) we could refuse to unmap a pinned page (which makes conceptual sense), or (b) we could make sure to re-map a pinned page writably in do_swap_page(), or (c) we could just make do_wp_page() not COW the pinned page (which was what we historically did before that "mm: do_wp_page() simplification" commit). But while all of them are equally valid models for breaking this chain, not putting pinned pages into the swap cache in the first place is the simplest one by far. It's also the safest one: the reason why do_wp_page() was changed in the first place was that getting the "can I re-use this page" wrong is so fraught with errors. If you do it wrong, you end up with an incorrectly shared page. As a result, using "page_maybe_dma_pinned()" in either do_wp_page() or do_swap_page() would be a serious bug since it is only a (very good) heuristic. Re-using the page requires a hard black-and-white rule with no room for ambiguity. In contrast, saying "this page is very likely dma pinned, so let's not add it to the swap cache and try to unmap it" is an obviously safe thing to do, and if the heuristic might very rarely be a false positive, no harm is done. Fixes: 09854ba94c6a ("mm: do_wp_page() simplification") Reported-and-tested-by: Martin Raiber <martin@urbackup.org> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-14  mm: memblock: remove return value of memblock_free_all()  (Daeseok Youn)
No one checks the return value of memblock_free_all(), so make it return void. memblock_free_all() is used in mem_init() on each architecture, and the total count of freed pages is already added to the _totalram_pages variable by calling totalram_pages_add(), so there is no need to return the total count of freed pages. Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
2021-01-12  mm/process_vm_access.c: include compat.h  (Andrew Morton)
Fix the build error: mm/process_vm_access.c:277:5: error: implicit declaration of function 'in_compat_syscall'; did you mean 'in_ia32_syscall'? [-Werror=implicit-function-declaration] Fixes: 38dc5079da7081e "Fix compat regression in process_vm_rw()" Reported-by: syzbot+5b0d0de84d6c65b8dd2b@syzkaller.appspotmail.com Cc: Kyle Huey <me@kylehuey.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-12  mm,hwpoison: fix printing of page flags  (Oscar Salvador)
Format %pG expects a lower case 'p' in order to print the flags. Fix it. Link: https://lkml.kernel.org/r/20210108085202.4506-1-osalvador@suse.de Fixes: 8295d535e2aa ("mm,hwpoison: refactor get_any_page") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-12  mm/hugetlb: fix potential missing huge page size info  (Miaohe Lin)
The huge page size is encoded for VM_FAULT_HWPOISON errors only. So if we return VM_FAULT_HWPOISON, huge page size would just be ignored. Link: https://lkml.kernel.org/r/20210107123449.38481-1-linmiaohe@huawei.com Fixes: aa50d3a7aa81 ("Encode huge page size for VM_FAULT_HWPOISON errors") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-12  mm: migrate: initialize err in do_migrate_pages  (Jan Stancek)
After commit 236c32eb1096 ("mm: migrate: clean up migrate_prep{_local}"), do_migrate_pages() can return the uninitialized variable 'err' (which is propagated to user space as an error) when the 'from' and 'to' nodesets are identical. This can be reproduced with LTP migrate_pages01, which calls migrate_pages() with the same set for both old/new_nodes. Add the 'err' initialization back. Link: https://lkml.kernel.org/r/456a021c7ef3636d7668cec9dcb4a446a4244812.1609855564.git.jstancek@redhat.com Fixes: 236c32eb1096 ("mm: migrate: clean up migrate_prep{_local}") Signed-off-by: Jan Stancek <jstancek@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
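The fix is a one-liner in spirit; a sketch with the surrounding function body elided (the return convention shown is an assumption):

    /* do_migrate_pages(), sketch: if the node sets are identical the
     * migration loop never assigns err, so it must start out as 0. */
    int err = 0;
    int busy = 0;

    /* ... walk 'from'/'to' and migrate; err is only set on failure ... */
    return err < 0 ? err : busy;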
2021-01-12  mm/vmalloc.c: fix potential memory leak  (Miaohe Lin)
In VM_MAP_PUT_PAGES case, we should put pages and free array in vfree. But we missed to set area->nr_pages in vmap(). So we would fail to put pages in __vunmap() because area->nr_pages = 0. Link: https://lkml.kernel.org/r/20210107123541.39206-1-linmiaohe@huawei.com Fixes: b944afc9d64d ("mm: add a VM_MAP_PUT_PAGES flag for vmap") Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
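A sketch of the missing assignment in vmap(), based on the description above (the field and flag names come from the message itself):

    /* vmap(), sketch: when the caller hands over page ownership,
     * remember both the array and the page count so that __vunmap()
     * can put the pages and free the array on vfree(). */
    if (flags & VM_MAP_PUT_PAGES) {
            area->pages = pages;
            area->nr_pages = count;
    }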
2021-01-12  arm/kasan: fix the array size of kasan_early_shadow_pte[]  (Hailong Liu)
The size of kasan_early_shadow_pte[] is currently PTRS_PER_PTE, which is defined as 512 for arm. This means that it only covers the Linux pte entries, but not the HWTABLE pte entries for arm. The reason it currently works is that the symbol kasan_early_shadow_page, immediately following kasan_early_shadow_pte in memory, is page aligned, which makes kasan_early_shadow_pte look like a 4KB-size array. But we can't ensure the order is always right with a different compiler/linker, or if more bss symbols are introduced. We had a test with QEMU + vexpress: put a 512KB-size symbol with attribute __section(".bss..page_aligned") after kasan_early_shadow_pte, and poisoned it after kasan_early_init(). Then, with CONFIG_KASAN enabled, it failed to boot up. Link: https://lkml.kernel.org/r/20210109044622.8312-1-hailongliiu@yeah.net Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn> Signed-off-by: Ziliang Guo <guo.ziliang@zte.com.cn> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-12  mm/page_alloc: add a missing mm_page_alloc_zone_locked() tracepoint  (Hailong liu)
The tracepoint trace_mm_page_alloc_zone_locked() in __rmqueue() does not currently cover all branches. Add the missing tracepoint and check the page before doing so. [akpm@linux-foundation.org: use IS_ENABLED() to suppress warning] Link: https://lkml.kernel.org/r/20201228132901.41523-1-carver4lio@163.com Signed-off-by: Hailong liu <liu.hailong6@zte.com.cn> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-12  mm, slub: consider rest of partial list if acquire_slab() fails  (Jann Horn)
acquire_slab() fails if there is contention on the freelist of the page (probably because some other CPU is concurrently freeing an object from the page). In that case, it might make sense to look for a different page (since there might be more remote frees to the page from other CPUs, and we don't want contention on struct page). However, the current code accidentally stops looking at the partial list completely in that case. Especially on kernels without CONFIG_NUMA set, this means that get_partial() fails and new_slab_objects() falls back to new_slab(), allocating new pages. This could lead to an unnecessary increase in memory fragmentation. Link: https://lkml.kernel.org/r/20201228130853.1871516-1-jannh@google.com Fixes: 7ced37197196 ("slub: Acquire_slab() avoid loop") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
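The fix amounts to continuing the scan rather than aborting it; a sketch of the loop in get_partial_node() (the surrounding loop context is paraphrased from the description, not quoted):

    /* Walk the node's partial list; if acquire_slab() loses the race on
     * one page's freelist, move on to the next partial page instead of
     * giving up and allocating a brand new slab. */
    list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
            void *t = acquire_slab(s, n, page, object == NULL, &objects);

            if (!t)
                    continue;       /* was: break */

            /* ... take this page, possibly keep filling the cpu partial list ... */
    }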
2021-01-12  mm: Close race in generic_access_phys  (Daniel Vetter)
Way back it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:

  - gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved

  - contiguous dma allocations have moved from dedicated carveouts to cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well, anything allocated with GFP_MOVEABLE)

  - even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")

Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this. Since ioremap might need to manipulate pagetables too, we need to drop the pt lock and have a retry loop if we raced. While at it, also add kerneldoc and improve the comment for the vm_ops->access function. It's for accessing, not for moving the memory from iomem to system memory, as the old comment seemed to suggest. References: 28b2ee20c7cb ("access_process_vm device memory infrastructure") Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Airlie <airlied@linux.ie> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20201127164131.2244124-8-daniel.vetter@ffwll.ch
2021-01-12  media: videobuf2: Move frame_vector into media subsystem  (Daniel Vetter)
It's the only user. This also garbage collects the CONFIG_FRAME_VECTOR symbol from all over the tree (well just one place, somehow omap media driver still had this in its Kconfig, despite not using it). Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Acked-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Acked-by: Tomasz Figa <tfiga@chromium.org> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Pawel Osciak <pawel@osciak.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20201127164131.2244124-7-daniel.vetter@ffwll.ch