summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2022-01-15mm/vmalloc: be more explicit about supported gfp flags.Michal Hocko
Commit b7d90e7a5ea8 ("mm/vmalloc: be more explicit about supported gfp flags") has been merged prematurely without the rest of the series and without addressed review feedback from Neil. Fix that up now. Only wording is changed slightly. Link: https://lkml.kernel.org/r/20211122153233.9924-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/vmalloc: add support for __GFP_NOFAILMichal Hocko
Dave Chinner has mentioned that some of the xfs code would benefit from kvmalloc support for __GFP_NOFAIL because they have allocations that cannot fail and they do not fit into a single page. The large part of the vmalloc implementation already complies with the given gfp flags so there is no work for those to be done. The area and page table allocations are an exception to that. Implement a retry loop for those. Add a short sleep before retrying. 1 jiffy is a completely random timeout. Ideally the retry would wait for an explicit event - e.g. a change to the vmalloc space change if the failure was caused by the space fragmentation or depletion. But there are multiple different reasons to retry and this could become much more complex. Keep the retry simple for now and just sleep to prevent from hogging CPUs. Link: https://lkml.kernel.org/r/20211122153233.9924-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/vmalloc: alloc GFP_NO{FS,IO} for vmallocMichal Hocko
Patch series "extend vmalloc support for constrained allocations", v2. Based on a recent discussion with Dave and Neil [1] I have tried to implement NOFS, NOIO, NOFAIL support for the vmalloc to make life of kvmalloc users easier. A requirement for NOFAIL support for kvmalloc was new to me but this seems to be really needed by the xfs code. NOFS/NOIO was a known and a long term problem which was hoped to be handled by the scope API. Those scope should have been used at the reclaim recursion boundaries both to document them and also to remove the necessity of NOFS/NOIO constrains for all allocations within that scope. Instead workarounds were developed to wrap a single allocation instead (like ceph_kvmalloc). First patch implements NOFS/NOIO support for vmalloc. The second one adds NOFAIL support and the third one bundles all together into kvmalloc and drops ceph_kvmalloc which can use kvmalloc directly now. [1] http://lkml.kernel.org/r/163184741778.29351.16920832234899124642.stgit@noble.brown This patch (of 4): vmalloc historically hasn't supported GFP_NO{FS,IO} requests because page table allocations do not support externally provided gfp mask and performed GFP_KERNEL like allocations. Since few years we have scope (memalloc_no{fs,io}_{save,restore}) APIs to enforce NOFS and NOIO constrains implicitly to all allocators within the scope. There was a hope that those scopes would be defined on a higher level when the reclaim recursion boundary starts/stops (e.g. when a lock required during the memory reclaim is required etc.). It seems that not all NOFS/NOIO users have adopted this approach and instead they have taken a workaround approach to wrap a single [k]vmalloc allocation by a scope API. These workarounds do not serve the purpose of a better reclaim recursion documentation and reduction of explicit GFP_NO{FS,IO} usege so let's just provide them with the semantic they are asking for without a need for workarounds. Add support for GFP_NOFS and GFP_NOIO to vmalloc directly. All internal allocations already comply with the given gfp_mask. The only current exception is vmap_pages_range which maps kernel page tables. Infer the proper scope API based on the given gfp mask. [sfr@canb.auug.org.au: mm/vmalloc.c needs linux/sched/mm.h] Link: https://lkml.kernel.org/r/20211217232641.0148710c@canb.auug.org.au Link: https://lkml.kernel.org/r/20211122153233.9924-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20211122153233.9924-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/dmapool.c: revert "make dma pool to use kmalloc_node"Christian König
This reverts commit 2618c60b8b5836 ("dma: make dma pool to use kmalloc_node"). While working myself into the dmapool code I've found this little odd kmalloc_node(). What basically happens here is that we allocate the housekeeping structure on the numa node where the device is attached to. Since the device is never doing DMA to or from that memory this doesn't seem to make sense at all. So while this doesn't seem to cause much harm it's probably cleaner to revert the change for consistency. Link: https://lkml.kernel.org/r/20211221110724.97664-1-christian.koenig@amd.com Signed-off-by: Christian König <christian.koenig@amd.com> Cc: Yinghai Lu <yinghai.lu@sun.com> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: remove the total_mapcount argument from page_trans_huge_mapcount()Matthew Wilcox (Oracle)
All callers pass NULL, so we can stop calculating the value we would store in it. Link: https://lkml.kernel.org/r/20211220205943.456187-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: remove the total_mapcount argument from page_trans_huge_map_swapcount()Matthew Wilcox (Oracle)
Now that we don't report it to the caller of reuse_swap_page(), we don't need to request it from page_trans_huge_map_swapcount(). Link: https://lkml.kernel.org/r/20211220205943.456187-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: remove last argument of reuse_swap_page()Matthew Wilcox (Oracle)
None of the callers care about the total_map_swapcount() any more. Link: https://lkml.kernel.org/r/20211220205943.456187-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: page table checkPasha Tatashin
Check user page table entries at the time they are added and removed. Allows to synchronously catch memory corruption issues related to double mapping. When a pte for an anonymous page is added into page table, we verify that this pte does not already point to a file backed page, and vice versa if this is a file backed page that is being added we verify that this page does not have an anonymous mapping We also enforce that read-only sharing for anonymous pages is allowed (i.e. cow after fork). All other sharing must be for file pages. Page table check allows to protect and debug cases where "struct page" metadata became corrupted for some reason. For example, when refcnt or mapcount become invalid. Link: https://lkml.kernel.org/r/20211221154650.1047963-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: ptep_clear() page table helperPasha Tatashin
We have ptep_get_and_clear() and ptep_get_and_clear_full() helpers to clear PTE from user page tables, but there is no variant for simple clear of a present PTE from user page tables without using a low level pte_clear() which can be either native or para-virtualised. Add a new ptep_clear() that can be used in common code to clear PTEs from page table. We will need this call later in order to add a hook for page table check. Link: https://lkml.kernel.org/r/20211221154650.1047963-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: change page type prior to adding page table entryPasha Tatashin
Patch series "page table check", v3. Ensure that some memory corruptions are prevented by checking at the time of insertion of entries into user page tables that there is no illegal sharing. We have recently found a problem [1] that existed in kernel since 4.14. The problem was caused by broken page ref count and led to memory leaking from one process into another. The problem was accidentally detected by studying a dump of one process and noticing that one page contains memory that should not belong to this process. There are some other page->_refcount related problems that were recently fixed: [2], [3] which potentially could also lead to illegal sharing. In addition to hardening refcount [4] itself, this work is an attempt to prevent this class of memory corruption issues. It uses a simple state machine that is independent from regular MM logic to check for illegal sharing at time pages are inserted and removed from page tables. [1] https://lore.kernel.org/all/xr9335nxwc5y.fsf@gthelen2.svl.corp.google.com [2] https://lore.kernel.org/all/1582661774-30925-2-git-send-email-akaher@vmware.com [3] https://lore.kernel.org/all/20210622021423.154662-3-mike.kravetz@oracle.com [4] https://lore.kernel.org/all/20211221150140.988298-1-pasha.tatashin@soleen.com This patch (of 4): There are a few places where we first update the entry in the user page table, and later change the struct page to indicate that this is anonymous or file page. In most places, however, we first configure the page metadata and then insert entries into the page table. Page table check, will use the information from struct page to verify the type of entry is inserted. Change the order in all places to first update struct page, and later to update page table. This means that we first do calls that may change the type of page (anon or file): page_move_anon_rmap page_add_anon_rmap do_page_add_anon_rmap page_add_new_anon_rmap page_add_file_rmap hugepage_add_anon_rmap hugepage_add_new_anon_rmap And after that do calls that add entries to the page table: set_huge_pte_at set_pte_at Link: https://lkml.kernel.org/r/20211221154650.1047963-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20211221154650.1047963-2-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Turner <pjt@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/oom_kill: allow process_mrelease to run under mmap_lock protectionSuren Baghdasaryan
With exit_mmap holding mmap_write_lock during free_pgtables call, process_mrelease does not need to elevate mm->mm_users in order to prevent exit_mmap from destrying pagetables while __oom_reap_task_mm is walking the VMA tree. The change prevents process_mrelease from calling the last mmput, which can lead to waiting for IO completion in exit_aio. Link: https://lkml.kernel.org/r/20211209191325.3069345-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: protect free_pgtables with mmap_lock write lock in exit_mmapSuren Baghdasaryan
oom-reaper and process_mrelease system call should protect against races with exit_mmap which can destroy page tables while they walk the VMA tree. oom-reaper protects from that race by setting MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP before taking and releasing mmap_write_lock. process_mrelease has to elevate mm->mm_users to prevent such race. Both oom-reaper and process_mrelease hold mmap_read_lock when walking the VMA tree. The locking rules and mechanisms could be simpler if exit_mmap takes mmap_write_lock while executing destructive operations such as free_pgtables. Change exit_mmap to hold the mmap_write_lock when calling unlock_range, free_pgtables and remove_vma. Note also that because oom-reaper checks VM_LOCKED flag, unlock_range() should not be allowed to race with it. Before this patch, remove_vma used to be called with no locks held, however with fput being executed asynchronously and vm_ops->close not being allowed to hold mmap_lock (it is called from __split_vma with mmap_sem held for write), changing that should be fine. In most cases this lock should be uncontended. Previously, Kirill reported ~4% regression caused by a similar change [1]. We reran the same test and although the individual results are quite noisy, the percentiles show lower regression with 1.6% being the worst case [2]. The change allows oom-reaper and process_mrelease to execute safely under mmap_read_lock without worries that exit_mmap might destroy page tables from under them. [1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/ [2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20211209191325.3069345-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <christian@brauner.io> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: move tlb_flush_pending inline helpers to mm_inline.hArnd Bergmann
linux/mm_types.h should only define structure definitions, to make it cheap to include elsewhere. The atomic_t helper function definitions are particularly large, so it's better to move the helpers using those into the existing linux/mm_inline.h and only include that where needed. As a follow-up, we may want to go through all the indirect includes in mm_types.h and reduce them as much as possible. Link: https://lkml.kernel.org/r/20211207125710.2503446-2-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Colin Cross <ccross@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: move anon_vma declarations to linux/mm_inline.hArnd Bergmann
The patch to add anonymous vma names causes a build failure in some configurations: include/linux/mm_types.h: In function 'is_same_vma_anon_name': include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration] 924 | return name && vma_name && !strcmp(name, vma_name); | ^~~~~~ include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'? This should not really be part of linux/mm_types.h in the first place, as that header is meant to only contain structure defintions and need a minimum set of indirect includes itself. While the header clearly includes more than it should at this point, let's not make it worse by including string.h as well, which would pull in the expensive (compile-speed wise) fortify-string logic. Move the new functions into a separate header that only needs to be included in a couple of locations. Link: https://lkml.kernel.org/r/20211207125710.2503446-1-arnd@kernel.org Fixes: "mm: add a field to store names for private anonymous memory" Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Colin Cross <ccross@google.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: add anonymous vma name refcountingSuren Baghdasaryan
While forking a process with high number (64K) of named anonymous vmas the overhead caused by strdup() is noticeable. Experiments with ARM64 Android device show up to 40% performance regression when forking a process with 64k unpopulated anonymous vmas using the max name lengths vs the same process with the same number of anonymous vmas having no name. Introduce anon_vma_name refcounted structure to avoid the overhead of copying vma names during fork() and when splitting named anonymous vmas. When a vma is duplicated, instead of copying the name we increment the refcount of this structure. Multiple vmas can point to the same anon_vma_name as long as they increment the refcount. The name member of anon_vma_name structure is assigned at structure allocation time and is never changed. If vma name changes then the refcount of the original structure is dropped, a new anon_vma_name structure is allocated to hold the new name and the vma pointer is updated to point to the new structure. With this approach the fork() performance regressions is reduced 3-4x times and with usecases using more reasonable number of VMAs (a few thousand) the regressions is not measurable. Link: https://lkml.kernel.org/r/20211019215511.3771969-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Colin Cross <ccross@google.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Glauber <jan.glauber@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Landley <rob@landley.net> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Shaohua Li <shli@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: add a field to store names for private anonymous memoryColin Cross
In many userspace applications, and especially in VM based applications like Android uses heavily, there are multiple different allocators in use. At a minimum there is libc malloc and the stack, and in many cases there are libc malloc, the stack, direct syscalls to mmap anonymous memory, and multiple VM heaps (one for small objects, one for big objects, etc.). Each of these layers usually has its own tools to inspect its usage; malloc by compiling a debug version, the VM through heap inspection tools, and for direct syscalls there is usually no way to track them. On Android we heavily use a set of tools that use an extended version of the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped in userspace and slice their usage by process, shared (COW) vs. unique mappings, backing, etc. This can account for real physical memory usage even in cases like fork without exec (which Android uses heavily to share as many private COW pages as possible between processes), Kernel SamePage Merging, and clean zero pages. It produces a measurement of the pages that only exist in that process (USS, for unique), and a measurement of the physical memory usage of that process with the cost of shared pages being evenly split between processes that share them (PSS). If all anonymous memory is indistinguishable then figuring out the real physical memory usage (PSS) of each heap requires either a pagemap walking tool that can understand the heap debugging of every layer, or for every layer's heap debugging tools to implement the pagemap walking logic, in which case it is hard to get a consistent view of memory across the whole system. Tracking the information in userspace leads to all sorts of problems. It either needs to be stored inside the process, which means every process has to have an API to export its current heap information upon request, or it has to be stored externally in a filesystem that somebody needs to clean up on crashes. It needs to be readable while the process is still running, so it has to have some sort of synchronization with every layer of userspace. Efficiently tracking the ranges requires reimplementing something like the kernel vma trees, and linking to it from every layer of userspace. It requires more memory, more syscalls, more runtime cost, and more complexity to separately track regions that the kernel is already tracking. This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a userspace-provided name for anonymous vmas. The names of named anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>]. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name) Setting the name to NULL clears it. The name length limit is 80 bytes including NUL-terminator and is checked to contain only printable ascii characters (including space), except '[',']','\','$' and '`'. Ascii strings are being used to have a descriptive identifiers for vmas, which can be understood by the users reading /proc/pid/maps or /proc/pid/smaps. Names can be standardized for a given system and they can include some variable parts such as the name of the allocator or a library, tid of the thread using it, etc. The name is stored in a pointer in the shared union in vm_area_struct that points to a null terminated string. Anonymous vmas with the same name (equivalent strings) and are otherwise mergeable will be merged. The name pointers are not shared between vmas even if they contain the same name. The name pointer is stored in a union with fields that are only used on file-backed mappings, so it does not increase memory usage. CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this feature. It keeps the feature disabled by default to prevent any additional memory overhead and to avoid confusing procfs parsers on systems which are not ready to support named anonymous vmas. The patch is based on the original patch developed by Colin Cross, more specifically on its latest version [1] posted upstream by Sumit Semwal. It used a userspace pointer to store vma names. In that design, name pointers could be shared between vmas. However during the last upstreaming attempt, Kees Cook raised concerns [2] about this approach and suggested to copy the name into kernel memory space, perform validity checks [3] and store as a string referenced from vm_area_struct. One big concern is about fork() performance which would need to strdup anonymous vma names. Dave Hansen suggested experimenting with worst-case scenario of forking a process with 64k vmas having longest possible names [4]. I ran this experiment on an ARM64 Android device and recorded a worst-case regression of almost 40% when forking such a process. This regression is addressed in the followup patch which replaces the pointer to a name with a refcounted structure that allows sharing the name pointer between vmas of the same name. Instead of duplicating the string during fork() or when splitting a vma it increments the refcount. [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/ [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/ [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/ [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/ Changes for prctl(2) manual page (in the options section): PR_SET_VMA Sets an attribute specified in arg2 for virtual memory areas starting from the address specified in arg3 and spanning the size specified in arg4. arg5 specifies the value of the attribute to be set. Note that assigning an attribute to a virtual memory area might prevent it from being merged with adjacent virtual memory areas due to the difference in that attribute's value. Currently, arg2 must be one of: PR_SET_VMA_ANON_NAME Set a name for anonymous virtual memory areas. arg5 should be a pointer to a null-terminated string containing the name. The name length including null byte cannot exceed 80 bytes. If arg5 is NULL, the name of the appropriate anonymous virtual memory areas will be reset. The name can contain only printable ascii characters (including space), except '[',']','\','$' and '`'. This feature is available only if the kernel is built with the CONFIG_ANON_VMA_NAME option enabled. [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table] Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy, added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the work here was done by Colin Cross, therefore, with his permission, keeping him as the author] Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com Signed-off-by: Colin Cross <ccross@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Glauber <jan.glauber@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Landley <rob@landley.net> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Shaohua Li <shli@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: rearrange madvise code to allow for reuseColin Cross
Patch series "mm: rearrange madvise code to allow for reuse", v11. Avoid performance regression of the new anon vma name field refcounting it. I checked the image sizes with allnoconfig builds: unpatched Linus' ToT text data bss dec hex filename 1324759 32 73928 1398719 1557bf vmlinux After the first patch is applied (madvise refactoring) text data bss dec hex filename 1322346 32 73928 1396306 154e52 vmlinux >>> 2413 bytes decrease vs ToT <<< After all patches applied with CONFIG_ANON_VMA_NAME=n text data bss dec hex filename 1322337 32 73928 1396297 154e49 vmlinux >>> 2422 bytes decrease vs ToT <<< After all patches applied with CONFIG_ANON_VMA_NAME=y text data bss dec hex filename 1325228 32 73928 1399188 155994 vmlinux >>> 469 bytes increase vs ToT <<< This patch (of 3): Refactor the madvise syscall to allow for parts of it to be reused by a prctl syscall that affects vmas. Move the code that walks vmas in a virtual address range into a function that takes a function pointer as a parameter. The only caller for now is sys_madvise, which uses it to call madvise_vma_behavior on each vma, but the next patch will add an additional caller. Move handling all vma behaviors inside madvise_behavior, and rename it to madvise_vma_behavior. Move the code that updates the flags on a vma, including splitting or merging the vma as necessary, into a new function called madvise_update_vma. The next patch will add support for updating a new anon_name field as well. Link: https://lkml.kernel.org/r/20211019215511.3771969-1-surenb@google.com Signed-off-by: Colin Cross <ccross@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jan Glauber <jan.glauber@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Rob Landley <rob@landley.net> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Shaohua Li <shli@fusionio.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15memcg: add per-memcg vmalloc statShakeel Butt
The kvmalloc* allocation functions can fallback to vmalloc allocations and more often on long running machines. In addition the kernel does have __GFP_ACCOUNT kvmalloc* calls. So, often on long running machines, the memory.stat does not tell the complete picture which type of memory is charged to the memcg. So add a per-memcg vmalloc stat. [shakeelb@google.com: page_memcg() within rcu lock, per Muchun] Link: https://lkml.kernel.org/r/20211222052457.1960701-1-shakeelb@google.com [akpm@linux-foundation.org: remove cast, per Muchun] [shakeelb@google.com: remove area->page[0] checks and move to page by page accounting per Michal] Link: https://lkml.kernel.org/r/20220104222341.3972772-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20211221215336.1922823-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/memcg: use struct_size() helper in kzalloc()Wang Weiyang
Make use of the struct_size() helper instead of an open-coded version, in order to avoid any potential type mistakes or integer overflows that, in the worst scenario, could lead to heap overflows. Link: https://github.com/KSPP/linux/issues/160 Link: https://lkml.kernel.org/r/20211216022024.127375-1-wangweiyang2@huawei.com Signed-off-by: Wang Weiyang <wangweiyang2@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15memcg: better bounds on the memcg stats updatesShakeel Butt
Commit 11192d9c124d ("memcg: flush stats only if updated") added tracking of memcg stats updates which is used by the readers to flush only if the updates are over a certain threshold. However each individual update can correspond to a large value change for a given stat. For example adding or removing a hugepage to an LRU changes the stat by thp_nr_pages (512 on x86_64). Treating the update related to THP as one can keep the stat off, in theory, by (thp_nr_pages * nr_cpus * CHARGE_BATCH) before flush. To handle such scenarios, this patch adds consideration of the stat update value as well instead of just the update event. In addition let the asyn flusher unconditionally flush the stats to put time limit on the stats skew and hopefully a lot less readers would need to flush. Link: https://lkml.kernel.org/r/20211118065350.697046-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Michal Koutný" <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/memcg: add oom_group_kill memory eventDan Schatzberg
Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/page_counter: remove an incorrect call to propagate_protected_usage()Donghai Qiao
propagate_protected_usage() is called to propagate the usage change in the page_counter structure. But there is a call to this function from page_counter_try_charge() when there is actually no usage change. Hence this call should be removed. Link: https://lkml.kernel.org/r/20211118181125.3918222-1-dqiao@redhat.com Signed-off-by: Donghai Qiao <dqiao@redhat.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: memcontrol: make cgroup_memory_nokmem staticMuchun Song
Commit 494c1dfe855e ("mm: memcg/slab: create a new set of kmalloc-cg-<n> caches") makes cgroup_memory_nokmem global, however, it is unnecessary because there is already a function mem_cgroup_kmem_disabled() which exports it. Just make it static and replace it with mem_cgroup_kmem_disabled() in mm/slab_common.c. Link: https://lkml.kernel.org/r/20211109065418.21693-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/frontswap.c: use non-atomic '__set_bit()' when possibleChristophe JAILLET
The 'a' and 'b' bitmaps are local to this function, so no concurrent access can occur. So the non-atomic '__set_bit()' can be used to save a few cycles. Link: https://lkml.kernel.org/r/e52476da5cee57151745c5c3c934a69798dc6fa4.1638132190.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15shmem: fix a race between shmem_unused_huge_shrink and shmem_evict_inodeGang Li
Fix a data race in commit 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure"). Here are call traces causing race: Call Trace 1: shmem_unused_huge_shrink+0x3ae/0x410 ? __list_lru_walk_one.isra.5+0x33/0x160 super_cache_scan+0x17c/0x190 shrink_slab.part.55+0x1ef/0x3f0 shrink_node+0x10e/0x330 kswapd+0x380/0x740 kthread+0xfc/0x130 ? mem_cgroup_shrink_node+0x170/0x170 ? kthread_create_on_node+0x70/0x70 ret_from_fork+0x1f/0x30 Call Trace 2: shmem_evict_inode+0xd8/0x190 evict+0xbe/0x1c0 do_unlinkat+0x137/0x330 do_syscall_64+0x76/0x120 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 A simple explanation: Image there are 3 items in the local list (@list). In the first traversal, A is not deleted from @list. 1) A->B->C ^ | pos (leave) In the second traversal, B is deleted from @list. Concurrently, A is deleted from @list through shmem_evict_inode() since last reference counter of inode is dropped by other thread. Then the @list is corrupted. 2) A->B->C ^ ^ | | evict pos (drop) We should make sure the inode is either on the global list or deleted from any local list before iput(). Fixed by moving inodes back to global list before we put them. [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20211125064502.99983-1-ligang.bdlg@bytedance.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: shmem: don't truncate page if memory failure happensYang Shi
The current behavior of memory failure is to truncate the page cache regardless of dirty or clean. If the page is dirty the later access will get the obsolete data from disk without any notification to the users. This may cause silent data loss. It is even worse for shmem since shmem is in-memory filesystem, truncating page cache means discarding data blocks. The later read would return all zero. The right approach is to keep the corrupted page in page cache, any later access would return error for syscalls or SIGBUS for page fault, until the file is truncated, hole punched or removed. The regular storage backed filesystems would be more complicated so this patch is focused on shmem. This also unblock the support for soft offlining shmem THP. [akpm@linux-foundation.org: coding style fixes] [arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()] Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org [Fix invalid pointer dereference in shmem_read_mapping_page_gfp() with a slight different implementation from what Ajay Garg <ajaygargnsit@gmail.com> and Muchun Song <songmuchun@bytedance.com> proposed and reworked the error handling of shmem_write_begin() suggested by Linus] Link: https://lore.kernel.org/linux-mm/20211111084617.6746-1-ajaygargnsit@gmail.com/ Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com Link: https://lkml.kernel.org/r/20211116193247.21102-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ajay Garg <ajaygargnsit@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Andy Lavr <andy.lavr@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/gup.c: stricter check on THP migration entry during follow_pmd_maskLi Xinhai
When BUG_ON check for THP migration entry, the existing code only check thp_migration_supported case, but not for !thp_migration_supported case. If !thp_migration_supported() and !pmd_present(), the original code may dead loop in theory. To make the BUG_ON check consistent, we need catch both cases. Move the BUG_ON check one step earlier, because if the bug happen we should know it instead of depend on FOLL_MIGRATION been used by caller. Because pmdval instead of *pmd is read by the is_pmd_migration_entry() check, the existing code don't help to avoid useless locking within pmd_migration_entry_wait(), so remove that check. Link: https://lkml.kernel.org/r/20211217062559.737063-1-lixinhai.lxh@gmail.com Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Zi Yan <ziy@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15gup: avoid multiple user access locking/unlocking in fault_in_{read/write}ableChristophe Leroy
fault_in_readable() and fault_in_writeable() perform __get_user() and __put_user() in a loop, implying multiple user access locking/unlocking. To avoid that, use user access blocks. Link: https://lkml.kernel.org/r/720dcf79314acca1a78fae56d478cc851952149d.1637084492.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/truncate.c: remove unneeded variablechiminghao
Return value directly instead of taking this in another redundant variable. Link: https://lkml.kernel.org/r/20211207083222.401594-1-chi.minghao@zte.com.cn Signed-off-by: chiminghao <chi.minghao@zte.com.cn> Reported-by: Zeal Robot <zealci@zte.com.cm> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/debug_vm_pgtable: update comments regarding migration swap entriesAnshuman Khandual
Commit 4dd845b5a3e5 ("mm/swapops: rework swap entry manipulation code") had changed migtation entry related helpers. Just update debug_vm_pgatble() synced documentation to reflect those changes. Link: https://lkml.kernel.org/r/1641880417-24848-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm,fs: split dump_mapping() out from dump_page()Matthew Wilcox (Oracle)
dump_mapping() is a big chunk of dump_page(), and it'd be handy to be able to call it when we don't have a struct page. Split it out and move it to fs/inode.c. Take the opportunity to simplify some of the debug messages a little. Link: https://lkml.kernel.org/r/20211121121056.2870061-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15kasan: fix quarantine conflicting with init_on_freeAndrey Konovalov
KASAN's quarantine might save its metadata inside freed objects. As this happens after the memory is zeroed by the slab allocator when init_on_free is enabled, the memory coming out of quarantine is not properly zeroed. This causes lib/test_meminit.c tests to fail with Generic KASAN. Zero the metadata when the object is removed from quarantine. Link: https://lkml.kernel.org/r/2805da5df4b57138fdacd671f5d227d58950ba54.1640037083.git.andreyknvl@google.com Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15kasan: add ability to detect double-kmem_cache_destroy()Marco Elver
Because mm/slab_common.c is not instrumented with software KASAN modes, it is not possible to detect use-after-free of the kmem_cache passed into kmem_cache_destroy(). In particular, because of the s->refcount-- and subsequent early return if non-zero, KASAN would never be able to see the double-free via kmem_cache_free(kmem_cache, s). To be able to detect a double-kmem_cache_destroy(), check accessibility of the kmem_cache, and in case of failure return early. While KASAN_HW_TAGS is able to detect such bugs, by checking accessibility and returning early we fail more gracefully and also avoid corrupting reused objects (where tags mismatch). A recent case of a double-kmem_cache_destroy() was detected by KFENCE: https://lkml.kernel.org/r/0000000000003f654905c168b09d@google.com, which was not detectable by software KASAN modes. Link: https://lkml.kernel.org/r/20211119142219.1519617-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/memremap: add ZONE_DEVICE support for compound pagesJoao Martins
Add a new @vmemmap_shift property for struct dev_pagemap which specifies that a devmap is composed of a set of compound pages of order @vmemmap_shift, instead of base pages. When a compound page devmap is requested, all but the first page are initialised as tail pages instead of order-0 pages. For certain ZONE_DEVICE users like device-dax which have a fixed page size, this creates an opportunity to optimize GUP and GUP-fast walkers, treating it the same way as THP or hugetlb pages. Additionally, commit 7118fc2906e2 ("hugetlb: address ref count racing in prep_compound_gigantic_page") removed set_page_count() because the setting of page ref count to zero was redundant. devmap pages don't come from page allocator though and only head page refcount is used for compound pages, hence initialize tail page count to zero. Link: https://lkml.kernel.org/r/20211202204422.26777-5-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/page_alloc: refactor memmap_init_zone_device() page initJoao Martins
Move struct page init to an helper function __init_zone_device_page(). This is in preparation for sharing the storage for compound page metadata. Link: https://lkml.kernel.org/r/20211202204422.26777-4-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/page_alloc: split prep_compound_page into head and tail subpartsJoao Martins
Patch series "mm, device-dax: Introduce compound pages in devmap", v7. This series converts device-dax to use compound pages, and moves away from the 'struct page per basepage on PMD/PUD' that is done today. Doing so 1) unlocks a few noticeable improvements on unpin_user_pages() and makes device-dax+altmap case 4x times faster in pinning (numbers below and in last patch) 2) as mentioned in various other threads it's one important step towards cleaning up ZONE_DEVICE refcounting. I've split the compound pages on devmap part from the rest based on recent discussions on devmap pending and future work planned[5][6]. There is consensus that device-dax should be using compound pages to represent its PMD/PUDs just like HugeTLB and THP, and that leads to less specialization of the dax parts. I will pursue the rest of the work in parallel once this part is merged, particular the GUP-{slow,fast} improvements [7] and the tail struct page deduplication memory savings part[8]. To summarize what the series does: Patch 1: Prepare hwpoisoning to work with dax compound pages. Patches 2-3: Split the current utility function of prep_compound_page() into head and tail and use those two helpers where appropriate to take advantage of caches being warm after __init_single_page(). This is used when initializing zone device when we bring up device-dax namespaces. Patches 4-10: Add devmap support for compound pages in device-dax. memmap_init_zone_device() initialize its metadata as compound pages, and it introduces a new devmap property known as vmemmap_shift which outlines how the vmemmap is structured (defaults to base pages as done today). The property describe the page order of the metadata essentially. While at it do a few cleanups in device-dax in patches 5-9. Finally enable device-dax usage of devmap @vmemmap_shift to a value based on its own @align property. @vmemmap_shift returns 0 by default (which is today's case of base pages in devmap, like fsdax or the others) and the usage of compound devmap is optional. Starting with device-dax (*not* fsdax) we enable it by default. There are a few pinning improvements particular on the unpinning case and altmap, as well as unpin_user_page_range_dirty_lock() being just as effective as THP/hugetlb[0] pages. $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms [altmap] (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms [altmap with -m 127004] (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms Tested on x86 with 1Tb+ of pmem (alongside registering it with RDMA with and without altmap), alongside gup_test selftests with dynamic dax regions and static dax regions. Coupled with ndctl unit tests for dynamic dax devices that exercise all of this. Note, for dynamic dax regions I had to revert commit 8aa83e6395 ("x86/setup: Call early_reserve_memory() earlier"), it is a known issue that this commit broke efi_fake_mem=. This patch (of 11): Split the utility function prep_compound_page() into head and tail counterparts, and use them accordingly. This is in preparation for sharing the storage for compound page metadata. Link: https://lkml.kernel.org/r/20211202204422.26777-1-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/20211202204422.26777-3-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: defer kmemleak object creation of module_alloc()Kefeng Wang
Yongqiang reports a kmemleak panic when module insmod/rmmod with KASAN enabled(without KASAN_VMALLOC) on x86[1]. When the module area allocates memory, it's kmemleak_object is created successfully, but the KASAN shadow memory of module allocation is not ready, so when kmemleak scan the module's pointer, it will panic due to no shadow memory with KASAN check. module_alloc __vmalloc_node_range kmemleak_vmalloc kmemleak_scan update_checksum kasan_module_alloc kmemleak_ignore Note, there is no problem if KASAN_VMALLOC enabled, the modules area entire shadow memory is preallocated. Thus, the bug only exits on ARCH which supports dynamic allocation of module area per module load, for now, only x86/arm64/s390 are involved. Add a VM_DEFER_KMEMLEAK flags, defer vmalloc'ed object register of kmemleak in module_alloc() to fix this issue. [1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/ [wangkefeng.wang@huawei.com: fix build] Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com [akpm@linux-foundation.org: simplify ifdefs, per Andrey] Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com Fixes: 793213a82de4 ("s390/kasan: dynamic shadow mem allocation for modules") Fixes: 39d114ddc682 ("arm64: add KASAN support") Fixes: bebf56a1b176 ("kasan: enable instrumentation of global variables") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: Yongqiang Liu <liuyongqiang13@huawei.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15kmemleak: fix kmemleak false positive report with HW tag-based kasan enableKuan-Ying Lee
With HW tag-based kasan enable, We will get the warning when we free object whose address starts with 0xFF. It is because kmemleak rbtree stores tagged object and this freeing object's tag does not match with rbtree object. In the example below, kmemleak rbtree stores the tagged object in the kmalloc(), and kfree() gets the pointer with 0xFF tag. Call sequence: ptr = kmalloc(size, GFP_KERNEL); page = virt_to_page(ptr); offset = offset_in_page(ptr); kfree(page_address(page) + offset); ptr = kmalloc(size, GFP_KERNEL); A sequence like that may cause the warning as following: 1) Freeing unknown object: In kfree(), we will get free unknown object warning in kmemleak_free(). Because object(0xFx) in kmemleak rbtree and pointer(0xFF) in kfree() have different tag. 2) Overlap existing: When we allocate that object with the same hw-tag again, we will find the overlap in the kmemleak rbtree and kmemleak thread will be killed. kmemleak: Freeing unknown object at 0xffff000003f88000 CPU: 5 PID: 177 Comm: cat Not tainted 5.16.0-rc1-dirty #21 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x1ac show_stack+0x1c/0x30 dump_stack_lvl+0x68/0x84 dump_stack+0x1c/0x38 kmemleak_free+0x6c/0x70 slab_free_freelist_hook+0x104/0x200 kmem_cache_free+0xa8/0x3d4 test_version_show+0x270/0x3a0 module_attr_show+0x28/0x40 sysfs_kf_seq_show+0xb0/0x130 kernfs_seq_show+0x30/0x40 seq_read_iter+0x1bc/0x4b0 seq_read_iter+0x1bc/0x4b0 kernfs_fop_read_iter+0x144/0x1c0 generic_file_splice_read+0xd0/0x184 do_splice_to+0x90/0xe0 splice_direct_to_actor+0xb8/0x250 do_splice_direct+0x88/0xd4 do_sendfile+0x2b0/0x344 __arm64_sys_sendfile64+0x164/0x16c invoke_syscall+0x48/0x114 el0_svc_common.constprop.0+0x44/0xec do_el0_svc+0x74/0x90 el0_svc+0x20/0x80 el0t_64_sync_handler+0x1a8/0x1b0 el0t_64_sync+0x1ac/0x1b0 ... kmemleak: Cannot insert 0xf2ff000003f88000 into the object search tree (overlaps existing) CPU: 5 PID: 178 Comm: cat Not tainted 5.16.0-rc1-dirty #21 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x1ac show_stack+0x1c/0x30 dump_stack_lvl+0x68/0x84 dump_stack+0x1c/0x38 create_object.isra.0+0x2d8/0x2fc kmemleak_alloc+0x34/0x40 kmem_cache_alloc+0x23c/0x2f0 test_version_show+0x1fc/0x3a0 module_attr_show+0x28/0x40 sysfs_kf_seq_show+0xb0/0x130 kernfs_seq_show+0x30/0x40 seq_read_iter+0x1bc/0x4b0 kernfs_fop_read_iter+0x144/0x1c0 generic_file_splice_read+0xd0/0x184 do_splice_to+0x90/0xe0 splice_direct_to_actor+0xb8/0x250 do_splice_direct+0x88/0xd4 do_sendfile+0x2b0/0x344 __arm64_sys_sendfile64+0x164/0x16c invoke_syscall+0x48/0x114 el0_svc_common.constprop.0+0x44/0xec do_el0_svc+0x74/0x90 el0_svc+0x20/0x80 el0t_64_sync_handler+0x1a8/0x1b0 el0t_64_sync+0x1ac/0x1b0 kmemleak: Kernel memory leak detector disabled kmemleak: Object 0xf2ff000003f88000 (size 128): kmemleak: comm "cat", pid 177, jiffies 4294921177 kmemleak: min_count = 1 kmemleak: count = 0 kmemleak: flags = 0x1 kmemleak: checksum = 0 kmemleak: backtrace: kmem_cache_alloc+0x23c/0x2f0 test_version_show+0x1fc/0x3a0 module_attr_show+0x28/0x40 sysfs_kf_seq_show+0xb0/0x130 kernfs_seq_show+0x30/0x40 seq_read_iter+0x1bc/0x4b0 kernfs_fop_read_iter+0x144/0x1c0 generic_file_splice_read+0xd0/0x184 do_splice_to+0x90/0xe0 splice_direct_to_actor+0xb8/0x250 do_splice_direct+0x88/0xd4 do_sendfile+0x2b0/0x344 __arm64_sys_sendfile64+0x164/0x16c invoke_syscall+0x48/0x114 el0_svc_common.constprop.0+0x44/0xec do_el0_svc+0x74/0x90 kmemleak: Automatic memory scanning thread ended [akpm@linux-foundation.org: whitespace tweak] Link: https://lkml.kernel.org/r/20211118054426.4123-1-Kuan-Ying.Lee@mediatek.com Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Doug Berger <opendmb@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: slab: make slab iterator functions staticMuchun Song
There is no external users of slab_start/next/stop(), so make them static. And the memory.kmem.slabinfo is deprecated, which outputs nothing now, so move memcg_slab_show() into mm/memcontrol.c and rename it to mem_cgroup_slab_show to be consistent with other function names. Link: https://lkml.kernel.org/r/20211109133359.32881-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm/slab_common: use WARN() if cache still has objects on destroyMarco Elver
Calling kmem_cache_destroy() while the cache still has objects allocated is a kernel bug, and will usually result in the entire cache being leaked. While the message in kmem_cache_destroy() resembles a warning, it is currently not implemented using a real WARN(). This is problematic for infrastructure testing the kernel, all of which rely on the specific format of WARN()s to pick up on bugs. Some 13 years ago this used to be a simple WARN_ON() in slub, but commit d629d8195793 ("slub: improve kmem_cache_destroy() error message") changed it into an open-coded warning to avoid confusion with a bug in slub itself. Instead, turn the open-coded warning into a real WARN() with the message preserved, so that test systems can actually identify these issues, and we get all the other benefits of using a normal WARN(). The warning message is extended with "when called from <caller-ip>" to make it even clearer where the fault lies. For most configurations this is only a cosmetic change, however, note that WARN() here will now also respect panic_on_warn. Link: https://lkml.kernel.org/r/20211102170733.648216-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-12Merge tag 'libnvdimm-for-5.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax and libnvdimm updates from Dan Williams: "The bulk of this is a rework of the dax_operations API after discovering the obstacles it posed to the work-in-progress DAX+reflink support for XFS and other copy-on-write filesystem mechanics. Primarily the need to plumb a block_device through the API to handle partition offsets was a sticking point and Christoph untangled that dependency in addition to other cleanups to make landing the DAX+reflink support easier. The DAX_PMEM_COMPAT option has been around for 4 years and not only are distributions shipping userspace that understand the current configuration API, but some are not even bothering to turn this option on anymore, so it seems a good time to remove it per the deprecation schedule. Recall that this was added after the device-dax subsystem moved from /sys/class/dax to /sys/bus/dax for its sysfs organization. All recent functionality depends on /sys/bus/dax. Some other miscellaneous cleanups and reflink prep patches are included as well. Summary: - Simplify the dax_operations API: - Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining and applying a partition offset to all its DAX iomap operations. - Remove wrappers and device-mapper stacked callbacks for ->copy_from_iter() and ->copy_to_iter() in favor of moving block_device relative offset responsibility to the dax_direct_access() caller. - Remove the need for an @bdev in filesystem-DAX infrastructure - Remove unused uio helpers copy_from_iter_flushcache() and copy_mc_to_iter() as only the non-check_copy_size() versions are used for DAX. - Prepare XFS for the pending (next merge window) DAX+reflink support - Remove deprecated DEV_DAX_PMEM_COMPAT support - Cleanup a straggling misuse of the GUID api" * tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (38 commits) iomap: Fix error handling in iomap_zero_iter() ACPI: NFIT: Import GUID before use dax: remove the copy_from_iter and copy_to_iter methods dax: remove the DAXDEV_F_SYNC flag dax: simplify dax_synchronous and set_dax_synchronous uio: remove copy_from_iter_flushcache() and copy_mc_to_iter() iomap: turn the byte variable in iomap_zero_iter into a ssize_t memremap: remove support for external pgmap refcounts fsdax: don't require CONFIG_BLOCK iomap: build the block based code conditionally dax: fix up some of the block device related ifdefs fsdax: shift partition offset handling into the file systems dax: return the partition offset from fs_dax_get_by_bdev iomap: add a IOMAP_DAX flag xfs: pass the mapping flags to xfs_bmbt_to_iomap xfs: use xfs_direct_write_iomap_ops for DAX zeroing xfs: move dax device handling into xfs_{alloc,free}_buftarg ext4: cleanup the dax handling in ext4_fill_super ext2: cleanup the dax handling in ext2_fill_super fsdax: decouple zeroing from the iomap buffered I/O code ...
2022-01-12Merge tag 'folio-5.17' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull folio conversion updates from Matthew Wilcox: "Convert much of the page cache to use folios This stops just short of actually enabling large folios. It converts everything that I noticed needs to be converted, but there may still be places I've overlooked which still have page size assumptions. The big change here is using large entries in the page cache XArray instead of many small entries. That only affects shmem for now, but it's a pretty big change for shmem since it changes where memory needs to be allocated (at split time instead of insertion)" * tag 'folio-5.17' of git://git.infradead.org/users/willy/pagecache: (49 commits) mm: Use multi-index entries in the page cache XArray: Add xas_advance() truncate,shmem: Handle truncates that split large folios truncate: Convert invalidate_inode_pages2_range to folios fs: Convert vfs_dedupe_file_range_compare to folios mm: Remove pagevec_remove_exceptionals() mm: Convert find_lock_entries() to use a folio_batch filemap: Return only folios from find_get_entries() filemap: Convert filemap_get_read_batch() to use a folio_batch filemap: Convert filemap_read() to use a folio truncate: Add invalidate_complete_folio2() truncate: Convert invalidate_inode_pages2_range() to use a folio truncate: Skip known-truncated indices truncate,shmem: Add truncate_inode_folio() shmem: Convert part of shmem_undo_range() to use a folio mm: Add unmap_mapping_folio() truncate: Add truncate_cleanup_folio() filemap: Add filemap_release_folio() filemap: Use a folio in filemap_page_mkwrite filemap: Use a folio in filemap_map_pages ...
2022-01-12Merge tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - Unify where the struct request handling code is located in the blk-mq code (Christoph) - Header cleanups (Christoph) - Clean up the io_context handling code (Christoph, me) - Get rid of ->rq_disk in struct request (Christoph) - Error handling fix for add_disk() (Christoph) - request allocation cleanusp (Christoph) - Documentation updates (Eric, Matthew) - Remove trivial crypto unregister helper (Eric) - Reduce shared tag overhead (John) - Reduce poll_stats memory overhead (me) - Known indirect function call for dio (me) - Use atomic references for struct request (me) - Support request list issue for block and NVMe (me) - Improve queue dispatch pinning (Ming) - Improve the direct list issue code (Keith) - BFQ improvements (Jan) - Direct completion helper and use it in mmc block (Sebastian) - Use raw spinlock for the blktrace code (Wander) - fsync error handling fix (Ye) - Various fixes and cleanups (Lukas, Randy, Yang, Tetsuo, Ming, me) * tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block: (132 commits) MAINTAINERS: add entries for block layer documentation docs: block: remove queue-sysfs.rst docs: sysfs-block: document virt_boundary_mask docs: sysfs-block: document stable_writes docs: sysfs-block: fill in missing documentation from queue-sysfs.rst docs: sysfs-block: add contact for nomerges docs: sysfs-block: sort alphabetically docs: sysfs-block: move to stable directory block: don't protect submit_bio_checks by q_usage_counter block: fix old-style declaration nvme-pci: fix queue_rqs list splitting block: introduce rq_list_move block: introduce rq_list_for_each_safe macro block: move rq_list macros to blk-mq.h block: drop needless assignment in set_task_ioprio() block: remove unnecessary trailing '\' bio.h: fix kernel-doc warnings block: check minor range in device_add_disk() block: use "unsigned long" for blk_validate_block_size(). block: fix error unwinding in device_add_disk ...
2022-01-11Merge tag 'kcsan.2022.01.09a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull KCSAN updates from Paul McKenney: "This provides KCSAN fixes and also the ability to take memory barriers into account for weakly-ordered systems. This last can increase the probability of detecting certain types of data races" * tag 'kcsan.2022.01.09a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (29 commits) kcsan: Only test clear_bit_unlock_is_negative_byte if arch defines it kcsan: Avoid nested contexts reading inconsistent reorder_access kcsan: Turn barrier instrumentation into macros kcsan: Make barrier tests compatible with lockdep kcsan: Support WEAK_MEMORY with Clang where no objtool support exists compiler_attributes.h: Add __disable_sanitizer_instrumentation objtool, kcsan: Remove memory barrier instrumentation from noinstr objtool, kcsan: Add memory barrier instrumentation to whitelist sched, kcsan: Enable memory barrier instrumentation mm, kcsan: Enable barrier instrumentation x86/qspinlock, kcsan: Instrument barrier of pv_queued_spin_unlock() x86/barriers, kcsan: Use generic instrumentation for non-smp barriers asm-generic/bitops, kcsan: Add instrumentation for barriers locking/atomics, kcsan: Add instrumentation for barriers locking/barriers, kcsan: Support generic instrumentation locking/barriers, kcsan: Add instrumentation for barriers kcsan: selftest: Add test case to check memory barrier instrumentation kcsan: Ignore GCC 11+ warnings about TSan runtime support kcsan: test: Add test cases for memory barrier instrumentation kcsan: test: Match reordered or normal accesses ...
2022-01-10Merge tag 'slab-for-5.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Separate struct slab from struct page - an offshot of the page folio work. Struct page fields used by slab allocators are moved from struct page to a new struct slab, that uses the same physical storage. Similar to struct folio, it always is a head page. This brings better type safety, separation of large kmalloc allocations from true slabs, and cleanup of related objcg code. - A SLAB_MERGE_DEFAULT config optimization. * tag 'slab-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (33 commits) mm/slob: Remove unnecessary page_mapcount_reset() function call bootmem: Use page->index instead of page->freelist zsmalloc: Stop using slab fields in struct page mm/slub: Define struct slab fields for CONFIG_SLUB_CPU_PARTIAL only when enabled mm/slub: Simplify struct slab slabs field definition mm/sl*b: Differentiate struct slab fields by sl*b implementations mm/kfence: Convert kfence_guarded_alloc() to struct slab mm/kasan: Convert to struct folio and struct slab mm/slob: Convert SLOB to use struct slab and struct folio mm/memcg: Convert slab objcgs from struct page to struct slab mm: Convert struct page to struct slab in functions used by other subsystems mm/slab: Finish struct page to struct slab conversion mm/slab: Convert most struct page to struct slab by spatch mm/slab: Convert kmem_getpages() and kmem_freepages() to struct slab mm/slub: Finish struct page to struct slab conversion mm/slub: Convert most struct page to struct slab by spatch mm/slub: Convert pfmemalloc_match() to take a struct slab mm/slub: Convert __free_slab() to use struct slab mm/slub: Convert alloc_slab_page() to return a struct slab mm/slub: Convert print_page_info() to print_slab_info() ...
2022-01-10Merge tag 'x86_sgx_for_v5.17_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SGX updates from Borislav Petkov: - Add support for handling hw errors in SGX pages: poisoning, recovering from poison memory and error injection into SGX pages - A bunch of changes to the SGX selftests to simplify and allow of SGX features testing without the need of a whole SGX software stack - Add a sysfs attribute which is supposed to show the amount of SGX memory in a NUMA node, similar to what /proc/meminfo is to normal memory - The usual bunch of fixes and cleanups too * tag 'x86_sgx_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) x86/sgx: Fix NULL pointer dereference on non-SGX systems selftests/sgx: Fix corrupted cpuid macro invocation x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node x86/sgx: Fix minor documentation issues selftests/sgx: Add test for multiple TCS entry selftests/sgx: Enable multiple thread support selftests/sgx: Add page permission and exception test selftests/sgx: Rename test properties in preparation for more enclave tests selftests/sgx: Provide per-op parameter structs for the test enclave selftests/sgx: Add a new kselftest: Unclobbered_vdso_oversubscribed selftests/sgx: Move setup_test_encl() to each TEST_F() selftests/sgx: Encpsulate the test enclave creation selftests/sgx: Dump segments and /proc/self/maps only on failure selftests/sgx: Create a heap for the test enclave selftests/sgx: Make data measurement for an enclave segment optional selftests/sgx: Assign source for each segment selftests/sgx: Fix a benign linker warning x86/sgx: Add check for SGX pages to ghes_do_memory_failure() x86/sgx: Add hook to error injection address validation x86/sgx: Hook arch_memory_failure() into mainline code ...
2022-01-08exit: Remove profile_task_exit & profile_munmapEric W. Biederman
When I say remove I mean remove. All profile_task_exit and profile_munmap do is call a blocking notifier chain. The helpers profile_task_register and profile_task_unregister are not called anywhere in the tree. Which means this is all dead code. So remove the dead code and make it easier to read do_exit. Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/20220103213312.9144-1-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-01-08signal: Have the oom killer detect coredumps using signal->core_stateEric W. Biederman
In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change __task_will_free_mem to test signal->core_state instead of the flag SIGNAL_GROUP_COREDUMP. Both fields are protected by siglock and both live in signal_struct so there are no real tradeoffs here, just a change to which field is being tested. Link: https://lkml.kernel.org/r/20211213225350.27481-3-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-01-08mm: Use multi-index entries in the page cacheMatthew Wilcox (Oracle)
We currently store large folios as 2^N consecutive entries. While this consumes rather more memory than necessary, it also turns out to be buggy. A writeback operation which starts within a tail page of a dirty folio will not write back the folio as the xarray's dirty bit is only set on the head index. With multi-index entries, the dirty bit will be found no matter where in the folio the operation starts. This does end up simplifying the page cache slightly, although not as much as I had hoped. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-08truncate,shmem: Handle truncates that split large foliosMatthew Wilcox (Oracle)
Handle folio splitting in the parts of the truncation functions which already handle partial pages. Factor all that code out into a new function called truncate_inode_partial_folio(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: William Kucharski <william.kucharski@oracle.com>