summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2020-01-31mm/huge_memory.c: use head to emphasize the purpose of pageWei Yang
During split huge page, it checks the property of the page. Currently we do the check on page and head without emphasizing the check is on the compound page. In case the page passed to split_huge_page_to_list is a tail page, audience would take some time to think about whether the check is on compound page or tail page itself. To make it explicit, use head instead of page for those checks. After this, audience would be more clear about the checks are on compound page and the page is used to do the split and dump error message if failed. Link: http://lkml.kernel.org/r/20200110032610.26499-2-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/huge_memory.c: use head to check huge zero pageWei Yang
The page could be a tail page, if this is the case, this BUG_ON will never be triggered. Link: http://lkml.kernel.org/r/20200110032610.26499-1-richardw.yang@linux.intel.com Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()") Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm, oom: dump stack of victim when reaping failedDavid Rientjes
When a process cannot be oom reaped, for whatever reason, currently the list of locks that are held is currently dumped to the kernel log. Much more interesting is the stack trace of the victim that cannot be reaped. If the stack trace is dumped, we have the ability to find related occurrences in the same kernel code and hopefully solve the issue that is making it wedged. Dump the stack trace when a process fails to be oom reaped. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001141519280.200484@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31memblock: Use __func__ in remaining memblock_dbg() call sitesAnshuman Khandual
Replace open function name strings with %s (__func__) in all remaining memblock_dbg() call sites. Link: http://lkml.kernel.org/r/1578285510-28261-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/memblock: define memblock_physmem_add()Anshuman Khandual
On the s390 platform memblock.physmem array is being built by directly calling into memblock_add_range() which is a low level function not intended to be used outside of memblock. Hence lets conditionally add helper functions for physmem array when HAVE_MEMBLOCK_PHYS_MAP is enabled. Also use MAX_NUMNODES instead of 0 as node ID similar to memblock_add() and memblock_reserve(). Make memblock_add_range() a static function as it is no longer getting used outside of memblock. Link: http://lkml.kernel.org/r/1578283835-21969-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Collin Walling <walling@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Philipp Rudo <prudo@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONEAlex Shi
Commit 1b2ffb7896ad ("[PATCH] Zone reclaim: Allow modification of zone reclaim behavior")' defined RECLAIM_OFF/RECLAIM_ZONE, but never use them, so better to remove them. [dwagner@suse.de: fix sanity checks enabling] Link: http://lkml.kernel.org/r/20200116131642.642-1-dwagner@suse.de [akpm@linux-foundation.org: renumber the bits for neatness] Link: http://lkml.kernel.org/r/1579005573-58923-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Tobin C. Harding" <tobin@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/vmscan: remove prefetch_prev_lru_pageAlex Shi
This macro was never used in git history. So better to remove. Link: http://lkml.kernel.org/r/1579006500-127143-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/vmscan.c: remove unused return value of shrink_nodeLiu Song
The return value of shrink_node is not used, so remove unnecessary operations. Link: http://lkml.kernel.org/r/20191128143524.3223-1-fishland@aliyun.com Signed-off-by: Liu Song <liu.song11@zte.com.cn> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm: remove "count" parameter from has_unmovable_pages()David Hildenbrand
Now that the memory isolate notifier is gone, the parameter is always 0. Drop it and cleanup has_unmovable_pages(). Link: http://lkml.kernel.org/r/20191114131911.11783-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Qian Cai <cai@lca.pw> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm: remove the memory isolate notifierDavid Hildenbrand
Luckily, we have no users left, so we can get rid of it. Cleanup set_migratetype_isolate() a little bit. Link: http://lkml.kernel.org/r/20191114131911.11783-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Qian Cai <cai@lca.pw> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/page_alloc: skip non present sections on zone initializationKirill A. Shutemov
memmap_init_zone() can be called on the ranges with holes during the boot. It will skip any non-valid PFNs one-by-one. It works fine as long as holes are not too big. But huge holes in the memory map causes a problem. It takes over 20 seconds to walk 32TiB hole. x86-64 with 5-level paging allows for much larger holes in the memory map which would practically hang the system. Deferred struct page init doesn't help here. It only works on the present ranges. Skipping non-present sections would fix the issue. Link: http://lkml.kernel.org/r/20191230093828.24613-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: "Jin, Zhi" <zhi.jin@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/early_ioremap.c: use %pa to print resource_size_t variablesAndy Shevchenko
%pa takes into consideration the special types such as resource_size_t. Use this specifier %instead of explicit casting. Link: http://lkml.kernel.org/r/20191209165413.56263-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP pageLi Xinhai
When check_pte, pfn of normal, hugetlbfs and THP page need be compared. The current implementation apply comparison as - normal 4K page: page_pfn <= pfn < page_pfn + 1 - hugetlbfs page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR in pfn_in_hpage. For hugetlbfs page, it should be page_pfn == pfn Now, change pfn_in_hpage to pfn_is_match to highlight that comparison is not only for THP and explicitly compare for these cases. No impact upon current behavior, just make the code clear. I think it is important to make the code clear - comparing hugetlbfs page in range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusing. Link: http://lkml.kernel.org/r/1578737885-8890-1-git-send-email-lixinhai.lxh@gmail.com Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/memcontrol.c: cleanup some useless codeKaitao Cheng
Compound pages handling in mem_cgroup_migrate is more convoluted than necessary. The state is duplicated in compound variable and the same could be achieved by PageTransHuge check which is trivial and hpage_nr_pages is already PageTransHuge aware. It is much simpler to just use hpage_nr_pages for nr_pages and replace the local variable by PageTransHuge check directly Link: http://lkml.kernel.org/r/20191210160450.3395-1-pilgrimtao@gmail.com Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/swapfile.c: swap_next should increase position indexVasily Averin
If seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. In Aug 2018 NeilBrown noticed commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") "Some ->next functions do not increment *pos when they return NULL... Note that such ->next functions are buggy and should be fixed. A simple demonstration is dd if=/proc/swaps bs=1000 skip=1 Choose any block size larger than the size of /proc/swaps. This will always show the whole last line of /proc/swaps" Described problem is still actual. If you make lseek into middle of last output line following read will output end of last line and whole last line once again. $ dd if=/proc/swaps bs=1 # usual output Filename Type Size Used Priority /dev/dm-0 partition 4194812 97536 -2 104+0 records in 104+0 records out 104 bytes copied $ dd if=/proc/swaps bs=40 skip=1 # last line was generated twice dd: /proc/swaps: cannot skip to specified offset v/dm-0 partition 4194812 97536 -2 /dev/dm-0 partition 4194812 97536 -2 3+1 records in 3+1 records out 131 bytes copied https://bugzilla.kernel.org/show_bug.cgi?id=206283 Link: http://lkml.kernel.org/r/bd8cfd7b-ac95-9b91-f9e7-e8438bd5047d@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jann Horn <jannh@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm, tree-wide: rename put_user_page*() to unpin_user_page*()John Hubbard
In order to provide a clearer, more symmetric API for pinning and unpinning DMA pages. This way, pin_user_pages*() calls match up with unpin_user_pages*() calls, and the API is a lot closer to being self-explanatory. Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"John Hubbard
Fix the gup benchmark flags to use the symbolic FOLL_WRITE, instead of a hard-coded "1" value. Also, clean up the filtering of gup flags a little, by just doing it once before issuing any of the get_user_pages*() calls. This makes it harder to overlook, instead of having little "gup_flags & 1" phrases in the function calls. Link: http://lkml.kernel.org/r/20200107224558.2362728-22-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()John Hubbard
Convert process_vm_access to use the new pin_user_pages_remote() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages. Also, release the pages via put_user_page*(). Also, rename "pages" to "pinned_pages", as this makes for easier reading of process_vm_rw_single_vec(). Link: http://lkml.kernel.org/r/20200107224558.2362728-15-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/gup: introduce pin_user_pages*() and FOLL_PINJohn Hubbard
Introduce pin_user_pages*() variations of get_user_pages*() calls, and also pin_longterm_pages*() variations. For now, these are placeholder calls, until the various call sites are converted to use the correct get_user_pages*() or pin_user_pages*() API. These variants will eventually all set FOLL_PIN, which is also introduced, and thoroughly documented. pin_user_pages() pin_user_pages_remote() pin_user_pages_fast() All pages that are pinned via the above calls, must be unpinned via put_user_page(). The underlying rules are: * FOLL_PIN is a gup-internal flag, so the call sites should not directly set it. That behavior is enforced with assertions. * Call sites that want to indicate that they are going to do DirectIO ("DIO") or something with similar characteristics, should call a get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers will: * Start with "pin_user_pages" instead of "get_user_pages". That makes it easy to find and audit the call sites. * Set FOLL_PIN * For pages that are received via FOLL_PIN, those pages must be returned via put_user_page(). Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in this documentation. (I've reworded it and expanded upon it.) Link: http://lkml.kernel.org/r/20200107224558.2362728-12-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [Documentation] Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/gup: allow FOLL_FORCE for get_user_pages_fast()John Hubbard
Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast(). This, combined with the fact that get_user_pages_fast() falls back to "slow gup", which *does* accept FOLL_FORCE, leads to an odd situation: if you need FOLL_FORCE, you cannot call get_user_pages_fast(). There does not appear to be any reason for filtering out FOLL_FORCE. There is nothing in the _fast() implementation that requires that we avoid writing to the pages. So it appears to have been an oversight. Fix by allowing FOLL_FORCE to be set for get_user_pages_fast(). Link: http://lkml.kernel.org/r/20200107224558.2362728-9-jhubbard@nvidia.com Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERMJohn Hubbard
As it says in the updated comment in gup.c: current FOLL_LONGTERM behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on vmas. However, the corresponding restriction in get_user_pages_remote() was slightly stricter than is actually required: it forbade all FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers that do not set the "locked" arg. Update the code and comments to loosen the restriction, allowing FOLL_LONGTERM in some cases. Also, copy the DAX check ("if a VMA is DAX, don't allow long term pinning") from the VFIO call site, all the way into the internals of get_user_pages_remote() and __gup_longterm_locked(). That is: get_user_pages_remote() calls __gup_longterm_locked(), which in turn calls check_dax_vmas(). This check will then be removed from the VFIO call site in a subsequent patch. Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and to Dan Williams for helping clarify the DAX refactoring. Link: http://lkml.kernel.org/r/20200107224558.2362728-7-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pagesJohn Hubbard
An upcoming patch changes and complicates the refcounting and especially the "put page" aspects of it. In order to keep everything clean, refactor the devmap page release routines: * Rename put_devmap_managed_page() to page_is_devmap_managed(), and limit the functionality to "read only": return a bool, with no side effects. * Add a new routine, put_devmap_managed_page(), to handle decrementing the refcount for ZONE_DEVICE pages. * Change callers (just release_pages() and put_page()) to check page_is_devmap_managed() before calling the new put_devmap_managed_page() routine. This is a performance point: put_page() is a hot path, so we need to avoid non- inline function calls where possible. * Rename __put_devmap_managed_page() to free_devmap_managed_page(), and limit the functionality to unconditionally freeing a devmap page. This is originally based on a separate patch by Ira Weiny, which applied to an early version of the put_user_page() experiments. Since then, Jérôme Glisse suggested the refactoring described above. Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Suggested-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm: Cleanup __put_devmap_managed_page() vs ->page_free()Dan Williams
After the removal of the device-public infrastructure there are only 2 ->page_free() call backs in the kernel. One of those is a device-private callback in the nouveau driver, the other is a generic wakeup needed in the DAX case. In the hopes that all ->page_free() callbacks can be migrated to common core kernel functionality, move the device-private specific actions in __put_devmap_managed_page() under the is_device_private_page() conditional, including the ->page_free() callback. For the other page types just open-code the generic wakeup. Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA case. Link: http://lkml.kernel.org/r/20200107224558.2362728-4-jhubbard@nvidia.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/gup: move try_get_compound_head() to top, fix minor issuesJohn Hubbard
An upcoming patch uses try_get_compound_head() more widely, so move it to the top of gup.c. Also fix a tiny spelling error and a checkpatch.pl warning. Link: http://lkml.kernel.org/r/20200107224558.2362728-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/gup: factor out duplicate code from four routinesJohn Hubbard
Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12. Overview: This is a prerequisite to solving the problem of proper interactions between file-backed pages, and [R]DMA activities, as discussed in [1], [2], [3], and in a remarkable number of email threads since about 2017. :) A new internal gup flag, FOLL_PIN is introduced, and thoroughly documented in the last patch's Documentation/vm/pin_user_pages.rst. I believe that this will provide a good starting point for doing the layout lease work that Ira Weiny has been working on. That's because these new wrapper functions provide a clean, constrained, systematically named set of functionality that, again, is required in order to even know if a page is "dma-pinned". In contrast to earlier approaches, the page tracking can be incrementally applied to the kernel call sites that, until now, have been simply calling get_user_pages() ("gup"). In other words, opt-in by changing from this: get_user_pages() (sets FOLL_GET) put_page() to this: pin_user_pages() (sets FOLL_PIN) unpin_user_page() Testing: * I've done some overall kernel testing (LTP, and a few other goodies), and some directed testing to exercise some of the changes. And as you can see, gup_benchmark is enhanced to exercise this. Basically, I've been able to runtime test the core get_user_pages() and pin_user_pages() and related routines, but not so much on several of the call sites--but those are generally just a couple of lines changed, each. Not much of the kernel is actually using this, which on one hand reduces risk quite a lot. But on the other hand, testing coverage is low. So I'd love it if, in particular, the Infiniband and PowerPC folks could do a smoke test of this series for me. Runtime testing for the call sites so far is pretty light: * io_uring: Some directed tests from liburing exercise this, and they pass. * process_vm_access.c: A small directed test passes. * gup_benchmark: the enhanced version hits the new gup.c code, and passes. * infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py * VFIO: compiles (I'm vowing to set up a run time test soon, but it's not ready just yet) * powerpc: it compiles... * drm/via: compiles... * goldfish: compiles... * net/xdp: compiles... * media/v4l2: compiles... [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ This patch (of 22): There are four locations in gup.c that have a fair amount of code duplication. This means that changing one requires making the same changes in four places, not to mention reading the same code four times, and wondering if there are subtle differences. Factor out the common code into static functions, thus reducing the overall line count and the code's complexity. Also, take the opportunity to slightly improve the efficiency of the error cases, by doing a mass subtraction of the refcount, surrounded by get_page()/put_page(). Also, further simplify (slightly), by waiting until the the successful end of each routine, to increment *nr. Link: http://lkml.kernel.org/r/20200107224558.2362728-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/gup.c: use is_vm_hugetlb_page() to check whether to follow hugeWei Yang
No functional change, just leverage the helper function to improve readability as others. Link: http://lkml.kernel.org/r/20200113070322.26627-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm: fix gup_pud_rangeQiujun Huang
sorry for not processing for a long time. I met it again. patch v1 https://lkml.org/lkml/2019/9/20/656 do_machine_check() do_memory_failure() memory_failure() hw_poison_user_mappings() try_to_unmap() pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); ...and now we have a swap entry that indicates that the page entry refers to a bad (and poisoned) page of memory, but gup_fast() at this level of the page table was ignoring swap entries, and incorrectly assuming that "!pxd_none() == valid and present". And this was not just a poisoned page problem, but a generaly swap entry problem. So, any swap entry type (device memory migration, numa migration, or just regular swapping) could lead to the same problem. Fix this by checking for pxd_present(), instead of pxd_none(). Link: http://lkml.kernel.org/r/1578479084-15508-1-git-send-email-hqjagain@gmail.com Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/filemap.c: clean up filemap_write_and_wait()Ira Weiny
At some point filemap_write_and_wait() and filemap_write_and_wait_range() got the exact same implementation with the exception of the range being specified in *_range() Similar to other functions in fs.h which call *_range(..., 0, LLONG_MAX), change filemap_write_and_wait() to be a static inline which calls filemap_write_and_wait_range() Link: http://lkml.kernel.org/r/20191129160713.30892-1-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/debug.c: always print flags in dump_page()Vlastimil Babka
Commit 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line") inadvertently removed printing of page flags for pages that are neither anon nor ksm nor have a mapping. Fix that. Using pr_cont() again would be a solution, but the commit explicitly removed its use. Avoiding the danger of mixing up split lines from multiple CPUs might be beneficial for near-panic dumps like this, so fix this without reintroducing pr_cont(). Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz Fixes: 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: Michal Hocko <mhocko@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_tHe Zhe
kmemleak_lock as a rwlock on RT can possibly be acquired in atomic context which does work. Since the kmemleak operation is performed in atomic context make it a raw_spinlock_t so it can also be acquired on RT. This is used for debugging and is not enabled by default in a production like environment (where performance/latency matters) so it makes sense to make it a raw_spinlock_t instead trying to get rid of the atomic context. Turn also the kmemleak_object->lock into raw_spinlock_t which is acquired (nested) while the kmemleak_lock is held. The time spent in "echo scan > kmemleak" slightly improved on 64core box with this patch applied after boot. [bigeasy@linutronix.de: redo the description, update comments. Merge the individual bits: He Zhe did the kmemleak_lock, Liu Haitao the ->lock and Yongxin Liu forwarded Liu's patch.] Link: http://lkml.kernel.org/r/20191219170834.4tah3prf2gdothz4@linutronix.de Link: https://lkml.kernel.org/r/20181218150744.GB20197@arrakis.emea.arm.com Link: https://lkml.kernel.org/r/1542877459-144382-1-git-send-email-zhe.he@windriver.com Link: https://lkml.kernel.org/r/20190927082230.34152-1-yongxin.liu@windriver.com Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Liu Haitao <haitao.liu@windriver.com> Signed-off-by: Yongxin Liu <yongxin.liu@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/slub.c: avoid slub allocation while holding list_lockYu Zhao
If we are already under list_lock, don't call kmalloc(). Otherwise we will run into a deadlock because kmalloc() also tries to grab the same lock. Fix the problem by using a static bitmap instead. WARNING: possible recursive locking detected -------------------------------------------- mount-encrypted/4921 is trying to acquire lock: (&(&n->list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437 but task is already holding lock: (&(&n->list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&n->list_lock)->rlock); lock(&(&n->list_lock)->rlock); *** DEADLOCK *** Link: http://lkml.kernel.org/r/20191108193958.205102-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm: move_pages: report the number of non-attempted pagesYang Shi
Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the semantic of move_pages() has changed to return the number of non-migrated pages if they were result of a non-fatal reasons (usually a busy page). This was an unintentional change that hasn't been noticed except for LTP tests which checked for the documented behavior. There are two ways to go around this change. We can even get back to the original behavior and return -EAGAIN whenever migrate_pages is not able to migrate pages due to non-fatal reasons. Another option would be to simply continue with the changed semantic and extend move_pages documentation to clarify that -errno is returned on an invalid input or when migration simply cannot succeed (e.g. -ENOMEM, -EBUSY) or the number of pages that couldn't have been migrated due to ephemeral reasons (e.g. page is pinned or locked for other reasons). This patch implements the second option because this behavior is in place for some time without anybody complaining and possibly new users depending on it. Also it allows to have a slightly easier error handling as the caller knows that it is worth to retry when err > 0. But since the new semantic would be aborted immediately if migration is failed due to ephemeral reasons, need include the number of non-attempted pages in the return value too. Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Cc: <stable@vger.kernel.org> [4.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm: thp: don't need care deferred split queue in memcg charge move pathWei Yang
If compound is true, this means it is a PMD mapped THP. Which implies the page is not linked to any defer list. So the first code chunk will not be executed. Also with this reason, it would not be proper to add this page to a defer list. So the second code chunk is not correct. Based on this, we should remove the defer list related code. [yang.shi@linux.alibaba.com: better patch title] Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/memory_hotplug: fix remove_memory() lockdep splatDan Williams
The daxctl unit test for the dax_kmem driver currently triggers the (false positive) lockdep splat below. It results from the fact that remove_memory_block_devices() is invoked under the mem_hotplug_lock() causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs active state tracking). It is a false positive because the sysfs attribute path triggering the memory remove is not the same attribute path associated with memory-block device. sysfs_break_active_protection() is not applicable since there is no real deadlock conflict, instead move memory-block device removal outside the lock. The mem_hotplug_lock() is not needed to synchronize the memory-block device removal vs the page online state, that is already handled by lock_device_hotplug(). Specifically, lock_device_hotplug() is sufficient to allow try_remove_memory() to check the offline state of the memblocks and be assured that any in progress online attempts are flushed / blocked by kernfs_drain() / attribute removal. The add_memory() path safely creates memblock devices under the mem_hotplug_lock(). There is no kernfs active state synchronization in the memblock device_register() path, so nothing to fix there. This change is only possible thanks to the recent change that refactored memory block device removal out of arch_remove_memory() (commit 4c4b7f9ba948 "mm/memory_hotplug: remove memory block devices before arch_remove_memory()"), and David's due diligence tracking down the guarantees afforded by kernfs_drain(). Not flagged for -stable since this only impacts ongoing development and lockdep validation, not a runtime issue. ====================================================== WARNING: possible circular locking dependency detected 5.5.0-rc3+ #230 Tainted: G OE ------------------------------------------------------ lt-daxctl/6459 is trying to acquire lock: ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80 but task is already holding lock: ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (mem_hotplug_lock.rw_sem){++++}: __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 get_online_mems+0x3e/0xb0 kmem_cache_create_usercopy+0x2e/0x260 kmem_cache_create+0x12/0x20 ptlock_cache_init+0x20/0x28 start_kernel+0x243/0x547 secondary_startup_64+0xb6/0xc0 -> #1 (cpu_hotplug_lock.rw_sem){++++}: __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 cpus_read_lock+0x3e/0xb0 online_pages+0x37/0x300 memory_subsys_online+0x17d/0x1c0 device_online+0x60/0x80 state_store+0x65/0xd0 kernfs_fop_write+0xcf/0x1c0 vfs_write+0xdb/0x1d0 ksys_write+0x65/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (kn->count#241){++++}: check_prev_add+0x98/0xa40 validate_chain+0x576/0x860 __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 __kernfs_remove+0x25f/0x2e0 kernfs_remove_by_name_ns+0x41/0x80 remove_files.isra.0+0x30/0x70 sysfs_remove_group+0x3d/0x80 sysfs_remove_groups+0x29/0x40 device_remove_attrs+0x39/0x70 device_del+0x16a/0x3f0 device_unregister+0x16/0x60 remove_memory_block_devices+0x82/0xb0 try_remove_memory+0xb5/0x130 remove_memory+0x26/0x40 dev_dax_kmem_remove+0x44/0x6a [kmem] device_release_driver_internal+0xe4/0x1c0 unbind_store+0xef/0x120 kernfs_fop_write+0xcf/0x1c0 vfs_write+0xdb/0x1d0 ksys_write+0x65/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(mem_hotplug_lock.rw_sem); lock(cpu_hotplug_lock.rw_sem); lock(mem_hotplug_lock.rw_sem); lock(kn->count#241); *** DEADLOCK *** No fixes tag as this has been a long standing issue that predated the addition of kernfs lockdep annotations. Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/migrate.c: also overwrite error when it is bigger than zeroWei Yang
If we get here after successfully adding page to list, err would be 1 to indicate the page is queued in the list. Current code has two problems: * on success, 0 is not returned * on error, if add_page_for_migratioin() return 1, and the following err1 from do_move_pages_to_node() is set, the err1 is not returned since err is 1 And these behaviors break the user interface. Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node"). Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/sparse.c: reset section's mem_map when fully deactivatedPingfan Liu
After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"), when a mem section is fully deactivated, section_mem_map still records the section's start pfn, which is not used any more and will be reassigned during re-addition. In analogy with alloc/free pattern, it is better to clear all fields of section_mem_map. Beside this, it breaks the user space tool "makedumpfile" [1], which makes assumption that a hot-removed section has mem_map as NULL, instead of checking directly against SECTION_MARKED_PRESENT bit. (makedumpfile will be better to change the assumption, and need a patch) The bug can be reproduced on IBM POWERVM by "drmgr -c mem -r -q 5" , trigger a crash, and save vmcore by makedumpfile [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version") Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/mempolicy.c: fix out of bounds write in mpol_parse_str()Dan Carpenter
What we are trying to do is change the '=' character to a NUL terminator and then at the end of the function we restore it back to an '='. The problem is there are two error paths where we jump to the end of the function before we have replaced the '=' with NUL. We end up putting the '=' in the wrong place (possibly one element before the start of the buffer). Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Dmitry Vyukov <dvyukov@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31memcg: fix a crash in wb_workfn when a device disappearsTheodore Ts'o
Without memcg, there is a one-to-one mapping between the bdi and bdi_writeback structures. In this world, things are fairly straightforward; the first thing bdi_unregister() does is to shutdown the bdi_writeback structure (or wb), and part of that writeback ensures that no other work queued against the wb, and that the wb is fully drained. With memcg, however, there is a one-to-many relationship between the bdi and bdi_writeback structures; that is, there are multiple wb objects which can all point to a single bdi. There is a refcount which prevents the bdi object from being released (and hence, unregistered). So in theory, the bdi_unregister() *should* only get called once its refcount goes to zero (bdi_put will drop the refcount, and when it is zero, release_bdi gets called, which calls bdi_unregister). Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about the Brave New memcg World, and calls bdi_unregister directly. It does this without informing the file system, or the memcg code, or anything else. This causes the root wb associated with the bdi to be unregistered, but none of the memcg-specific wb's are shutdown. So when one of these wb's are woken up to do delayed work, they try to dereference their wb->bdi->dev to fetch the device name, but unfortunately bdi->dev is now NULL, thanks to the bdi_unregister() called by del_gendisk(). As a result, *boom*. Fortunately, it looks like the rest of the writeback path is perfectly happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to create a bdi_dev_name() function which can handle bdi->dev being NULL. This also allows us to bulletproof the writeback tracepoints to prevent them from dereferencing a NULL pointer and crashing the kernel if one is tracing with memcg's enabled, and an iSCSI device dies or a USB storage stick is pulled. The most common way of triggering this will be hotremoval of a device while writeback with memcg enabled is going on. It was triggering several times a day in a heavily loaded production environment. Google Bug Id: 145475544 Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31Merge tag 'kvm-5.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "This is the first batch of KVM changes. ARM: - cleanups and corner case fixes. PPC: - Bugfixes x86: - Support for mapping DAX areas with large nested page table entries. - Cleanups and bugfixes here too. A particularly important one is a fix for FPU load when the thread has TIF_NEED_FPU_LOAD. There is also a race condition which could be used in guest userspace to exploit the guest kernel, for which the embargo expired today. - Fast path for IPI delivery vmexits, shaving about 200 clock cycles from IPI latency. - Protect against "Spectre-v1/L1TF" (bring data in the cache via speculative out of bound accesses, use L1TF on the sibling hyperthread to read it), which unfortunately is an even bigger whack-a-mole game than SpectreV1. Sean continues his mission to rewrite KVM. In addition to a sizable number of x86 patches, this time he contributed a pretty large refactoring of vCPU creation that affects all architectures but should not have any visible effect. s390 will come next week together with some more x86 patches" * tag 'kvm-5.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits) x86/KVM: Clean up host's steal time structure x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed x86/kvm: Cache gfn to pfn translation x86/kvm: Introduce kvm_(un)map_gfn() x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit KVM: PPC: Book3S PR: Fix -Werror=return-type build failure KVM: PPC: Book3S HV: Release lock on page-out failure path KVM: arm64: Treat emulated TVAL TimerValue as a signed 32-bit integer KVM: arm64: pmu: Only handle supported event counters KVM: arm64: pmu: Fix chained SW_INCR counters KVM: arm64: pmu: Don't mark a counter as chained if the odd one is disabled KVM: arm64: pmu: Don't increment SW_INCR if PMCR.E is unset KVM: x86: Use a typedef for fastop functions KVM: X86: Add 'else' to unify fastop and execute call path KVM: x86: inline memslot_valid_for_gpte KVM: x86/mmu: Use huge pages for DAX-backed files KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte() KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust() KVM: x86/mmu: Zap any compound page when collapsing sptes KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch) ...
2020-01-31Merge branch 'ttm-prot-fix' of git://people.freedesktop.org/~thomash/linux ↵Dave Airlie
into drm-next A small fix for the long-standing ttm vm page protection hack. Sent as a separate PR as it touches mm, has all acks in place. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Hellström (VMware) <thellstrom@vmware.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200116102411.3056-1-thomas_os@shipmail.org
2020-01-30Merge branch 'cve-2019-3016' into kvm-next-5.6Paolo Bonzini
From Boris Ostrovsky: The KVM hypervisor may provide a guest with ability to defer remote TLB flush when the remote VCPU is not running. When this feature is used, the TLB flush will happen only when the remote VPCU is scheduled to run again. This will avoid unnecessary (and expensive) IPIs. Under certain circumstances, when a guest initiates such deferred action, the hypervisor may miss the request. It is also possible that the guest may mistakenly assume that it has already marked remote VCPU as needing a flush when in fact that request had already been processed by the hypervisor. In both cases this will result in an invalid translation being present in a vCPU, potentially allowing accesses to memory locations in that guest's address space that should not be accessible. Note that only intra-guest memory is vulnerable. The five patches address both of these problems: 1. The first patch makes sure the hypervisor doesn't accidentally clear a guest's remote flush request 2. The rest of the patches prevent the race between hypervisor acknowledging a remote flush request and guest issuing a new one. Conflicts: arch/x86/kvm/x86.c [move from kvm_arch_vcpu_free to kvm_arch_vcpu_destroy]
2020-01-29Merge tag 'for-linus-hmm' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull mmu_notifier updates from Jason Gunthorpe: "This small series revises the names in mmu_notifier to make the code clearer and more readable" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifier mm/mmu_notifiers: Use 'subscription' as the variable name for mmu_notifier mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions
2020-01-29Merge tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: - Support for various new opcodes (fallocate, openat, close, statx, fadvise, madvise, openat2, non-vectored read/write, send/recv, and epoll_ctl) - Faster ring quiesce for fileset updates - Optimizations for overflow condition checking - Support for max-sized clamping - Support for probing what opcodes are supported - Support for io-wq backend sharing between "sibling" rings - Support for registering personalities - Lots of little fixes and improvements * tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block: (64 commits) io_uring: add support for epoll_ctl(2) eventpoll: support non-blocking do_epoll_ctl() calls eventpoll: abstract out epoll_ctl() handler io_uring: fix linked command file table usage io_uring: support using a registered personality for commands io_uring: allow registering credentials io_uring: add io-wq workqueue sharing io-wq: allow grabbing existing io-wq io_uring/io-wq: don't use static creds/mm assignments io-wq: make the io_wq ref counted io_uring: fix refcounting with batched allocations at OOM io_uring: add comment for drain_next io_uring: don't attempt to copy iovec for READ/WRITE io_uring: honor IOSQE_ASYNC for linked reqs io_uring: prep req when do IOSQE_ASYNC io_uring: use labeled array init in io_op_defs io_uring: optimise sqe-to-req flags translation io_uring: remove REQ_F_IO_DRAINED io_uring: file switch work needs to get flushed on exit io_uring: hide uring_fd in ctx ...
2020-01-28Merge tag 'for-5.6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features, highlights: - async discard - "mount -o discard=async" to enable it - freed extents are not discarded immediatelly, but grouped together and trimmed later, with IO rate limiting - the "sync" mode submits short extents that could have been ignored completely by the device, for SATA prior to 3.1 the requests are unqueued and have a big impact on performance - the actual discard IO requests have been moved out of transaction commit to a worker thread, improving commit latency - IO rate and request size can be tuned by sysfs files, for now enabled only with CONFIG_BTRFS_DEBUG as we might need to add/delete the files and don't have a stable-ish ABI for general use, defaults are conservative - export device state info in sysfs, eg. missing, writeable - no discard of extents known to be untouched on disk (eg. after reservation) - device stats reset is logged with process name and PID that called the ioctl Fixes: - fix missing hole after hole punching and fsync when using NO_HOLES - writeback: range cyclic mode could miss some dirty pages and lead to OOM - two more corner cases for metadata_uuid change after power loss during the change - fix infinite loop during fsync after mix of rename operations Core changes: - qgroup assign returns ENOTCONN when quotas not enabled, used to return EINVAL that was confusing - device closing does not need to allocate memory anymore - snapshot aware code got removed, disabled for years due to performance problems, reimplmentation will allow to select wheter defrag breaks or does not break COW on shared extents - tree-checker: - check leaf chunk item size, cross check against number of stripes - verify location keys for DIR_ITEM, DIR_INDEX and XATTR items - new self test for physical -> logical mapping code, used for super block range exclusion - assertion helpers/macros updated to avoid objtool "unreachable code" reports on older compilers or config option combinations" * tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits) btrfs: free block groups after free'ing fs trees btrfs: Fix split-brain handling when changing FSID to metadata uuid btrfs: Handle another split brain scenario with metadata uuid feature btrfs: Factor out metadata_uuid code from find_fsid. btrfs: Call find_fsid from find_fsid_inprogress Btrfs: fix infinite loop during fsync after rename operations btrfs: set trans->drity in btrfs_commit_transaction btrfs: drop log root for dropped roots btrfs: sysfs, add devid/dev_state kobject and device attributes btrfs: Refactor btrfs_rmap_block to improve readability btrfs: Add self-tests for btrfs_rmap_block btrfs: selftests: Add support for dummy devices btrfs: Move and unexport btrfs_rmap_block btrfs: separate definition of assertion failure handlers btrfs: device stats, log when stats are zeroed btrfs: fix improper setting of scanned for range cyclic write cache pages btrfs: safely advance counter when looking up bio csums btrfs: remove unused member btrfs_device::work btrfs: remove unnecessary wrapper get_alloc_profile btrfs: add correction to handle -1 edge case in async discard ...
2020-01-28Merge branch 'x86-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Ingo Molnar: "Misc changes: - Enhance #GP fault printouts by distinguishing between canonical and non-canonical address faults, and also add KASAN fault decoding. - Fix/enhance the x86 NMI handler by putting the duration check into a direct function call instead of an irq_work which we know to be broken in some cases. - Clean up do_general_protection() a bit" * 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi: Remove irq_work from the long duration NMI handler x86/traps: Cleanup do_general_protection() x86/kasan: Print original address on #GP x86/dumpstack: Introduce die_addr() for die() with #GP fault address x86/traps: Print address on #GP x86/insn-eval: Add support for 64-bit kernel mode
2020-01-28Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "These were the main changes in this cycle: - More -rt motivated separation of CONFIG_PREEMPT and CONFIG_PREEMPTION. - Add more low level scheduling topology sanity checks and warnings to filter out nonsensical topologies that break scheduling. - Extend uclamp constraints to influence wakeup CPU placement - Make the RT scheduler more aware of asymmetric topologies and CPU capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y - Make idle CPU selection more consistent - Various fixes, smaller cleanups, updates and enhancements - please see the git log for details" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) sched/fair: Define sched_idle_cpu() only for SMP configurations sched/topology: Assert non-NUMA topology masks don't (partially) overlap idle: fix spelling mistake "iterrupts" -> "interrupts" sched/fair: Remove redundant call to cpufreq_update_util() sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP sched/fair: calculate delta runnable load only when it's needed sched/cputime: move rq parameter in irqtime_account_process_tick stop_machine: Make stop_cpus() static sched/debug: Reset watchdog on all CPUs while processing sysrq-t sched/core: Fix size of rq::uclamp initialization sched/uclamp: Fix a bug in propagating uclamp value in new cgroups sched/fair: Load balance aggressively for SCHED_IDLE CPUs sched/fair : Improve update_sd_pick_busiest for spare capacity case watchdog: Remove soft_lockup_hrtimer_cnt and related code sched/rt: Make RT capacity-aware sched/fair: Make EAS wakeup placement consider uclamp restrictions sched/fair: Make task_fits_capacity() consider uclamp restrictions sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with() sched/uclamp: Make uclamp util helpers use and return UL values ...
2020-01-28Merge branch 'efi-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "The main changes in this cycle were: - Cleanup of the GOP [graphics output] handling code in the EFI stub - Complete refactoring of the mixed mode handling in the x86 EFI stub - Overhaul of the x86 EFI boot/runtime code - Increase robustness for mixed mode code - Add the ability to disable DMA at the root port level in the EFI stub - Get rid of RWX mappings in the EFI memory map and page tables, where possible - Move the support code for the old EFI memory mapping style into its only user, the SGI UV1+ support code. - plus misc fixes, updates, smaller cleanups. ... and due to interactions with the RWX changes, another round of PAT cleanups make a guest appearance via the EFI tree - with no side effects intended" * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits) efi/x86: Disable instrumentation in the EFI runtime handling code efi/libstub/x86: Fix EFI server boot failure efi/x86: Disallow efi=old_map in mixed mode x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld efi/x86: avoid KASAN false positives when accessing the 1: 1 mapping efi: Fix handling of multiple efi_fake_mem= entries efi: Fix efi_memmap_alloc() leaks efi: Add tracking for dynamically allocated memmaps efi: Add a flags parameter to efi_memory_map efi: Fix comment for efi_mem_type() wrt absent physical addresses efi/arm: Defer probe of PCIe backed efifb on DT systems efi/x86: Limit EFI old memory map to SGI UV machines efi/x86: Avoid RWX mappings for all of DRAM efi/x86: Don't map the entire kernel text RW for mixed mode x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd efi/libstub/x86: Fix unused-variable warning efi/libstub/x86: Use mandatory 16-byte stack alignment in mixed mode efi/libstub/x86: Use const attribute for efi_is_64bit() efi: Allow disabling PCI busmastering on bridges during boot efi/x86: Allow translating 64-bit arguments for mixed mode calls ...
2020-01-27Merge tag 'smp-core-2020-01-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core SMP updates from Thomas Gleixner: "A small set of SMP core code changes: - Rework the smp function call core code to avoid the allocation of an additional cpumask - Remove the not longer required GFP argument from on_each_cpu_cond() and on_each_cpu_cond_mask() and fixup the callers" * tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp: Remove allocation mask from on_each_cpu_cond.*() smp: Add a smp_cond_func_t argument to smp_call_function_many() smp: Use smp_cond_func_t as type for the conditional function
2020-01-27Merge tag 'timers-core-2020-01-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "The timekeeping and timers departement provides: - Time namespace support: If a container migrates from one host to another then it expects that clocks based on MONOTONIC and BOOTTIME are not subject to disruption. Due to different boot time and non-suspended runtime these clocks can differ significantly on two hosts, in the worst case time goes backwards which is a violation of the POSIX requirements. The time namespace addresses this problem. It allows to set offsets for clock MONOTONIC and BOOTTIME once after creation and before tasks are associated with the namespace. These offsets are taken into account by timers and timekeeping including the VDSO. Offsets for wall clock based clocks (REALTIME/TAI) are not provided by this mechanism. While in theory possible, the overhead and code complexity would be immense and not justified by the esoteric potential use cases which were discussed at Plumbers '18. The overhead for tasks in the root namespace (ie where host time offsets = 0) is in the noise and great effort was made to ensure that especially in the VDSO. If time namespace is disabled in the kernel configuration the code is compiled out. Kudos to Andrei Vagin and Dmitry Sofanov who implemented this feature and kept on for more than a year addressing review comments, finding better solutions. A pleasant experience. - Overhaul of the alarmtimer device dependency handling to ensure that the init/suspend/resume ordering is correct. - A new clocksource/event driver for Microchip PIT64 - Suspend/resume support for the Hyper-V clocksource - The usual pile of fixes, updates and improvements mostly in the driver code" * tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits) alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n alarmtimer: Use wakeup source from alarmtimer platform device alarmtimer: Make alarmtimer platform device child of RTC device alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality hrtimer: Add missing sparse annotation for __run_timer() lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres() MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources clocksource/drivers/timer-microchip-pit64b: Fix sparse warning clocksource/drivers/exynos_mct: Rename Exynos to lowercase clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access clocksource/drivers/timer-ti-dm: Switch to platform_get_irq clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource clocksource/drivers/bcm2835_timer: Fix memory leak of timer clocksource/drivers/cadence-ttc: Use ttc driver as platform driver clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page ...
2020-01-27Merge branch 'for-5.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cgroup2 interface for hugetlb controller. I think this was the last remaining bit which was missing from cgroup2 - fixes for race and a spurious warning in threaded cgroup handling - other minor changes * 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: iocost: Fix iocost_monitor.py due to helper type mismatch cgroup: Prevent double killing of css when enabling threaded cgroup cgroup: fix function name in comment mm: hugetlb controller for cgroups v2