summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2017-11-15kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACKLevin, Alexander (Sasha Levin)
Convert all allocations that used a NOTRACK flag to stop using it. Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15kmemcheck: remove annotationsLevin, Alexander (Sasha Levin)
Patch series "kmemcheck: kill kmemcheck", v2. As discussed at LSF/MM, kill kmemcheck. KASan is a replacement that is able to work without the limitation of kmemcheck (single CPU, slow). KASan is already upstream. We are also not aware of any users of kmemcheck (or users who don't consider KASan as a suitable replacement). The only objection was that since KASAN wasn't supported by all GCC versions provided by distros at that time we should hold off for 2 years, and try again. Now that 2 years have passed, and all distros provide gcc that supports KASAN, kill kmemcheck again for the very same reasons. This patch (of 4): Remove kmemcheck annotations, and calls to kmemcheck from the kernel. [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs] Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/rmap.c: remove redundant variable cendColin Ian King
Variable cend is set but never read, hence it is redundant and can be removed. Cleans up clang build warning: Value stored to 'cend' is never read Link: http://lkml.kernel.org/r/20171011174942.1372-1-colin.king@canonical.com Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: consolidate page table accountingKirill A. Shutemov
Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: introduce wrappers to access mm->nr_ptesKirill A. Shutemov
Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd and nr_pud. The patch also makes nr_ptes accounting dependent onto CONFIG_MMU. Page table accounting doesn't make sense if you don't have page tables. It's preparation for consolidation of page-table counters in mm_struct. Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: account pud page tablesKirill A. Shutemov
On a machine with 5-level paging support a process can allocate significant amount of memory and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b ("mm: account pmd page tables to the process"). Introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15kmemleak: change /sys/kernel/debug/kmemleak permissions from 0444 to 0644Konstantin Khlebnikov
Kmemleak can be tweaked at runtime by writing commands into debugfs file. Root can use it anyway, but without the write-bit this interface isn't obvious. Link: http://lkml.kernel.org/r/150728996582.744328.11541332857988399411.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()Jan Kara
All users of pagevec_lookup() and pagevec_lookup_range() now pass PAGEVEC_SIZE as a desired number of pages. Just drop the argument. Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: add variant of pagevec_lookup_range_tag() taking number of pagesJan Kara
Currently pagevec_lookup_range_tag() takes number of pages to look up but most users don't need this. Create a new function pagevec_lookup_range_nr_tag() that takes maximum number of pages to lookup for Ceph which wants this functionality so that we can drop nr_pages argument from pagevec_lookup_range_tag(). Link: http://lkml.kernel.org/r/20171009151359.31984-13-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: use pagevec_lookup_range_tag() in write_cache_pages()Jan Kara
Use pagevec_lookup_range_tag() in write_cache_pages() as it is interested only in pages from given range. Remove unnecessary code resulting from this. Link: http://lkml.kernel.org/r/20171009151359.31984-12-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: use pagevec_lookup_range_tag() in __filemap_fdatawait_range()Jan Kara
Use pagevec_lookup_range_tag() in __filemap_fdatawait_range() as it is interested only in pages from given range. Remove unnecessary code resulting from this. Link: http://lkml.kernel.org/r/20171009151359.31984-11-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: implement find_get_pages_range_tag()Jan Kara
Patch series "Ranged pagevec tagged lookup", v3. In this series I provide a ranged variant of pagevec_lookup_tag() and use it in places where it makes sense. This series removes some common code and it also has a potential for speeding up some operations similarly as for pagevec_lookup_range() (but for now I can think of only artificial cases where this happens). This patch (of 16): Implement a variant of find_get_pages_tag() that stops iterating at given index. Lots of users of this function (through pagevec_lookup()) actually want a range lookup and all of them are currently open-coding this. Also create corresponding pagevec_lookup_range_tag() function. Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Steve French <sfrench@samba.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/page_owner.c: reduce page_owner structure sizeAyush Mittal
Maximum page order can be at max 10 which can be accomodated in short data type(2 bytes). last_migrate_reason is defined as enum type whose values can be accomodated in short data type (2 bytes). Total structure size is currently 16 bytes but after changing structure size it goes to 12 bytes. Vlastimil said: "Looks like it works, so why not. Before: [ 0.001000] allocated 50331648 bytes of page_ext After: [ 0.001000] allocated 41943040 bytes of page_ext" Link: http://lkml.kernel.org/r/1507623917-37991-1-git-send-email-ayush.m@samsung.com Signed-off-by: Ayush Mittal <ayush.m@samsung.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Amit Sahrawat <a.sahrawat@samsung.com> Cc: Vaneet Narang <v.narang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/cma.c: change pr_info to pr_err for cma_alloc fail logPintu Agarwal
It was observed that under cma_alloc fail log, pr_info was used instead of pr_err. This will lead to problems if printk debug level is set to below 7. In this case the cma_alloc failure log will not be captured in the log and it will be difficult to debug. Simply replace the pr_info with pr_err to capture failure log. Link: http://lkml.kernel.org/r/1507650633-4430-1-git-send-email-pintu.ping@gmail.com Signed-off-by: Pintu Agarwal <pintu.ping@gmail.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jaewon Kim <jaewon31.kim@samsung.com> Cc: Doug Berger <opendmb@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/swap_slots.c: fix race conditions in swap_slots cache initTim Chen
Memory allocations can happen before the swap_slots cache initialization is completed during cpu bring up. If we are low on memory, we could call get_swap_page() and access swap_slots_cache before it is fully initialized. Add a check in get_swap_page() for initialized swap_slots_cache to prevent this condition. Similar check already exists in free_swap_slot. Also annotate the checks to indicate the likely condition. We also added a memory barrier to make sure that the locks initialization are done before the assignment of cache->slots and cache->slots_ret pointers. This ensures the assumption that it is safe to acquire the slots cache locks and use the slots cache when the corresponding cache->slots or cache->slots_ret pointers are non null. [akpm@linux-foundation.org: tidy up comment] [akpm@linux-foundation.org: fix spello in comment] Link: http://lkml.kernel.org/r/65a9d0f133f63e66bba37b53b2fd0464b7cae771.1500677066.git.tim.c.chen@linux.intel.com Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com> Acked-by: Ying Huang <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: remove unused pgdat->inactive_ratioAndrey Ryabinin
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") 'pgdat->inactive_ratio' is not used, except for printing "node_inactive_ratio: 0" in /proc/zoneinfo output. Remove it. Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/mmu_notifier: avoid call to invalidate_range() in range_end()Jérôme Glisse
This is an optimization patch that only affect mmu_notifier users which rely on the invalidate_range() callback. This patch avoids calling that callback twice in a row from inside __mmu_notifier_invalidate_range_end Existing pattern (before this patch): mmu_notifier_invalidate_range_start() pte/pmd/pud_clear_flush_notify() mmu_notifier_invalidate_range() mmu_notifier_invalidate_range_end() mmu_notifier_invalidate_range() New pattern (after this patch): mmu_notifier_invalidate_range_start() pte/pmd/pud_clear_flush_notify() mmu_notifier_invalidate_range() mmu_notifier_invalidate_range_only_end() We call the invalidate_range callback after clearing the page table under the page table lock and we skip the call to invalidate_range inside the __mmu_notifier_invalidate_range_end() function. Idea from Andrea Arcangeli Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Alistair Popple <alistair@popple.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/mmu_notifier: avoid double notification when it is uselessJérôme Glisse
This patch only affects users of mmu_notifier->invalidate_range callback which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ... and it is an optimization for those users. Everyone else is unaffected by it. When clearing a pte/pmd we are given a choice to notify the event under the page table lock (notify version of *_clear_flush helpers do call the mmu_notifier_invalidate_range). But that notification is not necessary in all cases. This patch removes almost all cases where it is useless to have a call to mmu_notifier_invalidate_range before mmu_notifier_invalidate_range_end. It also adds documentation in all those cases explaining why. Below is a more in depth analysis of why this is fine to do this: For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a process virtual address space). There is only 2 cases when you need to notify those secondary TLB while holding page table lock when clearing a pte/pmd: A) page backing address is free before mmu_notifier_invalidate_range_end B) a page table entry is updated to point to a new page (COW, write fault on zero page, __replace_page(), ...) Case A is obvious you do not want to take the risk for the device to write to a page that might now be used by something completely different. Case B is more subtle. For correctness it requires the following sequence to happen: - take page table lock - clear page table entry and notify (pmd/pte_huge_clear_flush_notify()) - set page table entry to point to new page If clearing the page table entry is not followed by a notify before setting the new pte/pmd value then you can break memory model like C11 or C++11 for the device. Consider the following scenario (device use a feature similar to ATS/ PASID): Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume they are write protected for COW (other case of B apply too). [Time N] ----------------------------------------------------------------- CPU-thread-0 {try to write to addrA} CPU-thread-1 {try to write to addrB} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {read addrA and populate device TLB} DEV-thread-2 {read addrB and populate device TLB} [Time N+1] --------------------------------------------------------------- CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+2] --------------------------------------------------------------- CPU-thread-0 {COW_step1: {update page table point to new page for addrA}} CPU-thread-1 {COW_step1: {update page table point to new page for addrB}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+3] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {preempted} CPU-thread-2 {write to addrA which is a write to new page} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+3] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {preempted} CPU-thread-2 {} CPU-thread-3 {write to addrB which is a write to new page} DEV-thread-0 {} DEV-thread-2 {} [Time N+4] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+5] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {read addrA from old page} DEV-thread-2 {read addrB from new page} So here because at time N+2 the clear page table entry was not pair with a notification to invalidate the secondary TLB, the device see the new value for addrB before seing the new value for addrA. This break total memory ordering for the device. When changing a pte to write protect or to point to a new write protected page with same content (KSM) it is ok to delay invalidate_range callback to mmu_notifier_invalidate_range_end() outside the page table lock. This is true even if the thread doing page table update is preempted right after releasing page table lock before calling mmu_notifier_invalidate_range_end Thanks to Andrea for thinking of a problematic scenario for COW. [jglisse@redhat.com: v2] Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Alistair Popple <alistair@popple.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15zsmalloc: calling zs_map_object() from irq is a bugSergey Senozhatsky
Use BUG_ON(in_interrupt()) in zs_map_object(). This is not a new BUG_ON(), it's always been there, but was recently changed to VM_BUG_ON(). There are several problems there. First, we use use per-CPU mappings both in zsmalloc and in zram, and interrupt may easily corrupt those buffers. Second, and more importantly, we believe it's possible to start leaking sensitive information. Consider the following case: -> process P swap out zram per-cpu mapping CPU1 compress page A -> IRQ swap out zram per-cpu mapping CPU1 compress page B write page from per-cpu mapping CPU1 to zsmalloc pool iret -> process P write page from per-cpu mapping CPU1 to zsmalloc pool [*] return * so we store overwritten data that actually belongs to another page (task) and potentially contains sensitive data. And when process P will page fault it's going to read (swap in) that other task's data. Link: http://lkml.kernel.org/r/20170929045140.4055-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/page-writeback.c: print a warning if the vm dirtiness settings are illogicalYafang Shao
The vm direct limit setting must be set greater than vm background limit setting. Otherwise print a warning to help the operator to figure out that the vm dirtiness settings is in illogical state. Link: http://lkml.kernel.org/r/1506592464-30962-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/memblock.c: make the index explicit argument of for_each_memblock_typeGioh Kim
for_each_memblock_type macro function relies on idx variable defined in the caller context. Silent macro arguments are almost always wrong thing to do. They make code harder to read and easier to get wrong. Let's use an explicit iterator parameter for for_each_memblock_type and make the code more obious. This patch is a mere cleanup and it shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170913133029.28911-1-gi-oh.kim@profitbricks.com Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, memory_hotplug: remove timeout from __offline_memoryMichal Hocko
We have a hardcoded 120s timeout after which the memory offline fails basically since the hot remove has been introduced. This is essentially a policy implemented in the kernel. Moreover there is no way to adjust the timeout and so we are sometimes facing memory offline failures if the system is under a heavy memory pressure or very intensive CPU workload on large machines. It is not very clear what purpose the timeout actually serves. The offline operation is interruptible by a signal so if userspace wants some timeout based termination this can be done trivially by sending a signal. If there is a strong usecase to do this from the kernel then we should do it properly and have a it tunable from the userspace with the timeout disabled by default along with the explanation who uses it and for what purporse. Link: http://lkml.kernel.org/r/20170918070834.13083-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, memory_hotplug: do not fail offlining too earlyMichal Hocko
Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2. While testing memory hotplug on a large 4TB machine we have noticed that memory offlining is just too eager to fail. The primary reason is that the retry logic is just too easy to give up. We have 4 ways out of the offline - we have a permanent failure (isolation or memory notifiers fail, or hugetlb pages cannot be dropped) - userspace sends a signal - a hardcoded 120s timeout expires - page migration fails 5 times This is way too convoluted and it doesn't scale very well. We have seen both temporary migration failures as well as 120s being triggered. After removing those restrictions we were able to pass stress testing during memory hot remove without any other negative side effects observed. Therefore I suggest dropping both hard coded policies. I couldn't have found any specific reason for them in the changelog. I neither didn't get any response [1] from Kamezawa. If we need some upper bound - e.g. timeout based - then we should have a proper and user defined policy for that. In any case there should be a clear use case when introducing it. This patch (of 2): Memory offlining can fail too eagerly under heavy memory pressure. page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3 flags: 0x9855fe40010048(uptodate|active|mappedtodisk) page dumped because: isolation failed page->mem_cgroup:ffff8801cd662000 memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed Isolation has failed here because the page is not on LRU. Most probably because it was on the pcp LRU cache or it has been removed from the LRU already but it hasn't been freed yet. In both cases the page doesn't look non-migrable so retrying more makes sense. __offline_pages seems rather cluttered when it comes to the retry logic. We have 5 retries at maximum and a timeout. We could argue whether the timeout makes sense but failing just because of a race when somebody isoltes a page from LRU or puts it on a pcp LRU lists is just wrong. It only takes it to race with a process which unmaps some pages and remove them from the LRU list and we can fail the whole offline because of something that is a temporary condition and actually not harmful for the offline. Please note that unmovable pages should be already excluded during start_isolate_page_range. We could argue that has_unmovable_pages is racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable pages in most cases and movable zone shouldn't contain unmovable pages at all. Some of those pages might be pinned but not for ever because that would be a bug on its own. In any case the context is still interruptible and so the userspace can easily bail out when the operation takes too long. This is certainly better behavior than a hardcoded retry loop which is racy. Fix this by removing the max retry count and only rely on the timeout resp. interruption by a signal from the userspace. Also retry rather than fail when check_pages_isolated sees some !free pages because those could be a result of the race as well. Link: http://lkml.kernel.org/r/20170918070834.13083-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, page_alloc: fail has_unmovable_pages when seeing reserved pagesMichal Hocko
Reserved pages should be completely ignored by the core mm because they have a special meaning for their owners. has_unmovable_pages doesn't check those so we rely on other tests (reference count, or PageLRU) to fail on such pages. Althought this happens to work it is safer to simply check for those explicitly and do not rely on the owner of the page to abuse those fields for special purposes. Please note that this is more of a further fortification of the code rahter than a fix of an existing issue. Link: http://lkml.kernel.org/r/20171013120756.jeopthigbmm3c7bl@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Igor Mammedov <imammedo@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages()Michal Hocko
Joonsoo has noticed that "mm: drop migrate type checks from has_unmovable_pages" would break CMA allocator because it relies on has_unmovable_pages returning false even for CMA pageblocks which in fact don't have to be movable: alloc_contig_range start_isolate_page_range set_migratetype_isolate has_unmovable_pages This is a result of the code sharing between CMA and memory hotplug while each one has a different idea of what has_unmovable_pages should return. This is unfortunate but fixing it properly would require a lot of code duplication. Fix the issue by introducing the requested migrate type argument and special case MIGRATE_CMA case where CMA page blocks are handled properly. This will work for memory hotplug because it requires MIGRATE_MOVABLE. Link: http://lkml.kernel.org/r/20171019122118.y6cndierwl2vnguj@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ran Wang <ran.wang_1@nxp.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Igor Mammedov <imammedo@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: drop migrate type checks from has_unmovable_pagesMichal Hocko
Michael has noticed that the memory offline tries to migrate kernel code pages when doing echo 0 > /sys/devices/system/memory/memory0/online The current implementation will fail the operation after several failed page migration attempts but we shouldn't even attempt to migrate that memory and fail right away because this memory is clearly not migrateable. This will become a real problem when we drop the retry loop counter resp. timeout. The real problem is in has_unmovable_pages in fact. We should fail if there are any non migrateable pages in the area. In orther to guarantee that remove the migrate type checks because MIGRATE_MOVABLE is not guaranteed to contain only migrateable pages. It is merely a heuristic. Similarly MIGRATE_CMA does guarantee that the page allocator doesn't allocate any non-migrateable pages from the block but CMA allocations themselves are unlikely to migrateable. Therefore remove both checks. [akpm@linux-foundation.org: remove unused local `mt'] Link: http://lkml.kernel.org/r/20171013120013.698-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ran Wang <ran.wang_1@nxp.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/page-writeback.c: remove unused parameter from balance_dirty_pages()Tahsin Erdogan
"mapping" parameter to balance_dirty_pages() is not used anymore. Fixes: dfb8ae567835 ("writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback") Link: http://lkml.kernel.org/r/20170927221311.23263-1-tahsin@google.com Signed-off-by: Tahsin Erdogan <tahsin@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, swap: fix false error message in __swp_swapcount()Huang Ying
When a page fault occurs for a swap entry, the physical swap readahead (not the VMA base swap readahead) may readahead several swap entries after the fault swap entry. The readahead algorithm calculates some of the swap entries to readahead via increasing the offset of the fault swap entry without checking whether they are beyond the end of the swap device and it relys on the __swp_swapcount() and swapcache_prepare() to check it. Although __swp_swapcount() checks for the swap entry passed in, it will complain with the error message as follow for the expected invalid swap entry. This may make the end users confused. swap_info_get: Bad swap offset entry 0200f8a7 To fix the false error message, the swap entry checking is added in swapin_readahead() to avoid to pass the out-of-bound swap entries and the swap entry reserved for the swap header to __swp_swapcount() and swapcache_prepare(). Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Christian Kujau <lists@nerdbynature.de> Acked-by: Minchan Kim <minchan@kernel.org> Suggested-by: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [4.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no ↵Minchan Kim
other reference When SWP_SYNCHRONOUS_IO swapped-in pages are shared by several processes, it can cause unnecessary memory wastage by skipping swap cache. Because, with swapin fault by read, they could share a page if the page were in swap cache. Thus, it avoids allocating same content new pages. This patch makes the swapcache skipping work only if the swap pte is non-sharable. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1507620825-5537-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, swap: skip swapcache for swapin of synchronous deviceMinchan Kim
With fast swap storage, the platforms want to use swap more aggressively and swap-in is crucial to application latency. The rw_page() based synchronous devices like zram, pmem and btt are such fast storage. When I profile swapin performance with zram lz4 decompress test, S/W overhead is more than 70%. Maybe, it would be bigger in nvdimm. This patch aims to reduce swap-in latency by skipping swapcache if the swap device is synchronous device like rw_page based device. It enhances 45% my swapin test(5G sequential swapin, no readahead, from 2.41sec to 1.64sec). Link: http://lkml.kernel.org/r/1505886205-9671-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, swap: introduce SWP_SYNCHRONOUS_IOMinchan Kim
If rw-page based fast storage is used for swap devices, we need to detect it to enhance swap IO operations. This patch is preparation for optimizing of swap-in operation with next patch. Link: http://lkml.kernel.org/r/1505886205-9671-4-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/mempool.c: use kmalloc_array_node()Johannes Thumshirn
Now that we have a NUMA-aware version of kmalloc_array() we can use it instead of kmalloc_node() without an overflow check in the size calculation. Link: http://lkml.kernel.org/r/20170927082038.3782-6-jthumshirn@suse.de Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: David Rientjes <rientjes@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Doug Ledford <dledford@redhat.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Marciniszyn <infinipath@intel.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15slub: fix sysfs duplicate filename creation when slub_debug=OMiles Chen
When slub_debug=O is set. It is possible to clear debug flags for an "unmergeable" slab cache in kmem_cache_open(). It makes the "unmergeable" cache became "mergeable" in sysfs_slab_add(). These caches will generate their "unique IDs" by create_unique_id(), but it is possible to create identical unique IDs. In my experiment, sgpool-128, names_cache, biovec-256 generate the same ID ":Ft-0004096" and the kernel reports "sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'". To repeat my experiment, set disable_higher_order_debug=1, CONFIG_SLUB_DEBUG_ON=y in kernel-4.14. Fix this issue by setting unmergeable=1 if slub_debug=O and the the default slub_debug contains any no-merge flags. call path: kmem_cache_create() __kmem_cache_alias() -> we set SLAB_NEVER_MERGE flags here create_cache() __kmem_cache_create() kmem_cache_open() -> clear DEBUG_METADATA_FLAGS sysfs_slab_add() -> the slab cache is mergeable now sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096' ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x60/0x7c Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 4.14.0-rc7ajb-00131-gd4c2e9f-dirty #123 Hardware name: linux,dummy-virt (DT) task: ffffffc07d4e0080 task.stack: ffffff8008008000 PC is at sysfs_warn_dup+0x60/0x7c LR is at sysfs_warn_dup+0x60/0x7c pc : lr : pstate: 60000145 Call trace: sysfs_warn_dup+0x60/0x7c sysfs_create_dir_ns+0x98/0xa0 kobject_add_internal+0xa0/0x294 kobject_init_and_add+0x90/0xb4 sysfs_slab_add+0x90/0x200 __kmem_cache_create+0x26c/0x438 kmem_cache_create+0x164/0x1f4 sg_pool_init+0x60/0x100 do_one_initcall+0x38/0x12c kernel_init_freeable+0x138/0x1d4 kernel_init+0x10/0xfc ret_from_fork+0x10/0x18 Link: http://lkml.kernel.org/r/1510365805-5155-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15slab, slub, slob: convert slab_flags_t to 32-bitAlexey Dobriyan
struct kmem_cache::flags is "unsigned long" which is unnecessary on 64-bit as no flags are defined in the higher bits. Switch the field to 32-bit and save some space on x86_64 until such flags appear: add/remove: 0/0 grow/shrink: 0/107 up/down: 0/-657 (-657) function old new delta sysfs_slab_add 720 719 -1 ... check_object 699 676 -23 [akpm@linux-foundation.org: fix printk warning] Link: http://lkml.kernel.org/r/20171021100635.GA8287@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15slab, slub, slob: add slab_flags_tAlexey Dobriyan
Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON, etc). SLAB is bloated temporarily by switching to "unsigned long", but only temporarily. Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/slab.c: only set __GFP_RECLAIMABLE onceDavid Rientjes
SLAB_RECLAIM_ACCOUNT is a permanent attribute of a slab cache. Set __GFP_RECLAIMABLE as part of its ->allocflags rather than check the cachep flag on every page allocation. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1710171527560.140898@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/slob.c: remove an unnecessary check for __GFP_ZEROMiles Chen
Current flow guarantees a valid pointer when handling the __GFP_ZERO case. So remove the unnecessary NULL pointer check. Link: http://lkml.kernel.org/r/1507203141-11959-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: oom: show unreclaimable slab info when unreclaimable slabs > user memoryYang Shi
The kernel may panic when an oom happens without killable process sometimes it is caused by huge unreclaimable slabs used by kernel. Although kdump could help debug such problem, however, kdump is not available on all architectures and it might be malfunction sometime. And, since kernel already panic it is worthy capturing such information in dmesg to aid touble shooting. Print out unreclaimable slab info (used size and total size) which actual memory usage is not zero (num_objs * size != 0) when unreclaimable slabs amount is greater than total user memory (LRU pages). The output looks like: Unreclaimable slab info: Name Used Total rpc_buffers 31KB 31KB rpc_tasks 7KB 7KB ebitmap_node 1964KB 1964KB avtab_node 5024KB 5024KB xfs_buf 1402KB 1402KB xfs_ili 134KB 134KB xfs_efi_item 115KB 115KB xfs_efd_item 115KB 115KB xfs_buf_item 134KB 134KB xfs_log_item_desc 342KB 342KB xfs_trans 1412KB 1412KB xfs_ifork 212KB 212KB [yang.s@alibaba-inc.com: v11] Link: http://lkml.kernel.org/r/1507656303-103845-4-git-send-email-yang.s@alibaba-inc.com Link: http://lkml.kernel.org/r/1507152550-46205-4-git-send-email-yang.s@alibaba-inc.com Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: slabinfo: remove CONFIG_SLABINFOYang Shi
According to discussion with Christoph (https://marc.info/?l=linux-kernel&m=150695909709711&w=2), it sounds like it is pointless to keep CONFIG_SLABINFO around. This patch removes the CONFIG_SLABINFO config option, but /proc/slabinfo is still available. [yang.s@alibaba-inc.com: v11] Link: http://lkml.kernel.org/r/1507656303-103845-3-git-send-email-yang.s@alibaba-inc.com Link: http://lkml.kernel.org/r/1507152550-46205-3-git-send-email-yang.s@alibaba-inc.com Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15Merge branch 'for-linus' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual rocket-science from trivial tree for 4.15" * 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: MAINTAINERS: relinquish kconfig MAINTAINERS: Update my email address treewide: Fix typos in Kconfig kfifo: Fix comments init/Kconfig: Fix module signing document location misc: ibmasm: Return error on error path HID: logitech-hidpp: fix mistake in printk, "feeback" -> "feedback" MAINTAINERS: Correct path to uDraw PS3 driver tracing: Fix doc mistakes in trace sample tracing: Kconfig text fixes for CONFIG_HWLAT_TRACER MIPS: Alchemy: Remove reverted CONFIG_NETLINK_MMAP from db1xxx_defconfig mm/huge_memory.c: fixup grammar in comment lib/xz: Add fall-through comments to a switch statement
2017-11-14Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block layer updates from Jens Axboe: "This is the main pull request for block storage for 4.15-rc1. Nothing out of the ordinary in here, and no API changes or anything like that. Just various new features for drivers, core changes, etc. In particular, this pull request contains: - A patch series from Bart, closing the whole on blk/scsi-mq queue quescing. - A series from Christoph, building towards hidden gendisks (for multipath) and ability to move bio chains around. - NVMe - Support for native multipath for NVMe (Christoph). - Userspace notifications for AENs (Keith). - Command side-effects support (Keith). - SGL support (Chaitanya Kulkarni) - FC fixes and improvements (James Smart) - Lots of fixes and tweaks (Various) - bcache - New maintainer (Michael Lyle) - Writeback control improvements (Michael) - Various fixes (Coly, Elena, Eric, Liang, et al) - lightnvm updates, mostly centered around the pblk interface (Javier, Hans, and Rakesh). - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph) - Writeback series that fix the much discussed hundreds of millions of sync-all units. This goes all the way, as discussed previously (me). - Fix for missing wakeup on writeback timer adjustments (Yafang Shao). - Fix laptop mode on blk-mq (me). - {mq,name} tupple lookup for IO schedulers, allowing us to have alias names. This means you can use 'deadline' on both !mq and on mq (where it's called mq-deadline). (me). - blktrace race fix, oopsing on sg load (me). - blk-mq optimizations (me). - Obscure waitqueue race fix for kyber (Omar). - NBD fixes (Josef). - Disable writeback throttling by default on bfq, like we do on cfq (Luca Miccio). - Series from Ming that enable us to treat flush requests on blk-mq like any other request. This is a really nice cleanup. - Series from Ming that improves merging on blk-mq with schedulers, getting us closer to flipping the switch on scsi-mq again. - BFQ updates (Paolo). - blk-mq atomic flags memory ordering fixes (Peter Z). - Loop cgroup support (Shaohua). - Lots of minor fixes from lots of different folks, both for core and driver code" * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits) nvme: fix visibility of "uuid" ns attribute blk-mq: fixup some comment typos and lengths ide: ide-atapi: fix compile error with defining macro DEBUG blk-mq: improve tag waiting setup for non-shared tags brd: remove unused brd_mutex blk-mq: only run the hardware queue if IO is pending block: avoid null pointer dereference on null disk fs: guard_bio_eod() needs to consider partitions xtensa/simdisk: fix compile error nvme: expose subsys attribute to sysfs nvme: create 'slaves' and 'holders' entries for hidden controllers block: create 'slaves' and 'holders' entries for hidden gendisks nvme: also expose the namespace identification sysfs files for mpath nodes nvme: implement multipath access to nvme subsystems nvme: track shared namespaces nvme: introduce a nvme_ns_ids structure nvme: track subsystems block, nvme: Introduce blk_mq_req_flags_t block, scsi: Make SCSI quiesce and resume work reliably block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag ...
2017-11-13Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Ingo Molnar: "Note that in this cycle most of the x86 topics interacted at a level that caused them to be merged into tip:x86/asm - but this should be a temporary phenomenon, hopefully we'll back to the usual patterns in the next merge window. The main changes in this cycle were: Hardware enablement: - Add support for the Intel UMIP (User Mode Instruction Prevention) CPU feature. This is a security feature that disables certain instructions such as SGDT, SLDT, SIDT, SMSW and STR. (Ricardo Neri) [ Note that this is disabled by default for now, there are some smaller enhancements in the pipeline that I'll follow up with in the next 1-2 days, which allows this to be enabled by default.] - Add support for the AMD SEV (Secure Encrypted Virtualization) CPU feature, on top of SME (Secure Memory Encryption) support that was added in v4.14. (Tom Lendacky, Brijesh Singh) - Enable new SSE/AVX/AVX512 CPU features: AVX512_VBMI2, GFNI, VAES, VPCLMULQDQ, AVX512_VNNI, AVX512_BITALG. (Gayatri Kammela) Other changes: - A big series of entry code simplifications and enhancements (Andy Lutomirski) - Make the ORC unwinder default on x86 and various objtool enhancements. (Josh Poimboeuf) - 5-level paging enhancements (Kirill A. Shutemov) - Micro-optimize the entry code a bit (Borislav Petkov) - Improve the handling of interdependent CPU features in the early FPU init code (Andi Kleen) - Build system enhancements (Changbin Du, Masahiro Yamada) - ... plus misc enhancements, fixes and cleanups" * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (118 commits) x86/build: Make the boot image generation less verbose selftests/x86: Add tests for the STR and SLDT instructions selftests/x86: Add tests for User-Mode Instruction Prevention x86/traps: Fix up general protection faults caused by UMIP x86/umip: Enable User-Mode Instruction Prevention at runtime x86/umip: Force a page fault when unable to copy emulated result to user x86/umip: Add emulation code for UMIP instructions x86/cpufeature: Add User-Mode Instruction Prevention definitions x86/insn-eval: Add support to resolve 16-bit address encodings x86/insn-eval: Handle 32-bit address encodings in virtual-8086 mode x86/insn-eval: Add wrapper function for 32 and 64-bit addresses x86/insn-eval: Add support to resolve 32-bit address encodings x86/insn-eval: Compute linear address in several utility functions resource: Fix resource_size.cocci warnings X86/KVM: Clear encryption attribute when SEV is active X86/KVM: Decrypt shared per-cpu variables when SEV is active percpu: Introduce DEFINE_PER_CPU_DECRYPTED x86: Add support for changing memory encryption attribute in early boot x86/io: Unroll string I/O when SEV is active x86/boot: Add early boot support when running with SEV active ...
2017-11-10Merge branch 'x86/mm' into x86/asm, to merge branchesIngo Molnar
Most of x86/mm is already in x86/asm, so merge the rest too. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07mm/sparsemem: Fix ARM64 boot crash when CONFIG_SPARSEMEM_EXTREME=yKirill A. Shutemov
Since commit: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y") we allocate the mem_section array dynamically in sparse_memory_present_with_active_regions(), but some architectures, like arm64, don't call the routine to initialize sparsemem. Let's move the initialization into memory_present() it should cover all architectures. Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y") Link: http://lkml.kernel.org/r/20171107083337.89952-1-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07Merge branch 'linus' into x86/asm, to pick up fixes and resolve conflictsIngo Molnar
Conflicts: arch/x86/kernel/cpu/Makefile Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07Merge branch 'linus' into locking/core, to resolve conflictsIngo Molnar
Conflicts: include/linux/compiler-clang.h include/linux/compiler-gcc.h include/linux/compiler-intel.h include/uapi/linux/stddef.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-03block: add a poll_fn callback to struct request_queueChristoph Hellwig
That we we can also poll non blk-mq queues. Mostly needed for the NVMe multipath code, but could also be useful elsewhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03mm, swap: fix race between swap count continuation operationsHuang Ying
One page may store a set of entries of the sis->swap_map (swap_info_struct->swap_map) in multiple swap clusters. If some of the entries has sis->swap_map[offset] > SWAP_MAP_MAX, multiple pages will be used to store the set of entries of the sis->swap_map. And the pages are linked with page->lru. This is called swap count continuation. To access the pages which store the set of entries of the sis->swap_map simultaneously, previously, sis->lock is used. But to improve the scalability of __swap_duplicate(), swap cluster lock may be used in swap_count_continued() now. This may race with add_swap_count_continuation() which operates on a nearby swap cluster, in which the sis->swap_map entries are stored in the same page. The race can cause wrong swap count in practice, thus cause unfreeable swap entries or software lockup, etc. To fix the race, a new spin lock called cont_lock is added to struct swap_info_struct to protect the swap count continuation page list. This is a lock at the swap device level, so the scalability isn't very well. But it is still much better than the original sis->lock, because it is only acquired/released when swap count continuation is used. Which is considered rare in practice. If it turns out that the scalability becomes an issue for some workloads, we can split the lock into some more fine grained locks. Link: http://lkml.kernel.org/r/20171017081320.28133-1-ying.huang@intel.com Fixes: 235b62176712 ("mm/swap: add cluster lock") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [4.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-03mm/huge_memory.c: deposit page table when copying a PMD migration entryZi Yan
We need to deposit pre-allocated PTE page table when a PMD migration entry is copied in copy_huge_pmd(). Otherwise, we will leak the pre-allocated page and cause a NULL pointer dereference later in zap_huge_pmd(). The missing counters during PMD migration entry copy process are added as well. The bug report is here: https://lkml.org/lkml/2017/10/29/214 Link: http://lkml.kernel.org/r/20171030144636.4836-1-zi.yan@sent.com Fixes: 84c3fc4e9c563 ("mm: thp: check pmd migration entry in common path") Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-03userfaultfd: hugetlbfs: prevent UFFDIO_COPY to fill beyond the end of i_sizeAndrea Arcangeli
This oops: kernel BUG at fs/hugetlbfs/inode.c:484! RIP: remove_inode_hugepages+0x3d0/0x410 Call Trace: hugetlbfs_setattr+0xd9/0x130 notify_change+0x292/0x410 do_truncate+0x65/0xa0 do_sys_ftruncate.constprop.3+0x11a/0x180 SyS_ftruncate+0xe/0x10 tracesys+0xd9/0xde was caused by the lack of i_size check in hugetlb_mcopy_atomic_pte. mmap() can still succeed beyond the end of the i_size after vmtruncate zapped vmas in those ranges, but the faults must not succeed, and that includes UFFDIO_COPY. We could differentiate the retval to userland to represent a SIGBUS like a page fault would do (vs SIGSEGV), but it doesn't seem very useful and we'd need to pick a random retval as there's no meaningful syscall retval that would differentiate from SIGSEGV and SIGBUS, there's just -EFAULT. Link: http://lkml.kernel.org/r/20171016223914.2421-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>