summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2016-05-20zsmalloc: require GFP in zs_malloc()Sergey Senozhatsky
Pass GFP flags to zs_malloc() instead of using a fixed mask supplied to zs_create_pool(), so we can be more flexible, but, more importantly, we need this to switch zram to per-cpu compression streams -- zram will try to allocate handle with preemption disabled in a fast path and switch to a slow path (using different gfp mask) if the fast one has failed. Apart from that, this also align zs_malloc() interface with zspool/zbud. [sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask] Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm/kasan: add API to check memory regionsAndrey Ryabinin
Memory access coded in an assembly won't be seen by KASAN as a compiler can instrument only C code. Add kasan_check_[read,write]() API which is going to be used to check a certain memory range. Link: http://lkml.kernel.org/r/1462538722-1574-3-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: kasan: initial memory quarantine implementationAlexander Potapenko
Quarantine isolates freed objects in a separate queue. The objects are returned to the allocator later, which helps to detect use-after-free errors. When the object is freed, its state changes from KASAN_STATE_ALLOC to KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine instead of being returned to the allocator, therefore every subsequent access to that object triggers a KASAN error, and the error handler is able to say where the object has been allocated and deallocated. When it's time for the object to leave quarantine, its state becomes KASAN_STATE_FREE and it's returned to the allocator. From now on the allocator may reuse it for another allocation. Before that happens, it's still possible to detect a use-after free on that object (it retains the allocation/deallocation stacks). When the allocator reuses this object, the shadow is unpoisoned and old allocation/deallocation stacks are wiped. Therefore a use of this object, even an incorrect one, won't trigger ASan warning. Without the quarantine, it's not guaranteed that the objects aren't reused immediately, that's why the probability of catching a use-after-free is lower than with quarantine in place. Quarantine isolates freed objects in a separate queue. The objects are returned to the allocator later, which helps to detect use-after-free errors. Freed objects are first added to per-cpu quarantine queues. When a cache is destroyed or memory shrinking is requested, the objects are moved into the global quarantine queue. Whenever a kmalloc call allows memory reclaiming, the oldest objects are popped out of the global queue until the total size of objects in quarantine is less than 3/4 of the maximum quarantine size (which is a fraction of installed physical memory). As long as an object remains in the quarantine, KASAN is able to report accesses to it, so the chance of reporting a use-after-free is increased. Once the object leaves quarantine, the allocator may reuse it, in which case the object is unpoisoned and KASAN can't detect incorrect accesses to it. Right now quarantine support is only enabled in SLAB allocator. Unification of KASAN features in SLAB and SLUB will be done later. This patch is based on the "mm: kasan: quarantine" patch originally prepared by Dmitry Chernenkov. A number of improvements have been suggested by Andrey Ryabinin. [glider@google.com: v9] Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: page_is_guard(): return false when page_ext arrays are not allocated yetYang Shi
When enabling the below kernel configs: CONFIG_DEFERRED_STRUCT_PAGE_INIT CONFIG_DEBUG_PAGEALLOC CONFIG_PAGE_EXTENSION CONFIG_DEBUG_VM kernel bootup may fail due to the following oops: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000 RIP: 0010:[<ffffffff8118d982>] [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0 RSP: 0000:ffff88017c087c48 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401 RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009 R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400 R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080 FS: 0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0 Call Trace: free_hot_cold_page+0x192/0x1d0 __free_pages+0x5c/0x90 __free_pages_boot_core+0x11a/0x14e deferred_free_range+0x50/0x62 deferred_init_memmap+0x220/0x3c3 kthread+0xf8/0x110 ret_from_fork+0x22/0x40 Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff RIP [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0 RSP <ffff88017c087c48> CR2: 0000000000000000 The problem is lookup_page_ext() returns NULL then page_is_guard() tried to access it in page freeing. page_is_guard() depends on PAGE_EXT_DEBUG_GUARD bit of page extension flag, but freeing page might reach here before the page_ext arrays are allocated when feeding a range of pages to the allocator for the first time during bootup or memory hotplug. When it returns NULL, page_is_guard() should just return false instead of checking PAGE_EXT_DEBUG_GUARD unconditionally. Link: http://lkml.kernel.org/r/1463610225-29060-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: make faultaround produce old ptesKirill A. Shutemov
Currently, faultaround code produces young pte. This can screw up vmscan behaviour[1], as it makes vmscan think that these pages are hot and not push them out on first round. During sparse file access faultaround gets more pages mapped and all of them are young. Under memory pressure, this makes vmscan swap out anon pages instead, or to drop other page cache pages which otherwise stay resident. Modify faultaround to produce old ptes, so they can easily be reclaimed under memory pressure. This can to some extend defeat the purpose of faultaround on machines without hardware accessed bit as it will not help us with reducing the number of minor page faults. We may want to disable faultaround on such machines altogether, but that's subject for separate patchset. Minchan: "I tested 512M mmap sequential word read test on non-HW access bit system (i.e., ARM) and confirmed it doesn't increase minor fault any more. old: 4096 fault_around minor fault: 131291 elapsed time: 6747645 usec new: 65536 fault_around minor fault: 131291 elapsed time: 6709263 usec 0.56% benefit" [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Minchan Kim <minchan@kernel.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: use phys_addr_t for reserve_bootmem_region() argumentsStefan Bader
Since commit 92923ca3aace ("mm: meminit: only set page reserved in the memblock region") the reserved bit is set on reserved memblock regions. However start and end address are passed as unsigned long. This is only 32bit on i386, so it can end up marking the wrong pages reserved for ranges at 4GB and above. This was observed on a 32bit Xen dom0 which was booted with initial memory set to a value below 4G but allowing to balloon in memory (dom0_mem=1024M for example). This would define a reserved bootmem region for the additional memory (for example on a 8GB system there was a reverved region covering the 4GB-8GB range). But since the addresses were passed on as unsigned long, this was actually marking all pages from 0 to 4GB as reserved. Fixes: 92923ca3aacef63 ("mm: meminit: only set page reserved in the memblock region") Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20userfaultfd: don't pin the user memory in userfaultfd_file_create()Oleg Nesterov
userfaultfd_file_create() increments mm->mm_users; this means that the memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY after that can populate the orphaned mm more. Change userfaultfd_file_create() and userfaultfd_ctx_put() to use mm->mm_count to pin mm_struct. This means that atomic_inc_not_zero(mm->mm_users) is needed when we are going to actually play with this memory. Except handle_userfault() path doesn't need this, the caller must already have a reference. The patch adds the new trivial helper, mmget_not_zero(), it can have more users. Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: thp: microoptimize compound_mapcount()Andrea Arcangeli
compound_mapcount() is only called after PageCompound() has already been checked by the caller, so there's no point to check it again. Gcc may optimize it away too because it's inline but this will remove the runtime check for sure and add it'll add an assert instead. Link: http://lkml.kernel.org/r/1462547040-1737-3-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: use unsigned long constant for page flagsYu Zhao
struct page->flags is unsigned long, so when shifting bits we should use UL suffix to match it. Found this problem after I added 64-bit CPU specific page flags and failed to compile the kernel: mm/page_alloc.c: In function '__free_one_page': mm/page_alloc.c:672:2: error: integer overflow in expression [-Werror=overflow] Link: http://lkml.kernel.org/r/1461971723-16187-1-git-send-email-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm fix commmets: if SPARSEMEM, pgdata doesn't have page_extWeijie Yang
If SPARSEMEM, use page_ext in mem_section if !SPARSEMEM, use page_ext in pgdata Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20include/linux/hugetlb.h: use bool instead of int for ↵Chen Gang
hugepage_migration_supported() It is used as a pure bool function within kernel source wide. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20include/linux/hugetlb*.h: clean up codeChen Gang
Macro HUGETLBFS_SB is clear enough, so one statement is clearer than 3 lines statements. Remove redundant return statements for non-return functions, which can save lines, at least. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: tighten fault_in_pages_writeable()Eric Dumazet
copy_page_to_iter_iovec() is currently the only user of fault_in_pages_writeable(), and it definitely can use fragments from high order pages. Make sure fault_in_pages_writeable() is only touching two adjacent pages at most, as claimed. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm/vmalloc: keep a separate lazy-free listChris Wilson
When mixing lots of vmallocs and set_memory_*() (which calls vm_unmap_aliases()) I encountered situations where the performance degraded severely due to the walking of the entire vmap_area list each invocation. One simple improvement is to add the lazily freed vmap_area to a separate lockless free list, such that we then avoid having to walk the full list on each purge. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Roman Pen <r.peniaev@gmail.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Roman Pen <r.peniaev@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Shawn Lin <shawn.lin@rock-chips.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm,oom: speed up select_bad_process() loopTetsuo Handa
Since commit 3a5dda7a17cf ("oom: prevent unnecessary oom kills or kernel panics"), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20oom: consider multi-threaded tasks in task_will_free_memMichal Hocko
task_will_free_mem is a misnomer for a more complex PF_EXITING test for early break out from the oom killer because it is believed that such a task would release its memory shortly and so we do not have to select an oom victim and perform a disruptive action. Currently we make sure that the given task is not participating in the core dumping because it might get blocked for a long time - see commit d003f371b270 ("oom: don't assume that a coredumping thread will exit soon"). The check can still do better though. We shouldn't consider the task unless the whole thread group is going down. This is rather unlikely but not impossible. A single exiting thread would surely leave all the address space behind. If we are really unlucky it might get stuck on the exit path and keep its TIF_MEMDIE and so block the oom killer. Link: http://lkml.kernel.org/r/1460452756-15491-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, oom_reaper: do not mmput synchronously from the oom reaper contextMichal Hocko
Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, oom_reaper: hide oom reaped tasks from OOM killer more carefullyMichal Hocko
Commit 36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, oom, compaction: prevent from should_compact_retry looping for ever for ↵Michal Hocko
costly orders "mm: consider compaction feedback also for costly allocation" has removed the upper bound for the reclaim/compaction retries based on the number of reclaimed pages for costly orders. While this is desirable the patch did miss a mis interaction between reclaim, compaction and the retry logic. The direct reclaim tries to get zones over min watermark while compaction backs off and returns COMPACT_SKIPPED when all zones are below low watermark + 1<<order gap. If we are getting really close to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a high order request (e.g. hugetlb order-9) while the reclaim is not able to release enough pages to get us over low watermark. The reclaim is still able to make some progress (usually trashing over few remaining pages) so we are not able to break out from the loop. I have seen this happening with the same test described in "mm: consider compaction feedback also for costly allocation" on a swapless system. The original problem got resolved by "vmscan: consider classzone_idx in compaction_ready" but it shows how things might go wrong when we approach the oom event horizont. The reason why compaction requires being over low rather than min watermark is not clear to me. This check was there essentially since 56de7263fcf3 ("mm: compaction: direct compact when a high-order allocation fails"). It is clearly an implementation detail though and we shouldn't pull it into the generic retry logic while we should be able to cope with such eventuality. The only place in should_compact_retry where we retry without any upper bound is for compaction_withdrawn() case. Introduce compaction_zonelist_suitable function which checks the given zonelist and returns true only if there is at least one zone which would would unblock __compaction_suitable if more memory got reclaimed. In this implementation it checks __compaction_suitable with NR_FREE_PAGES plus part of the reclaimable memory as the target for the watermark check. The reclaimable memory is reduced linearly by the allocation order. The idea is that we do not want to reclaim all the remaining memory for a single allocation request just unblock __compaction_suitable which doesn't guarantee we will make a further progress. The new helper is then used if compaction_withdrawn() feedback was provided so we do not retry if there is no outlook for a further progress. !costly requests shouldn't be affected much - e.g. order-2 pages would require to have at least 64kB on the reclaimable LRUs while order-9 would need at least 32M which should be enough to not lock up. [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable] [akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch] Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, oom: rework oom detectionMichal Hocko
__alloc_pages_slowpath has traditionally relied on the direct reclaim and did_some_progress as an indicator that it makes sense to retry allocation rather than declaring OOM. shrink_zones had to rely on zone_reclaimable if shrink_zone didn't make any progress to prevent from a premature OOM killer invocation - the LRU might be full of dirty or writeback pages and direct reclaim cannot clean those up. zone_reclaimable allows to rescan the reclaimable lists several times and restart if a page is freed. This is really subtle behavior and it might lead to a livelock when a single freed page keeps allocator looping but the current task will not be able to allocate that single page. OOM killer would be more appropriate than looping without any progress for unbounded amount of time. This patch changes OOM detection logic and pulls it out from shrink_zone which is too low to be appropriate for any high level decisions such as OOM which is per zonelist property. It is __alloc_pages_slowpath which knows how many attempts have been done and what was the progress so far therefore it is more appropriate to implement this logic. The new heuristic is implemented in should_reclaim_retry helper called from __alloc_pages_slowpath. It tries to be more deterministic and easier to follow. It builds on an assumption that retrying makes sense only if the currently reclaimable memory + free pages would allow the current allocation request to succeed (as per __zone_watermark_ok) at least for one zone in the usable zonelist. This alone wouldn't be sufficient, though, because the writeback might get stuck and reclaimable pages might be pinned for a really long time or even depend on the current allocation context. Therefore there is a backoff mechanism implemented which reduces the reclaim target after each reclaim round without any progress. This means that we should eventually converge to only NR_FREE_PAGES as the target and fail on the wmark check and proceed to OOM. The backoff is simple and linear with 1/16 of the reclaimable pages for each round without any progress. We are optimistic and reset counter for successful reclaim rounds. Costly high order pages mostly preserve their semantic and those without __GFP_REPEAT fail right away while those which have the flag set will back off after the amount of reclaimable pages reaches equivalent of the requested order. The only difference is that if there was no progress during the reclaim we rely on zone watermark check. This is more logical thing to do than previous 1<<order attempts which were a result of zone_reclaimable faking the progress. [vdavydov@virtuozzo.com: check classzone_idx for shrink_zone] [hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry] [rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES] [rientjes@google.com: shrink_zones doesn't need to return anything] Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, compaction: abstract compaction feedback to helpersMichal Hocko
Compaction can provide a wild variation of feedback to the caller. Many of them are implementation specific and the caller of the compaction (especially the page allocator) shouldn't be bound to specifics of the current implementation. This patch abstracts the feedback into three basic types: - compaction_made_progress - compaction was active and made some progress. - compaction_failed - compaction failed and further attempts to invoke it would most probably fail and therefore it is not worth retrying - compaction_withdrawn - compaction wasn't invoked for an implementation specific reasons. In the current implementation it means that the compaction was deferred, contended or the page scanners met too early without any progress. Retrying is still worthwhile. [vbabka@suse.cz: do not change thp back off behavior] [akpm@linux-foundation.org: fix typo in comment, per Hillf] Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, compaction: update compaction_result orderingMichal Hocko
compaction_result will be used as the primary feedback channel for compaction users. At the same time try_to_compact_pages (and potentially others) assume a certain ordering where a more specific feedback takes precendence. This gets a bit awkward when we have conflicting feedback from different zones. E.g one returing COMPACT_COMPLETE meaning the full zone has been scanned without any outcome while other returns with COMPACT_PARTIAL aka made some progress. The caller should get COMPACT_PARTIAL because that means that the compaction still can make some progress. The same applies for COMPACT_PARTIAL vs COMPACT_PARTIAL_SKIPPED. Reorder PARTIAL to be the largest one so the larger the value is the more progress we have done. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, compaction: distinguish between full and partial COMPACT_COMPLETEMichal Hocko
COMPACT_COMPLETE now means that compaction and free scanner met. This is not very useful information if somebody just wants to use this feedback and make any decisions based on that. The current caller might be a poor guy who just happened to scan tiny portion of the zone and that could be the reason no suitable pages were compacted. Make sure we distinguish the full and partial zone walks. Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success and be optimistic in retrying. The existing users of COMPACT_COMPLETE are conservatively changed to use COMPACT_PARTIAL_SKIPPED as well but some of them should be probably reconsidered and only defer the compaction only for COMPACT_COMPLETE with the new semantic. This patch shouldn't introduce any functional changes. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPEDMichal Hocko
try_to_compact_pages() can currently return COMPACT_SKIPPED even when the compaction is defered for some zone just because zone DMA is skipped in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED basically unusable for the page allocator as a feedback mechanism. Make sure we distinguish those two states properly and switch their ordering in the enum. This would mean that the COMPACT_SKIPPED will be returned only when all eligible zones are skipped. As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath will be more precise and we would bail out rather than reclaim. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm, compaction: change COMPACT_ constants into enumMichal Hocko
Compaction code is doing weird dances between COMPACT_FOO -> int -> unsigned long But there doesn't seem to be any reason for that. All functions which return/use one of those constants are not expecting any other value so it really makes sense to define an enum for them and make it clear that no other values are expected. This is a pure cleanup and shouldn't introduce any functional changes. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: vmscan: reduce size of inactive file listRik van Riel
The inactive file list should still be large enough to contain readahead windows and freshly written file data, but it no longer is the only source for detecting multiple accesses to file pages. The workingset refault measurement code causes recently evicted file pages that get accessed again after a shorter interval to be promoted directly to the active list. With that mechanism in place, we can afford to (on a larger system) dedicate more memory to the active file list, so we can actually cache more of the frequently used file pages in memory, and not have them pushed out by streaming writes, once-used streaming file reads, etc. This can help things like database workloads, where only half the page cache can currently be used to cache the database working set. This patch automatically increases that fraction on larger systems, using the same ratio that has already been used for anonymous memory. [hannes@cmpxchg.org: cgroup-awareness] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20net: suppress warnings on dev_alloc_skbNeil Horman
Noticed an allocation failure in a network driver the other day on a 32 bit system: DMA-API: debugging out of memory - disabling bnx2fc: adapter_lookup: hba NULL lldpad: page allocation failure. order:0, mode:0x4120 Pid: 4556, comm: lldpad Not tainted 2.6.32-639.el6.i686.debug #1 Call Trace: [<c08a4086>] ? printk+0x19/0x23 [<c05166a4>] ? __alloc_pages_nodemask+0x664/0x830 [<c0649d02>] ? free_object+0x82/0xa0 [<fb4e2c9b>] ? ixgbe_alloc_rx_buffers+0x10b/0x1d0 [ixgbe] [<fb4e2fff>] ? ixgbe_configure_rx_ring+0x29f/0x420 [ixgbe] [<fb4e228c>] ? ixgbe_configure_tx_ring+0x15c/0x220 [ixgbe] [<fb4e3709>] ? ixgbe_configure+0x589/0xc00 [ixgbe] [<fb4e7be7>] ? ixgbe_open+0xa7/0x5c0 [ixgbe] [<fb503ce6>] ? ixgbe_init_interrupt_scheme+0x5b6/0x970 [ixgbe] [<fb4e8e54>] ? ixgbe_setup_tc+0x1a4/0x260 [ixgbe] [<fb505a9f>] ? ixgbe_dcbnl_set_state+0x7f/0x90 [ixgbe] [<c088d80d>] ? dcb_doit+0x10ed/0x16d0 ... Thought that perhaps the big splat in the logs wasn't really necessecary, as all call sites for dev_alloc_skb: a) check the return code for the function and b) either print their own error message or have a recovery path that makes the warning moot. Fix it by modifying dev_alloc_pages to pass __GFP_NOWARN as a gfp flag to suppress the warning applies to the net tree Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20uapi glibc compat: fix compilation when !__USE_MISC in glibcNicolas Dichtel
These structures are defined only if __USE_MISC is set in glibc net/if.h headers, ie when _BSD_SOURCE or _SVID_SOURCE are defined. CC: Jan Engelhardt <jengelh@inai.de> CC: Josh Boyer <jwboyer@fedoraproject.org> CC: Stephen Hemminger <shemming@brocade.com> CC: Waldemar Brodkorb <mail@waldemar-brodkorb.de> CC: Gabriel Laskar <gabriel@lse.epita.fr> CC: Mikko Rapeli <mikko.rapeli@iki.fi> Fixes: 4a91cb61bb99 ("uapi glibc compat: fix compile errors when glibc net/if.h included before linux/if.h") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20Merge tag 'wireless-drivers-next-for-davem-2016-05-13' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== wireless-drivers patches for 4.7 Major changes: iwlwifi * remove IWLWIFI_DEBUG_EXPERIMENTAL_UCODE kconfig option * work for RX multiqueue continues * dynamic queue allocation work continues * add Luca as maintainer * a bunch of fixes and improvements all over brcmfmac * add 4356 sdio support ath6kl * add ability to set debug uart baud rate with a module parameter wil6210 * add debugfs file to configure firmware led functionality ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20net: sock: move ->sk_shutdown out of bitfields.Andrey Ryabinin
->sk_shutdown bits share one bitfield with some other bits in sock struct, such as ->sk_no_check_[r,t]x, ->sk_userlocks ... sock_setsockopt() may write to these bits, while holding the socket lock. In case of AF_UNIX sockets, we change ->sk_shutdown bits while holding only unix_state_lock(). So concurrent setsockopt() and shutdown() may lead to corrupting these bits. Fix this by moving ->sk_shutdown bits out of bitfield into a separate byte. This will not change the 'struct sock' size since ->sk_shutdown moved into previously unused 16-bit hole. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20ip4ip6: Support for GSO/GROTom Herbert
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20fou: Add encap ops for IPv6 tunnelsTom Herbert
This patch add a new fou6 module that provides encapsulation operations for IPv6. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20ip6_tun: Add infrastructure for doing encapsulationTom Herbert
Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions for getting encap hlen, setting up encap on a tunnel, performing encapsulation operation. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20fou: Split out {fou,gue}_build_headerTom Herbert
Create __fou_build_header and __gue_build_header. These implement the protocol generic parts of building the fou and gue header. fou_build_header and gue_build_header implement the IPv4 specific functions and call the __*_build_header functions. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20net: Cleanup encap items in ip_tunnels.hTom Herbert
Consolidate all the ip_tunnel_encap definitions in one spot in the header file. Also, move ip_encap_hlen and ip_tunnel_encap from ip_tunnel.c to ip_tunnels.h so they call be called without a dependency on ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20net: define gso types for IPx over IPv4 and IPv6Tom Herbert
This patch defines two new GSO definitions SKB_GSO_IPXIP4 and SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and NETIF_F_GSO_IPXIP6. These are used to described IP in IP tunnel and what the outer protocol is. The inner protocol can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT are removed (these are both instances of SKB_GSO_IPXIP4). SKB_GSO_IPXIP6 will be used when support for GSO with IP encapsulation over IPv6 is added. Signed-off-by: Tom Herbert <tom@herbertland.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20Merge tag 'devicetree-for-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree updates from Rob Herring: - Rewrite of the unflattening code to avoid recursion and lessen the stack usage. - Rewrite of the phandle args parsing code to get rid of the fixed args size. This is needed for IOMMU code. - Sync to latest dtc which adds more dts style checking. These warnings are enabled with "W=1" compiles. - Tegra documentation updates related to the above warnings. - A bunch of spelling and other doc fixes. - Various vendor prefix additions. * tag 'devicetree-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (52 commits) devicetree: Add Creative Technology vendor id gpio: dt-bindings: add ibm,ppc4xx-gpio binding of/unittest: Remove unnecessary module.h header inclusion drivers/of: Fix build warning in populate_node() drivers/of: Fix depth when unflattening devicetree of: dynamic: changeset prop-update revert fix drivers/of: Export of_detach_node() drivers/of: Return allocated memory from of_fdt_unflatten_tree() drivers/of: Specify parent node in of_fdt_unflatten_tree() drivers/of: Rename unflatten_dt_node() drivers/of: Avoid recursively calling unflatten_dt_node() drivers/of: Split unflatten_dt_node() of: include errno.h in of_graph.h of: document refcount incrementation of of_get_cpu_node() Documentation: dt: soc: fix spelling mistakes Documentation: dt: power: fix spelling mistake Documentation: dt: pinctrl: fix spelling mistake Documentation: dt: opp: fix spelling mistake Documentation: dt: net: fix spelling mistakes Documentation: dt: mtd: fix spelling mistake ...
2016-05-20Merge tag 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "Primary 4.7 merge window changes - Updates to the new Intel X722 iWARP driver - Updates to the hfi1 driver - Fixes for the iw_cxgb4 driver - Misc core fixes - Generic RDMA READ/WRITE API addition - SRP updates - Misc ipoib updates - Minor mlx5 updates" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (148 commits) IB/mlx5: Fire the CQ completion handler from tasklet net/mlx5_core: Use tasklet for user-space CQ completion events IB/core: Do not require CAP_NET_ADMIN for packet sniffing IB/mlx4: Fix unaligned access in send_reply_to_slave IB/mlx5: Report Scatter FCS device capability when supported IB/mlx5: Add Scatter FCS support for Raw Packet QP IB/core: Add Scatter FCS create flag IB/core: Add Raw Scatter FCS device capability IB/core: Add extended device capability flags i40iw: pass hw_stats by reference rather than by value i40iw: Remove unnecessary synchronize_irq() before free_irq() i40iw: constify i40iw_vf_cqp_ops structure IB/mlx5: Add UARs write-combining and non-cached mapping IB/mlx5: Allow mapping the free running counter on PROT_EXEC IB/mlx4: Use list_for_each_entry_safe IB/SA: Use correct free function IB/core: Fix a potential array overrun in CMA and SA agent IB/core: Remove unnecessary check in ibnl_rcv_msg IB/IWPM: Fix a potential skb leak RDMA/nes: replace custom print_hex_dump() ...
2016-05-21drm: Nuke ->vblank_disable_allowedDaniel Vetter
This was added in commit 0a3e67a4caac273a3bfc4ced3da364830b1ab241 Author: Jesse Barnes <jbarnes@virtuousgeek.org> Date: Tue Sep 30 12:14:26 2008 -0700 drm: Rework vblank-wait handling to allow interrupt reduction. to stay backwards-compatible with old UMS code that didn't even tell the kernel when it did a modeset, so that the kernel could save/restore vblank counters. At worst this means vblanks will be somewhat funky on a setup that very likely no one still runs. So let's just nuke it. Plan B would be to set it unconditionally in drm_vblank_init for kms drivers, instead of in each driver separately. So if this patch breaks anything please only restore the hunks in drmP.h and drm_irq.c, plus add a check for DRIVER_MODESET in drm_vblank_init. Stumbled over this in a discussion on irc with Chris. Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Liviu Dudau <liviu.dudau@arm.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: Eric Anholt <eric@anholt.net> Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com> Cc: Inki Dae <inki.dae@samsung.com> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: Mark Yao <mark.yao@rock-chips.com> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Philipp Zabel <p.zabel@pengutronix.de> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Liviu Dudau <Liviu.Dudau@arm.com> Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Tested-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-05-20mlx5: avoid unused variable warningArnd Bergmann
When CONFIG_NET_CLS_ACT is disabled, we get a new warning in the mlx5 ethernet driver because the tc_for_each_action() loop never references the iterator: mellanox/mlx5/core/en_tc.c: In function 'mlx5e_stats_flower': mellanox/mlx5/core/en_tc.c:431:20: error: unused variable 'a' [-Werror=unused-variable] struct tc_action *a; This changes the dummy tc_for_each_action() macro by adding a cast to void, letting the compiler know that the variable is intentionally declared but not used here. I could not come up with a nicer workaround, but this seems to do the trick. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: aad7e08d39bd ("net/mlx5e: Hardware offloaded flower filter statistics support") Fixes: 00175aec941e ("net/sched: Macro instead of CONFIG_NET_CLS_ACT ifdef") Acked-By: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20Merge tag 'mfd-for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull MFD updates from Lee Jones: "New Drivers: - Add new driver for MAXIM MAX77620/MAX20024 PMIC - Add new driver for Hisilicon HI665X PMIC New Device Support: - Add support for AXP809 in axp20x-rsb - Add support for Power Supply in axp20x New core features: - devm_mfd_* managed resources Fix-ups: - Remove unused code (da9063-irq, wm8400-core, tps6105x, smsc-ece1099, twl4030-power) - Improve clean-up in error path (intel_quark_i2c_gpio) - Explicitly include headers (syscon.h) - Allow building as modules (max77693) - Use IS_ENABLED() instead of rolling your own (dm355evm_msp, wm8400-core) - DT adaptions (axp20x, hi655x, arizona, max77620) - Remove CLK_IS_ROOT flag (intel-lpss, intel_quark) - Move to gpiochip API (asic3, dm355evm_msp, htc-egpio, htc-i2cpld, sm501, tc6393xb, tps65010, ucb1x00, vexpress) - Make use of devm_mfd_* calls (act8945a, as3711, atmel-hlcdc, bcm590xx, hi6421-pmic-core, lp3943, menf21bmc, mt6397, rdc321x, rk808, rn5t618, rt5033, sky81452, stw481x, tps6507x, tps65217, wm8400) Bug Fixes" - Fix ACPI child matching (mfd-core) - Fix start-up ordering issues (mt6397-core, arizona-core) - Fix forgotten register state on resume (intel-lpss) - Fix Clock related issues (twl6040) - Fix scheduling whilst atomic (omap-usb-tll) - Kconfig changes (vexpress)" * tag 'mfd-for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (73 commits) mfd: hi655x: Add MFD driver for hi655x mfd: ab8500-debugfs: Trivial fix of spelling mistake on "between" mfd: vexpress: Add !ARCH_USES_GETTIMEOFFSET dependency mfd: Add device-tree binding doc for PMIC MAX77620/MAX20024 mfd: max77620: Add core driver for MAX77620/MAX20024 mfd: arizona: Add defines for GPSW values that can be used from DT mfd: omap-usb-tll: Fix scheduling while atomic BUG mfd: wm5110: ARIZONA_CLOCK_CONTROL should be volatile mfd: axp20x: Add a cell for the ac power_supply part of the axp20x PMICs mfd: intel_soc_pmic_core: Terminate panel control GPIO lookup table correctly mfd: wl1273-core: Use devm_mfd_add_devices() for mfd_device registration mfd: tps65910: Use devm_mfd_add_devices and devm_regmap_add_irq_chip mfd: sec: Use devm_mfd_add_devices and devm_regmap_add_irq_chip mfd: rc5t583: Use devm_mfd_add_devices and devm_request_threaded_irq mfd: max77686: Use devm_mfd_add_devices and devm_regmap_add_irq_chip mfd: as3722: Use devm_mfd_add_devices and devm_regmap_add_irq_chip mfd: twl4030-power: Remove driver path in file comment MAINTAINERS: Add entry for X-Powers AXP family PMIC drivers mfd: smsc-ece1099: Remove unnecessarily remove callback mfd: Use IS_ENABLED(CONFIG_FOO) instead of checking FOO || FOO_MODULE ...
2016-05-20Merge tag 'fbdev-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux Pull fbdev updates from Tomi Valkeinen: - imxfb: fix lcd power up - small fixes and cleanups * tag 'fbdev-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: fbdev: Use IS_ENABLED() instead of checking for built-in or module efifb: Don't show the mapping VA video: AMBA CLCD: Remove unncessary include in amba-clcd.c fbdev: ssd1307fb: Fix charge pump setting Documentation: fb: fix spelling mistakes fbdev: fbmem: implement error handling in fbmem_init() fbdev: sh_mipi_dsi: remove driver video: fbdev: imxfb: add some error handling video: fbdev: imxfb: fix semantic of .get_power and .set_power video: fbdev: omap2: Remove deprecated regulator_can_change_voltage() usage
2016-05-20Merge tag 'powerpc-4.7-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights: - Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V - Live patching support for ppc64le (also merged via livepatching.git) Various cleanups & minor fixes from: - Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie, Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras, Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung Bauermann, Valentin Rothberg, Vipin K Parashar. General: - Update LMB associativity index during DLPAR add/remove from Nathan Fontenot - Fix branching to OOL handlers in relocatable kernel from Hari Bathini - Add support for userspace Power9 copy/paste from Chris Smart - Always use STRICT_MM_TYPECHECKS from Michael Ellerman - Add mask of possible MMU features from Michael Ellerman PCI: - Enable pass through of NVLink to guests from Alexey Kardashevskiy - Cleanups in preparation for powernv PCI hotplug from Gavin Shan - Don't report error in eeh_pe_reset_and_recover() from Gavin Shan - Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan - Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" from Guilherme G Piccoli - Remove the dependency on EEH struct in DDW mechanism from Guilherme G Piccoli selftests: - Test cp_abort during context switch from Chris Smart - Add several tests for transactional memory support from Rashmica Gupta perf: - Add support for sampling interrupt register state from Anju T - Add support for unwinding perf-stackdump from Chandan Kumar cxl: - Configure the PSL for two CAPI ports on POWER8NVL from Philippe Bergheaud - Allow initialization on timebase sync failures from Frederic Barrat - Increase timeout for detection of AFU mmio hang from Frederic Barrat - Handle num_of_processes larger than can fit in the SPA from Ian Munsie - Ensure PSL interrupt is configured for contexts with no AFU IRQs from Ian Munsie - Add kernel API to allow a context to operate with relocate disabled from Ian Munsie - Check periodically the coherent platform function's state from Christophe Lombard Freescale: - Updates from Scott: "Contains 86xx fixes, minor device tree fixes, an erratum workaround, and a kconfig dependency fix." * tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits) powerpc/86xx: Fix PCI interrupt map definition powerpc/86xx: Move pci1 definition to the include file powerpc/fsl: Fix build of the dtb embedded kernel images powerpc/fsl: Fix rcpm compatible string powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC powerpc/fsl-pci: Add a workaround for PCI 5 errata powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb powerpc/powernv/npu: Add PE to PHB's list powerpc/powernv: Fix insufficient memory allocation powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner() powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover() powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover() powerpc/eeh: Don't report error in eeh_pe_reset_and_recover() Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()" powerpc/powernv/npu: Enable NVLink pass through powerpc/powernv/npu: Rework TCE Kill handling powerpc/powernv/npu: Add set/unset window helpers powerpc/powernv/ioda2: Export debug helper pe_level_printk() ...
2016-05-20KVM: arm/arm64: vgic-new: implement mapped IRQ handlingAndre Przywara
We now store the mapped hardware IRQ number in our struct, so we don't need the irq_phys_map for the new VGIC. Implement the hardware IRQ mapping on top of the reworked arch timer interface. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
2016-05-20KVM: arm/arm64: vgic-new: vgic_init: implement map_resourcesEric Auger
map_resources is the last initialization step. It is executed on first VCPU run. At that stage the code checks that userspace has provided the base addresses for the relevant VGIC regions, which depend on the type of VGIC that is exposed to the guest. Also we check if the two regions overlap. If the checks succeeded, we register the respective register frames with the kvm_io_bus framework. If we emulate a GICv2, the function also forces vgic_init execution if it has not been executed yet. Also we map the virtual GIC CPU interface onto the guest's CPU interface. Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
2016-05-20KVM: arm/arm64: vgic-new: vgic_init: implement vgic_initEric Auger
This patch allocates and initializes the data structures used to model the vgic distributor and virtual cpu interfaces. At that stage the number of IRQs and number of virtual CPUs is frozen. Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
2016-05-20KVM: arm/arm64: vgic-new: vgic_init: implement vgic_createEric Auger
This patch implements the vgic_creation function which is called on CREATE_IRQCHIP VM IOCTL (v2 only) or KVM_CREATE_DEVICE Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
2016-05-20KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_initEric Auger
Implements kvm_vgic_hyp_init and vgic_probe function. This uses the new firmware independent VGIC probing to support both ACPI and DT based systems (code from Marc Zyngier). The vgic_global struct is enriched with new fields populated by those functions. Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
2016-05-20KVM: arm/arm64: vgic-new: vgic_kvm_device: implement kvm_vgic_addrEric Auger
kvm_vgic_addr is used by the userspace to set the base address of the following register regions, as seen by the guest: - distributor(v2 and v3), - re-distributors (v3), - CPU interface (v2). Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
2016-05-20KVM: arm/arm64: vgic-new: Add GICv3 SGI system register trap handlerAndre Przywara
In contrast to GICv2 SGIs in a GICv3 implementation are not triggered by a MMIO write, but with a system register write. KVM knows about that register already, we just need to implement the handler and wire it up to the core KVM/ARM code. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>