summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-02-01autofs: drop dentry reference only when it is never usedPan Bian
autofs_expire_run() calls dput(dentry) to drop the reference count of dentry. However, dentry is read via autofs_dentry_ino(dentry) after that. This may result in a use-free-bug. The patch drops the reference count of dentry only when it is never used. Link: http://lkml.kernel.org/r/154725122396.11260.16053424107144453867.stgit@pluto-themaw-net Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()Jan Kara
When superblock has lots of inodes without any pagecache (like is the case for /proc), drop_pagecache_sb() will iterate through all of them without dropping sb->s_inode_list_lock which can lead to softlockups (one of our customers hit this). Fix the problem by going to the slow path and doing cond_resched() in case the process needs rescheduling. Link: http://lkml.kernel.org/r/20190114085343.15011-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm: migrate: don't rely on __PageMovable() of newpage after unlocking itDavid Hildenbrand
We had a race in the old balloon compaction code before b1123ea6d3b3 ("mm: balloon: use general non-lru movable page feature") refactored it that became visible after backporting 195a8c43e93d ("virtio-balloon: deflate via a page list") without the refactoring. The bug existed from commit d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management") till b1123ea6d3b3 ("mm: balloon: use general non-lru movable page feature"). d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management") was backported to 3.12, so the broken kernels are stable kernels [3.12 - 4.7]. There was a subtle race between dropping the page lock of the newpage in __unmap_and_move() and checking for __is_movable_balloon_page(newpage). Just after dropping this page lock, virtio-balloon could go ahead and deflate the newpage, effectively dequeueing it and clearing PageBalloon, in turn making __is_movable_balloon_page(newpage) fail. This resulted in dropping the reference of the newpage via putback_lru_page(newpage) instead of put_page(newpage), leading to page->lru getting modified and a !LRU page ending up in the LRU lists. With 195a8c43e93d ("virtio-balloon: deflate via a page list") backported, one would suddenly get corrupted lists in release_pages_balloon(): - WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0 - list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100 Nowadays this race is no longer possible, but it is hidden behind very ugly handling of __ClearPageMovable() and __PageMovable(). __ClearPageMovable() will not make __PageMovable() fail, only PageMovable(). So the new check (__PageMovable(newpage)) will still hold even after newpage was dequeued by virtio-balloon. If anybody would ever change that special handling, the BUG would be introduced again. So instead, make it explicit and use the information of the original isolated page before migration. This patch can be backported fairly easy to stable kernels (in contrast to the refactoring). Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Vratislav Bendel <vbendel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vratislav Bendel <vbendel@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [3.12 - 4.7] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01psi: clarify the Kconfig text for the default-disable optionJohannes Weiner
The current help text caused some confusion in online forums about whether or not to default-enable or default-disable psi in vendor kernels. This is because it doesn't communicate the reason for why we made this setting configurable in the first place: that the overhead is non-zero in an artificial scheduler stress test. Since this isn't representative of real workloads, and the effect was not measurable in scheduler-heavy real world applications such as the webservers and memcache installations at Facebook, it's fair to point out that this is a pretty cautious option to select. Link: http://lkml.kernel.org/r/20190129233617.16767-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm, memory_hotplug: __offline_pages fix wrong lockingMichal Hocko
Jan has noticed that we do double unlock on some failure paths when offlining a page range. This is indeed the case when test_pages_in_a_zone respp. start_isolate_page_range fail. This was an omission when forward porting the debugging patch from an older kernel. Fix the issue by dropping mem_hotplug_done from the failure condition and keeping the single unlock in the catch all failure path. Link: http://lkml.kernel.org/r/20190115120307.22768-1-mhocko@kernel.org Fixes: 7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Jan Kara <jack@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm: hwpoison: use do_send_sig_info() instead of force_sig()Naoya Horiguchi
Currently memory_failure() is racy against process's exiting, which results in kernel crash by null pointer dereference. The root cause is that memory_failure() uses force_sig() to forcibly kill asynchronous (meaning not in the current context) processes. As discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM fixes, this is not a right thing to do. OOM solves this issue by using do_send_sig_info() as done in commit d2d393099de2 ("signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this patch is suggesting to do the same for hwpoison. do_send_sig_info() properly accesses to siglock with lock_task_sighand(), so is free from the reported race. I confirmed that the reported bug reproduces with inserting some delay in kill_procs(), and it never reproduces with this patch. Note that memory_failure() can send another type of signal using force_sig_mceerr(), and the reported race shouldn't happen on it because force_sig_mceerr() is called only for synchronous processes (i.e. BUS_MCEERR_AR happens only when some process accesses to the corrupted memory.) Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01kasan: mark file common so ftrace doesn't trace itAnders Roxell
When option CONFIG_KASAN is enabled toghether with ftrace, function ftrace_graph_caller() gets in to a recursion, via functions kasan_check_read() and kasan_check_write(). Breakpoint 2, ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179 179 mcount_get_pc x0 // function's pc (gdb) bt #0 ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179 #1 0xffffff90101406c8 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151 #2 0xffffff90106fd084 in kasan_check_write (p=0xffffffc06c170878, size=4) at ../mm/kasan/common.c:105 #3 0xffffff90104a2464 in atomic_add_return (v=<optimized out>, i=<optimized out>) at ./include/generated/atomic-instrumented.h:71 #4 atomic_inc_return (v=<optimized out>) at ./include/generated/atomic-fallback.h:284 #5 trace_graph_entry (trace=0xffffffc03f5ff380) at ../kernel/trace/trace_functions_graph.c:441 #6 0xffffff9010481774 in trace_graph_entry_watchdog (trace=<optimized out>) at ../kernel/trace/trace_selftest.c:741 #7 0xffffff90104a185c in function_graph_enter (ret=<optimized out>, func=<optimized out>, frame_pointer=18446743799894897728, retp=<optimized out>) at ../kernel/trace/trace_functions_graph.c:196 #8 0xffffff9010140628 in prepare_ftrace_return (self_addr=18446743592948977792, parent=0xffffffc03f5ff418, frame_pointer=18446743799894897728) at ../arch/arm64/kernel/ftrace.c:231 #9 0xffffff90101406f4 in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182 Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) Rework so that the kasan implementation isn't traced. Link: http://lkml.kernel.org/r/20181212183447.15890-1-anders.roxell@linaro.org Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Acked-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01init/Kconfig: fix grammar by moving a closing parenthesisJonathan Neuschäfer
Link: http://lkml.kernel.org/r/20190129150813.15785-1-j.neuschaefer@gmx.net Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01lib/test_kmod.c: potential double free in error handlingDan Carpenter
There is a copy and paste bug so we set "config->test_driver" to NULL twice instead of setting "config->test_fs". Smatch complains that it leads to a double free: lib/test_kmod.c:840 __kmod_config_init() warn: 'config->test_fs' double freed Link: http://lkml.kernel.org/r/20190121140011.GA14283@kadam Fixes: d9c6a72d6fa2 ("kmod: add test driver to stress test the module loader") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm, oom: fix use-after-free in oom_kill_processShakeel Butt
Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm/hotplug: invalid PFNs from pfn_to_online_page()Qian Cai
On an arm64 ThunderX2 server, the first kmemleak scan would crash [1] with CONFIG_DEBUG_VM_PGFLAGS=y due to page_to_nid() found a pfn that is not directly mapped (MEMBLOCK_NOMAP). Hence, the page->flags is uninitialized. This is due to the commit 9f1eb38e0e11 ("mm, kmemleak: little optimization while scanning") starts to use pfn_to_online_page() instead of pfn_valid(). However, in the CONFIG_MEMORY_HOTPLUG=y case, pfn_to_online_page() does not call memblock_is_map_memory() while pfn_valid() does. Historically, the commit 68709f45385a ("arm64: only consider memblocks with NOMAP cleared for linear mapping") causes pages marked as nomap being no long reassigned to the new zone in memmap_init_zone() by calling __init_single_page(). Since the commit 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes") introduced pfn_to_online_page() and was designed to return a valid pfn only, but it is clearly broken on arm64. Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can handle nomap thanks to the commit f52bb98f5ade ("arm64: mm: always enable CONFIG_HOLES_IN_ZONE"), while it will be optimized away on architectures where have no HOLES_IN_ZONE. [1] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006 Mem abort info: ESR = 0x96000005 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000005 CM = 0, WnR = 0 Internal error: Oops: 96000005 [#1] SMP CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8 pstate: 60400009 (nZCv daif +PAN -UAO) pc : page_mapping+0x24/0x144 lr : __dump_page+0x34/0x3dc sp : ffff00003a5cfd10 x29: ffff00003a5cfd10 x28: 000000000000802f x27: 0000000000000000 x26: 0000000000277d00 x25: ffff000010791f56 x24: ffff7fe000000000 x23: ffff000010772f8b x22: ffff00001125f670 x21: ffff000011311000 x20: ffff000010772f8b x19: fffffffffffffffe x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: ffff802698b19600 x13: ffff802698b1a200 x12: ffff802698b16f00 x11: ffff802698b1a400 x10: 0000000000001400 x9 : 0000000000000001 x8 : ffff00001121a000 x7 : 0000000000000000 x6 : ffff0000102c53b8 x5 : 0000000000000000 x4 : 0000000000000003 x3 : 0000000000000100 x2 : 0000000000000000 x1 : ffff000010772f8b x0 : ffffffffffffffff Process kmemleak (pid: 1408, stack limit = 0x(____ptrval____)) Call trace: page_mapping+0x24/0x144 __dump_page+0x34/0x3dc dump_page+0x28/0x4c kmemleak_scan+0x4ac/0x680 kmemleak_scan_thread+0xb4/0xdc kthread+0x12c/0x13c ret_from_fork+0x10/0x18 Code: d503201f f9400660 36000040 d1000413 (f9400661) ---[ end trace 4d4bd7f573490c8e ]--- Kernel panic - not syncing: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x002,20000c38 Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception ]--- Link: http://lkml.kernel.org/r/20190122132916.28360-1-cai@lca.pw Fixes: 9f1eb38e0e11 ("mm, kmemleak: little optimization while scanning") Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepagesOscar Salvador
This is the same sort of error we saw in commit 17e2e7d7e1b8 ("mm, page_alloc: fix has_unmovable_pages for HugePages"). Gigantic hugepages cross several memblocks, so it can be that the page we get in scan_movable_pages() is a page-tail belonging to a 1G-hugepage. If that happens, page_hstate()->size_to_hstate() will return NULL, and we will blow up in hugepage_migration_supported(). The splat is as follows: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 1350 Comm: bash Tainted: G E 5.0.0-rc1-mm1-1-default+ #27 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:__offline_pages+0x6ae/0x900 Call Trace: memory_subsys_offline+0x42/0x60 device_offline+0x80/0xa0 state_store+0xab/0xc0 kernfs_fop_write+0x102/0x180 __vfs_write+0x26/0x190 vfs_write+0xad/0x1b0 ksys_write+0x42/0x90 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E) [akpm@linux-foundation.org: fix brace layout, per David. Reduce indentation] Link: http://lkml.kernel.org/r/20190122154407.18417-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01psi: fix aggregation idle shut-offJohannes Weiner
psi has provisions to shut off the periodic aggregation worker when there is a period of no task activity - and thus no data that needs aggregating. However, while developing psi monitoring, Suren noticed that the aggregation clock currently won't stay shut off for good. Debugging this revealed a flaw in the idle design: an aggregation run will see no task activity and decide to go to sleep; shortly thereafter, the kworker thread that executed the aggregation will go idle and cause a scheduling change, during which the psi callback will kick the !pending worker again. This will ping-pong forever, and is equivalent to having no shut-off logic at all (but with more code!) Fix this by exempting aggregation workers from psi's clock waking logic when the state change is them going to sleep. To do this, tag workers with the last work function they executed, and if in psi we see a worker going to sleep after aggregating psi data, we will not reschedule the aggregation work item. What if the worker is also executing other items before or after? Any psi state times that were incurred by work items preceding the aggregation work will have been collected from the per-cpu buckets during the aggregation itself. If there are work items following the aggregation work, the worker's last_func tag will be overwritten and the aggregator will be kept alive to process this genuine new activity. If the aggregation work is the last thing the worker does, and we decide to go idle, the brief period of non-idle time incurred between the aggregation run and the kworker's dequeue will be stranded in the per-cpu buckets until the clock is woken by later activity. But that should not be a problem. The buckets can hold 4s worth of time, and future activity will wake the clock with a 2s delay, giving us 2s worth of data we can leave behind when disabling aggregation. If it takes a worker more than two seconds to go idle after it finishes its last work item, we likely have bigger problems in the system, and won't notice one sample that was averaged with a bogus per-CPU weight. Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zoneMikhail Zaslonko
If memory end is not aligned with the sparse memory section boundary, the mapping of such a section is only partly initialized. This may lead to VM_BUG_ON due to uninitialized struct pages access from test_pages_in_a_zone() function triggered by memory_hotplug sysfs handlers. Here are the the panic examples: CONFIG_DEBUG_VM_PGFLAGS=y kernel parameter mem=2050M -------------------------- page:000003d082008000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: test_pages_in_a_zone+0xde/0x160 show_valid_zones+0x5c/0x190 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: test_pages_in_a_zone+0xde/0x160 Kernel panic - not syncing: Fatal exception: panic_on_oops Fix this by checking whether the pfn to check is within the zone. [mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com] Link: http://lkml.kernel.org/r/20190128144506.15603-3-mhocko@kernel.org [mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com] Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm, memory_hotplug: is_mem_section_removable do not pass the end of a zoneMichal Hocko
Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2. Mikhail Zaslonko has posted fixes for the two bugs quite some time ago [1]. I have pushed back on those fixes because I believed that it is much better to plug the problem at the initialization time rather than play whack-a-mole all over the hotplug code and find all the places which expect the full memory section to be initialized. We have ended up with commit 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section") merged and cause a regression [2][3]. The reason is that there might be memory layouts when two NUMA nodes share the same memory section so the merged fix is simply incorrect. In order to plug this hole we really have to be zone range aware in those handlers. I have split up the original patch into two. One is unchanged (patch 2) and I took a different approach for `removable' crash. [1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948 [3] http://lkml.kernel.org/r/20190125163938.GA20411@dhcp22.suse.cz This patch (of 2): Mikhail has reported the following VM_BUG_ON triggered when reading sysfs removable state of a memory block: page:000003d08300c000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: is_mem_section_removable+0xb4/0x190 show_mem_removable+0x9a/0xd8 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: is_mem_section_removable+0xb4/0x190 Kernel panic - not syncing: Fatal exception: panic_on_oops The reason is that the memory block spans the zone boundary and we are stumbling over an unitialized struct page. Fix this by enforcing zone range in is_mem_section_removable so that we never run away from a zone. Link: http://lkml.kernel.org/r/20190128144506.15603-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Debugged-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01oom, oom_reaper: do not enqueue same task twiceTetsuo Handa
Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm: migrate: make buffer_migrate_page_norefs() actually succeedJan Kara
Currently, buffer_migrate_page_norefs() was constantly failing because buffer_migrate_lock_buffers() grabbed reference on each buffer. In fact, there's no reason for buffer_migrate_lock_buffers() to grab any buffer references as the page is locked during all our operation and thus nobody can reclaim buffers from the page. So remove grabbing of buffer references which also makes buffer_migrate_page_norefs() succeed. Link: http://lkml.kernel.org/r/20190116131217.7226-1-jack@suse.cz Fixes: 89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()" Signed-off-by: Jan Kara <jack@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01kernel/exit.c: release ptraced tasks before zap_pid_ns_processesAndrei Vagin
Currently, exit_ptrace() adds all ptraced tasks in a dead list, then zap_pid_ns_processes() waits on all tasks in a current pidns, and only then are tasks from the dead list released. zap_pid_ns_processes() can get stuck on waiting tasks from the dead list. In this case, we will have one unkillable process with one or more dead children. Thanks to Oleg for the advice to release tasks in find_child_reaper(). Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()") Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01x86_64: increase stack size for KASAN_EXTRAQian Cai
If the kernel is configured with KASAN_EXTRA, the stack size is increasted significantly because this option sets "-fstack-reuse" to "none" in GCC [1]. As a result, it triggers stack overrun quite often with 32k stack size compiled using GCC 8. For example, this reproducer https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise06.c triggers a "corrupted stack end detected inside scheduler" very reliably with CONFIG_SCHED_STACK_END_CHECK enabled. There are just too many functions that could have a large stack with KASAN_EXTRA due to large local variables that have been called over and over again without being able to reuse the stacks. Some noticiable ones are size 7648 shrink_page_list 3584 xfs_rmap_convert 3312 migrate_page_move_mapping 3312 dev_ethtool 3200 migrate_misplaced_transhuge_page 3168 copy_process There are other 49 functions are over 2k in size while compiling kernel with "-Wframe-larger-than=" even with a related minimal config on this machine. Hence, it is too much work to change Makefiles for each object to compile without "-fsanitize-address-use-after-scope" individually. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715#c23 Although there is a patch in GCC 9 to help the situation, GCC 9 probably won't be released in a few months and then it probably take another 6-month to 1-year for all major distros to include it as a default. Hence, the stack usage with KASAN_EXTRA can be revisited again in 2020 when GCC 9 is everywhere. Until then, this patch will help users avoid stack overrun. This has already been fixed for arm64 for the same reason via 6e8830674ea ("arm64: kasan: Increase stack size for KASAN_EXTRA"). Link: http://lkml.kernel.org/r/20190109215209.2903-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAITAndrea Arcangeli
hugetlb needs the same fix as faultin_nopage (which was applied in commit 96312e61282a ("mm/gup.c: teach get_user_pages_unlocked to handle FOLL_NOWAIT")) or KVM hangs because it thinks the mmap_sem was already released by hugetlb_fault() if it returned VM_FAULT_RETRY, but it wasn't in the FOLL_NOWAIT case. Link: http://lkml.kernel.org/r/20190109020203.26669-2-aarcange@redhat.com Fixes: ce53053ce378 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01arch: unexport asm/shmparam.h for all architecturesMasahiro Yamada
Most architectures do not export shmparam.h to user-space. $ find arch -name shmparam.h | sort arch/alpha/include/asm/shmparam.h arch/arc/include/asm/shmparam.h arch/arm64/include/asm/shmparam.h arch/arm/include/asm/shmparam.h arch/csky/include/asm/shmparam.h arch/ia64/include/asm/shmparam.h arch/mips/include/asm/shmparam.h arch/nds32/include/asm/shmparam.h arch/nios2/include/asm/shmparam.h arch/parisc/include/asm/shmparam.h arch/powerpc/include/asm/shmparam.h arch/s390/include/asm/shmparam.h arch/sh/include/asm/shmparam.h arch/sparc/include/asm/shmparam.h arch/x86/include/asm/shmparam.h arch/xtensa/include/asm/shmparam.h Strangely, some users of the asm-generic wrapper export shmparam.h $ git grep 'generic-y += shmparam.h' arch/c6x/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/h8300/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/hexagon/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/m68k/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/microblaze/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/openrisc/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/riscv/include/asm/Kbuild:generic-y += shmparam.h arch/unicore32/include/uapi/asm/Kbuild:generic-y += shmparam.h The newly added riscv correctly creates the asm-generic wrapper in the kernel space, but the others (c6x, h8300, hexagon, m68k, microblaze, openrisc, unicore32) create the one in the uapi directory. Digging into the git history, now I guess fcc8487d477a ("uapi: export all headers under uapi directories") was the misconversion. Prior to that commit, no architecture exported to shmparam.h As its commit description said, that commit exported shmparam.h for c6x, h8300, hexagon, m68k, openrisc, unicore32. 83f0124ad81e ("microblaze: remove asm-generic wrapper headers") accidentally exported shmparam.h for microblaze. This commit unexports shmparam.h for those architectures. There is no more reason to export include/uapi/asm-generic/shmparam.h, so it has been moved to include/asm-generic/shmparam.h Link: http://lkml.kernel.org/r/1546904307-11124-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Stafford Horne <shorne@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guo Ren <guoren@kernel.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Mark Salter <msalter@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Jonas Bonn <jonas@southpole.se> Cc: Vincent Chen <deanbo422@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01proc: fix /proc/net/* after setns(2)Alexey Dobriyan
/proc entries under /proc/net/* can't be cached into dcache because setns(2) can change current net namespace. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: avoid vim miscolorization] [adobriyan@gmail.com: write test, add dummy ->d_revalidate hook: necessary if /proc/net/* is pinned at setns time] Link: http://lkml.kernel.org/r/20190108192350.GA12034@avx2 Link: http://lkml.kernel.org/r/20190107162336.GA9239@avx2 Fixes: 1da4d377f943fe4194ffb9fb9c26cc58fad4dd24 ("proc: revalidate misc dentries") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Mateusz Stępień <mateusz.stepien@netrounds.com> Reported-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01mm, memory_hotplug: don't bail out in do_migrate_range() prematurelyOscar Salvador
do_migrate_range() takes a memory range and tries to isolate the pages to put them into a list. This list will be later on used in migrate_pages() to know the pages we need to migrate. Currently, if we fail to isolate a single page, we put all already isolated pages back to their LRU and we bail out from the function. This is quite suboptimal, as this will force us to start over again because scan_movable_pages will give us the same range. If there is no chance that we can isolate that page, we will loop here forever. Issue debugged in [1] has proved that. During the debugging of that issue, it was noticed that if do_migrate_ranges() fails to isolate a single page, we will just discard the work we have done so far and bail out, which means that scan_movable_pages() will find again the same set of pages. Instead, we can just skip the error, keep isolating as much pages as possible and then proceed with the call to migrate_pages(). This will allow us to do as much work as possible at once. [1] https://lkml.org/lkml/2018/12/6/324 Michal said: : I still think that this doesn't give us a whole picture. Looping for : ever is a bug. Failing the isolation is quite possible and it should : be a ephemeral condition (e.g. a race with freeing the page or : somebody else isolating the page for whatever reason). And here comes : the disadvantage of the current implementation. We simply throw : everything on the floor just because of a ephemeral condition. The : racy page_count check is quite dubious to prevent from that. Link: http://lkml.kernel.org/r/20181211135312.27034-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01Merge branch 'devlink-add-device-driver-information-API'David S. Miller
Jakub Kicinski says: ==================== devlink: add device (driver) information API fw_version field in ethtool -i does not suit modern needs with 31 characters being quite limiting on more complex systems. There is also no distinction between the running and flashed versions of the firmware. Since the driver information pertains to the entire device, rather than a particular netdev, it seems wise to move it do devlink, at the same time fixing the aforementioned issues. The new API allows exposing the device serial number and versions of the components of the card - both hardware, firmware (running and flashed). Driver authors can choose descriptive identifiers for the version fields. A few version identifiers which seemed relevant for most devices have been added to the global devlink header. Example: $ devlink dev info pci/0000:05:00.0 pci/0000:05:00.0: driver nfp serial_number 16240145 versions: fixed: board.id AMDA0099-0001 board.rev 07 board.vendor SMA board.model carbon running: fw.mgmt: 010156.010156.010156 fw.cpld: 0x44 fw.app: sriov-2.1.16 stored: fw.mgmt: 010158.010158.010158 fw.cpld: 0x44 fw.app: sriov-2.1.20 Last patch also includes a compat code for ethtool. If driver reports no fw_version via the traditional ethtool API, ethtool can call into devlink and try to cram as many versions as possible into the 31 characters. v4: - use IS_REACHABLE instead of IS_ENABLED in last patch. v3 (Jiri): - rename various functions and attributes; - break out the version helpers per-type; - make the compat code parse a dump instead of special casing in each helper; - move generic version defines to a separate patch. v2: - rebase. this non-RFC, v3 some would say: - add three more versions in the NFP patches; - add last patch (ethool compat) - Andrew & Michal. RFCv2: - use one driver op; - allow longer serial number; - wrap the skb into an opaque request struct; - add some common identifier into the devlink header. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01ethtool: add compat for devlink infoJakub Kicinski
If driver did not fill the fw_version field, try to call into the new devlink get_info op and collect the versions that way. We assume ethtool was always reporting running versions. v4: - use IS_REACHABLE() to avoid problems with DEVLINK=m (kbuildbot). v3 (Jiri): - do a dump and then parse it instead of special handling; - concatenate all versions (well, all that fit :)). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01nfp: devlink: report the running and flashed versionsJakub Kicinski
Report versions of firmware components using the new NSP command. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01nfp: nsp: add support for versions commandJakub Kicinski
Retrieve the FW versions with the new command. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01nfp: devlink: report fixed versionsJakub Kicinski
Report information about the hardware. RFCv2: - add defines for board IDs which are likely to be reusable for other drivers (Jiri). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01nfp: devlink: report driver name and serial numberJakub Kicinski
Report the basic info through new devlink info API. RFCv2: - add driver name; - align serial to core changes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01devlink: add generic info version namesJakub Kicinski
Add defines and docs for generic info versions. v3: - add docs; - separate patch (Jiri). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01devlink: add version reporting to devlink info APIJakub Kicinski
ethtool -i has a few fixed-size fields which can be used to report firmware version and expansion ROM version. Unfortunately, modern hardware has more firmware components. There is usually some datapath microcode, management controller, PXE drivers, and a CPLD load. Running ethtool -i on modern controllers reveals the fact that vendors cram multiple values into firmware version field. Here are some examples from systems I could lay my hands on quickly: tg3: "FFV20.2.17 bc 5720-v1.39" i40e: "6.01 0x800034a4 1.1747.0" nfp: "0.0.3.5 0.25 sriov-2.1.16 nic" Add a new devlink API to allow retrieving multiple versions, and provide user-readable name for those versions. While at it break down the versions into three categories: - fixed - this is the board/fixed component version, usually vendors report information like the board version in the PCI VPD, but it will benefit from naming and common API as well; - running - this is the running firmware version; - stored - this is firmware in the flash, after firmware update this value will reflect the flashed version, while the running version may only be updated after reboot. v3: - add per-type helpers instead of using the special argument (Jiri). RFCv2: - remove the nesting in attr DEVLINK_ATTR_INFO_VERSIONS (now versions are mixed with other info attrs)l - have the driver report versions from the same callback as other info. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01devlink: add device information APIJakub Kicinski
ethtool -i has served us well for a long time, but its showing its limitations more and more. The device information should also be reported per device not per-netdev. Lay foundation for a simple devlink-based way of reading device info. Add driver name and device serial number as initial pieces of information exposed via this new API. v3: - rename helpers (Jiri); - rename driver name attr (Jiri); - remove double spacing in commit message (Jiri). RFC v2: - wrap the skb into an opaque structure (Jiri); - allow the serial number of be any length (Jiri & Andrew); - add driver name (Jonathan). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2019-01-31 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) disable preemption in sender side of socket filters, from Alexei. 2) fix two potential deadlocks in syscall bpf lookup and prog_register, from Martin and Alexei. 3) fix BTF to allow typedef on func_proto, from Yonghong. 4) two bpftool fixes, from Jiri and Paolo. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01Merge branch 'selftests-Various-fixes'David S. Miller
Petr Machata says: ==================== selftests: Various fixes This patch set contains various fixes whose common denominator is improving quality of forwarding and mlxsw selftests. Most of the fixes are improvements in determinism (such that timing and latency don't impact the test performance). These were prompted by regular runs of the test suite on a hardware emulator, the performance of which is necessarily lower than that of the real device. Patches #1 (from Ido), #2 and #3 make changes to ping limits. Patches #4 and #5 add more sleep in places where things need more time to finish. Patches #6 and #7 fix two tests in the suite of mirror-to-gretap tests where underlay involves a VLAN device over an 802.1q bridge. Patches #8, #9 and #10 fix bugs in mirror-to-gretap test where underlay involves a LAG device. Patch #11 fixes a missed RET initialization in mirror-to-gretap flower test. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_flower: Fix test result handlingPetr Machata
The global variable RET needs to be initialized before each call to log_test. This test case sets it once before running the tests, but then calls log_tests for every individual test. Thus a failure in one of the tests causes spurious failures in follow-up tests as well. Fix by moving the initialization of RET from test_all() to full_test_span_gre_dir_acl(), a function that implements the test. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_bridge_1q_lag: Ignore ARPPetr Machata
This test sets up mirroring such that it mirrors all overlay traffic. That includes ARP, which causes occasional miscounts and spurious failures. Ignore ARP explicitly to avoid these problems. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_bridge_1q_lag: Enable forwardingPetr Machata
This test relies on routing in the primary traffic path, but neglects to enable forwarding. Do so. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_bridge_1q_lag: Flush neighborsPetr Machata
After one LAG slave is downed and another upped, it takes a while for the neighbor on a bridge to time out and get renegotiated. The test does prompt update of FDB entries by arpinging. But because the neighbor still references another address, offloading is not possible, and some packets may end up not being mirrored. To force the neighbor renegotiation, simply flush the neighbor table at the bridge. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_vlan_bridge_1q: Fix roaming testPetr Machata
ARP or ND traffic can cause spurious migration of FDB back to $swp3. Mirroring is then updated in accordance with the change, and mirrored packets are seen at h3, causing a failure. Detect the case of this spurious roaming, and retry the test. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_vlan_bridge_1q: Fix untagged testPetr Machata
The untagged egress test sets up mirroring to {,ip6}gretap such that the underlay goes through a bridge. Then VLAN flags are manipulated to test that the traffic leaves the bridge 802.1q-tagged or not, as appropriate. However, when a neighbor expires at the time that the bridge VLAN is configured as PVID and egress untagged, the following discovery process can't finish, because the IP address on H3 is still at the VLAN-tagged netdevice. This manifests by occasional failures where only several of the 10 required packets get through. Therefore, when reconfiguring the VLAN flags, move the IP address to the appropriate device in the H3 VRF. In addition to that, take this opportunity to embed an ASCII art diagram to make the topology move obvious. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_lib: Wait for tardy mirrored packetsPetr Machata
When running in an environment with poor performance (such as a simulator), processing mirrored packets can take a while. Evaluating the condition too soon leads to spurious "seen 9, expected 10" failures as the last packet doesn't have enough time to get mirrored and the mirror to arrive and bump the observed counters. Wait for one ping interval before evaluating the test. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_changes: Fix TTL testPetr Machata
When running in a simulator, the TTL change takes a while to settle and during this time the performance of the packet processing is lowered. The resulting instability leads to ping sending more packets as it assumes some have been dropped. This then leads to regular spurious failures as more packets than expected are observed. Sleep a bit to give the system time to stabilize. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: mlxsw: Update ping limitsPetr Machata
The current ping intervals are too short for running mirroring tests in simulator. This leads to ping sending a follow-up ping before the reply arrives, thus sending more than the requested 10 ICMP requests. This traffic is seen at the counters, and causes spurious failures. Bump interval and timeout numbers 5x in mirroring tests to address the spurious failures. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_lib: Update ping limitsPetr Machata
The current ping intervals are too short for running mirroring tests in simulator. This leads to ping sending a follow-up ping before the reply arrives, thus sending more than the requested 10 ICMP requests. Those are mirrored, and over a certain threshold the test case run is considered a failure, because too much traffic is observed. Bump interval and timeout numbers 5x in mirroring tests to address the spurious failures. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: Make ping timeout configurableIdo Schimmel
The current timeout (2 seconds) proved to be too low for some (emulated) systems where we run the tests. Make the timeout configurable and default to 5 seconds. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01ipconfig: add carrier_timeout kernel parameterMartin Kepplinger
commit 3fb72f1e6e61 ("ipconfig wait for carrier") added a "wait for carrier" policy, with a fixed worst case maximum wait of two minutes. Now make the wait for carrier timeout configurable on the kernel commandline and use the 120s as the default. The timeout messages introduced with commit 5e404cd65860 ("ipconfig: add informative timeout messages while waiting for carrier") are done in a fixed interval of 20 seconds, just like they were before (240/12). Signed-off-by: Martin Kepplinger <martin.kepplinger@ginzinger.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01ipv4: fib: use struct_size() in kzalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01nfp: use struct_size() in kzalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01tulip: eeprom: use struct_size() in kmalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kmalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01cxgb4: smt: use struct_size() in kvzalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kvzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>