summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-03-29um: virtio: Implement VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONSJohannes Berg
Implement in-band notifications that are necessary for running vhost-user devices under externally synchronized time-travel mode (which is in a follow-up patch). This feature makes what usually should be eventfd notifications in-band messages. We'll prefer this feature, under the assumption that only a few (simulation) devices will ever support it, since it's not very efficient. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: time-travel: Rewrite as an event schedulerJohannes Berg
Instead of tracking all the various timer configurations, modify the time-travel mode to have an event scheduler and use a timer event on the scheduler to handle the different timer configurations. This doesn't change the function right now, but it prepares the code for having different kinds of events in the future (i.e. interrupts coming from other devices that are part of co-simulation.) While at it, also move time_travel_sleep() to time.c to reduce the externally visible API surface. Also, we really should mark time-travel as incompatible with SMP, even if UML doesn't support SMP yet. Finally, I noticed a bug while developing this - if we move time forward due to consuming time while reading the clock, we might move across the next event and that would cause us to go backward in time when we then handle that event. Fix that by invoking the whole event machine in this case, but in order to simplify this, make reading the clock only cost something when interrupts are not disabled. Otherwise, we'd have to hook into the interrupt delivery machinery etc. and that's somewhat intrusive. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: Move timer-internal.h to non-sharedJohannes Berg
This file isn't really shared, it's only used on the kernel side, not on the user side. Remove the include from the user-side and move the file to a better place. While at it, rename it to time-internal.h, it's not really just timers but all kinds of things related to timekeeping. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29hostfs: Use kasprintf() instead of fixed buffer formattingAndy Shevchenko
Improve readability and maintainability by replacing a hardcoded string allocation and formatting by the use of the kasprintf() helper. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: falloc.h needs to be directly included for older libcAlan Maguire
When building UML with glibc 2.17 installed, compilation of arch/um/os-Linux/file.c fails due to failure to find FALLOC_FL_PUNCH_HOLE and FALLOC_FL_KEEP_SIZE definitions. It appears that /usr/include/bits/fcntl-linux.h (indirectly included by /usr/include/fcntl.h) does not include falloc.h with an older glibc, whereas a more up-to-date version does. Adding the direct include to file.c resolves the issue and does not cause problems for more recent glibc. Fixes: 50109b5a03b4 ("um: Add support for DISCARD in the UBD Driver") Cc: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: ubd: Retry buffer read on any kind of errorGabriel Krisman Bertazi
Should bulk_req_safe_read return an error, we want to retry the read, otherwise, even though no IO will be done, os_write_file might still end up writing garbage to the pipe. Cc: Martyn Welch <martyn.welch@collabora.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: ubd: Prevent buffer overrun on command completionGabriel Krisman Bertazi
On the hypervisor side, when completing commands and the pipe is full, we retry writing only the entries that failed, by offsetting io_req_buffer, but we don't reduce the number of bytes written, which can cause a buffer overrun of io_req_buffer, and write garbage to the pipe. Cc: Martyn Welch <martyn.welch@collabora.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: Fix overlapping ELF segments when statically linkedDavid Gow
When statically linked, the .text section in UML kernels is not page aligned, causing it to share a page with the executable headers. As .text and the executable headers have different permissions, this causes the kernel to wish to map the same page twice (once as headers with r-- permissions, once as .text with r-x permissions), causing a segfault, and a nasty message printed to the host kernel's dmesg: "Uhuuh, elf segment at 0000000060000000 requested but the memory is mapped already" By aligning the .text to a page boundary (as in the dynamically linked version in dyn.lds.S), there is no such overlap, and the kernel runs correctly. Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: Delete never executed timerLeon Romanovsky
The "#ifdef undef" construction effectively disabled the timer. It causes to the fact that this timer did nothing, so delete it. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: Don't overwrite ethtool driver versionLeon Romanovsky
In-tree drivers don't need to manage internal version because they are aligned to the global Linux kernel version, which is reported by default with "ethtool -i". Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: Fix len of file in create_pid_fileWen Yang
sizeof gives us the size of the pointer variable, not of the area it points to. So the number of bytes copied by umid_file_name() is 8. We should pass in the correct length of the file buffer. Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: Don't use console_drivers directlyAndy Shevchenko
console_drivers is kind of (semi-)private variable to the console code. Direct use of it make us stuck with it being exported here and there. Reduce use of console_drivers by replacing it with for_each_console(). Cc: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: Cleanup CONFIG_IOSCHED_CFQKrzysztof Kozlowski
CONFIG_IOSCHED_CFQ is gone since commit f382fb0bcef4 ("block: remove legacy IO schedulers"). The IOSCHED_BFQ seems to replace IOSCHED_CFQ so select it in configs previously choosing the latter. Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29Merge tag 'irqchip-5.7' of ↵Thomas Gleixner
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core Pull irqchip updates from Marc Zyngier: - Second batch of the GICv4.1 support saga - Level triggered interrupt support for the stm32 controller - Versatile-fpga chained interrupt fixes - DT support for cascaded VIC interrupt controller - RPi irqchip initialization fixes - Multi-instance support for the Xilinx interrupt controller - Multi-instance support for the PLIC interrupt controller - CPU hotplug support for the PLIC interrupt controller - Ingenic X1000 TCU support - Small fixes all over the shop (GICv3, GICv4, Xilinx, Atmel, sa1111) - Cleanups (setup_irq removal, zero-length array removal)
2020-03-29rtc: imx-sc: Align imx sc msg structs to 4Leonard Crestez
The imx SC api strongly assumes that messages are composed out of 4-bytes words but some of our message structs have odd sizeofs. This produces many oopses with CONFIG_KASAN=y. Fix by marking with __aligned(4). Fixes: a3094fc1a15e ("rtc: imx-sc: add rtc alarm support") Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com> Link: https://lore.kernel.org/r/13404bac8360852d86c61fad5ae5f0c91ffc4cb6.1582216144.git.leonard.crestez@nxp.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2020-03-29rtc: fsl-ftm-alarm: report alarm to coreBiwen Li
Report interrupt state to the RTC core. Signed-off-by: Biwen Li <biwen.li@nxp.com> Link: https://lore.kernel.org/r/20200327084457.45161-1-biwen.li@nxp.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2020-03-29unicore32: Replace setup_irq() by request_irq()afzal mohammed
request_irq() is preferred over setup_irq(). Invocations of setup_irq() occur after memory allocators are ready. setup_irq() was required in older kernels as the memory allocator was not available during early boot. Hence replace setup_irq() by request_irq(). Signed-off-by: afzal mohammed <afzal.mohd.ma@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/82667ae23520611b2a9d8db77e1d8aeb982f08e5.1585320721.git.afzal.mohd.ma@gmail.com
2020-03-29sh: Replace setup_irq() by request_irq()afzal mohammed
request_irq() is preferred over setup_irq(). Invocations of setup_irq() occur after memory allocators are ready. setup_irq() was required in older kernels as the memory allocator was not available during early boot. Hence replace setup_irq() by request_irq(). Signed-off-by: afzal mohammed <afzal.mohd.ma@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/b060312689820559121ee0a6456bbc1202fb7ee5.1585320721.git.afzal.mohd.ma@gmail.com
2020-03-29hexagon: Replace setup_irq() by request_irq()afzal mohammed
request_irq() is preferred over setup_irq(). Invocations of setup_irq() occur after memory allocators are ready. setup_irq() was required in older kernels as the memory allocator was not available during early boot. Hence replace setup_irq() by request_irq(). Signed-off-by: afzal mohammed <afzal.mohd.ma@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/e84ac60de8f747d49ce082659e51595f708c29d4.1585320721.git.afzal.mohd.ma@gmail.com
2020-03-29c6x: Replace setup_irq() by request_irq()afzal mohammed
request_irq() is preferred over setup_irq(). Invocations of setup_irq() occur after memory allocators are ready. setup_irq() was required in older kernels as the memory allocator was not available during early boot. Hence replace setup_irq() by request_irq(). Signed-off-by: afzal mohammed <afzal.mohd.ma@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/56e991e920ce5806771fab892574cba89a3d413f.1585320721.git.afzal.mohd.ma@gmail.com
2020-03-29alpha: Replace setup_irq() by request_irq()afzal mohammed
request_irq() is preferred over setup_irq(). Invocations of setup_irq() occur after memory allocators are ready. setup_irq() was required in older kernels as the memory allocator was not available during early boot. Hence replace setup_irq() by request_irq(). Signed-off-by: afzal mohammed <afzal.mohd.ma@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Matt Turner <mattst88@gmail.com> Link: https://lkml.kernel.org/r/51f8ae7da9f47a23596388141933efa2bdef317b.1585320721.git.afzal.mohd.ma@gmail.com
2020-03-29Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge vm fixes from Andrew Morton: "5 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/sparse: fix kernel crash with pfn_section_valid check mm: fork: fix kernel_stack memcg stats for various stack implementations hugetlb_cgroup: fix illegal access to memory drivers/base/memory.c: indicate all memory blocks as removable mm/swapfile.c: move inode_lock out of claim_swapfile
2020-03-29Merge tag 'timers-urgent-2020-03-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix for the Hyper-V clocksource driver to make sched clock actually return nanoseconds and not the virtual clock value which increments at 10e7 HZ (100ns)" * tag 'timers-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/hyper-v: Make sched clock return nanoseconds correctly
2020-03-29Merge tag 'irq-urgent-2020-03-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Thomas Gleixner: "A single bugfix to prevent reference leaks in irq affinity notifiers" * tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Fix reference leaks on irq affinity notifiers
2020-03-29mm/sparse: fix kernel crash with pfn_section_valid checkAneesh Kumar K.V
Fix the crash like this: BUG: Kernel NULL pointer dereference on read at 0x00000000 Faulting instruction address: 0xc000000000c3447c Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1 ... NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0 LR [c000000000088354] vmemmap_free+0x144/0x320 Call Trace: section_deactivate+0x220/0x240 __remove_pages+0x118/0x170 arch_remove_memory+0x3c/0x150 memunmap_pages+0x1cc/0x2f0 devm_action_release+0x30/0x50 release_nodes+0x2f8/0x3e0 device_release_driver_internal+0x168/0x270 unbind_store+0x130/0x170 drv_attr_store+0x44/0x60 sysfs_kf_write+0x68/0x80 kernfs_fop_write+0x100/0x290 __vfs_write+0x3c/0x70 vfs_write+0xcc/0x240 ksys_write+0x7c/0x140 system_call+0x5c/0x68 The crash is due to NULL dereference at test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL in pfn_section_valid() With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after depopulate_section_mem(). This was done so that pfn_page() can work correctly with kernel config that disables SPARSEMEM_VMEMMAP. With that config pfn_to_page does __section_mem_map_addr(__sec) + __pfn; where static inline struct page *__section_mem_map_addr(struct mem_section *section) { unsigned long map = section->section_mem_map; map &= SECTION_MAP_MASK; return (struct page *)map; } Now with SPASEMEM_VMEMAP enabled, mem_section->usage->subsection_map is used to check the pfn validity (pfn_valid()). Since section_deactivate release mem_section->usage if a section is fully deactivated, pfn_valid() check after a subsection_deactivate cause a kernel crash. static inline int pfn_valid(unsigned long pfn) { ... return early_section(ms) || pfn_section_valid(ms, pfn); } where static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn) { int idx = subsection_map_index(pfn); return test_bit(idx, ms->usage->subsection_map); } Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed. For architectures like ppc64 where large pages are used for vmmemap mapping (16MB), a specific vmemmap mapping can cover multiple sections. Hence before a vmemmap mapping page can be freed, the kernel needs to make sure there are no valid sections within that mapping. Clearing the section valid bit before depopulate_section_memap enables this. [aneesh.kumar@linux.ibm.com: add comment] Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.comLink: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29mm: fork: fix kernel_stack memcg stats for various stack implementationsRoman Gushchin
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the space for task stacks can be allocated using __vmalloc_node_range(), alloc_pages_node() and kmem_cache_alloc_node(). In the first and the second cases page->mem_cgroup pointer is set, but in the third it's not: memcg membership of a slab page should be determined using the memcg_from_slab_page() function, which looks at page->slab_cache->memcg_params.memcg . In this case, using mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: page->mem_cgroup pointer is NULL even for pages charged to a non-root memory cgroup. It can lead to kernel_stack per-memcg counters permanently showing 0 on some architectures (depending on the configuration). In order to fix it, let's introduce a mod_memcg_obj_state() helper, which takes a pointer to a kernel object as a first argument, uses mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls mod_memcg_state(). It allows to handle all possible configurations (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without spilling any memcg/kmem specifics into fork.c . Note: This is a special version of the patch created for stable backports. It contains code from the following two patches: - mm: memcg/slab: introduce mem_cgroup_from_obj() - mm: fork: fix kernel_stack memcg stats for various stack implementations [guro@fb.com: introduce mem_cgroup_from_obj()] Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages") Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29hugetlb_cgroup: fix illegal access to memoryMina Almasry
This appears to be a mistake in commit faced7e0806cf ("mm: hugetlb controller for cgroups v2"). Essentially that commit does a hugetlb_cgroup_from_counter assuming that page_counter_try_charge has initialized counter. But if that has failed then it seems will not initialize counter, so hugetlb_cgroup_from_counter(counter) ends up pointing to random memory, causing kasan to complain. The solution is to simply use 'h_cg', instead of hugetlb_cgroup_from_counter(counter), since that is a reference to the hugetlb_cgroup anyway. After this change kasan ceases to complain. Fixes: faced7e0806cf ("mm: hugetlb controller for cgroups v2") Reported-by: syzbot+cac0c4e204952cf449b1@syzkaller.appspotmail.com Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Giuseppe Scrivano <gscrivan@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20200313223920.124230-1-almasrymina@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29drivers/base/memory.c: indicate all memory blocks as removableDavid Hildenbrand
We see multiple issues with the implementation/interface to compute whether a memory block can be offlined (exposed via /sys/devices/system/memory/memoryX/removable) and would like to simplify it (remove the implementation). 1. It runs basically lockless. While this might be good for performance, we see possible races with memory offlining that will require at least some sort of locking to fix. 2. Nowadays, more false positives are possible. No arch-specific checks are performed that validate if memory offlining will not be denied right away (and such check will require locking). For example, arm64 won't allow to offline any memory block that was added during boot - which will imply a very high error rate. Other archs have other constraints. 3. The interface is inherently racy. E.g., if a memory block is detected to be removable (and was not a false positive at that time), there is still no guarantee that offlining will actually succeed. So any caller already has to deal with false positives. 4. It is unclear which performance benefit this interface actually provides. The introducing commit 5c755e9fd813 ("memory-hotplug: add sysfs removable attribute for hotplug memory remove") mentioned "A user-level agent must be able to identify which sections of memory are likely to be removable before attempting the potentially expensive operation." However, no actual performance comparison was included. Known users: - lsmem: Will group memory blocks based on the "removable" property. [1] - chmem: Indirect user. It has a RANGE mode where one can specify removable ranges identified via lsmem to be offlined. However, it also has a "SIZE" mode, which allows a sysadmin to skip the manual "identify removable blocks" step. [2] - powerpc-utils: Uses the "removable" attribute to skip some memory blocks right away when trying to find some to offline+remove. However, with ballooning enabled, it already skips this information completely (because it once resulted in many false negatives). Therefore, the implementation can deal with false positives properly already. [3] According to Nathan Fontenot, DLPAR on powerpc is nowadays no longer driven from userspace via the drmgr command (powerpc-utils). Nowadays it's managed in the kernel - including onlining/offlining of memory blocks - triggered by drmgr writing to /sys/kernel/dlpar. So the affected legacy userspace handling is only active on old kernels. Only very old versions of drmgr on a new kernel (unlikely) might execute slower - totally acceptable. With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not break any user space tool. We implement a very bad heuristic now. Without CONFIG_MEMORY_HOTREMOVE we cannot offline anything, so report "not removable" as before. Original discussion can be found in [4] ("[PATCH RFC v1] mm: is_mem_section_removable() overhaul"). Other users of is_mem_section_removable() will be removed next, so that we can remove is_mem_section_removable() completely. [1] http://man7.org/linux/man-pages/man1/lsmem.1.html [2] http://man7.org/linux/man-pages/man8/chmem.8.html [3] https://github.com/ibm-power-utilities/powerpc-utils [4] https://lkml.kernel.org/r/20200117105759.27905-1-david@redhat.com Also, this patch probably fixes a crash reported by Steve. http://lkml.kernel.org/r/CAPcyv4jpdaNvJ67SkjyUJLBnBnXXQv686BiVW042g03FUmWLXw@mail.gmail.com Reported-by: "Scargall, Steve" <steve.scargall@intel.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Nathan Fontenot <ndfont@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Karel Zak <kzak@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200128093542.6908-1-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29mm/swapfile.c: move inode_lock out of claim_swapfileNaohiro Aota
claim_swapfile() currently keeps the inode locked when it is successful, or the file is already swapfile (with -EBUSY). And, on the other error cases, it does not lock the inode. This inconsistency of the lock state and return value is quite confusing and actually causing a bad unlock balance as below in the "bad_swap" section of __do_sys_swapon(). This commit fixes this issue by moving the inode_lock() and IS_SWAPFILE check out of claim_swapfile(). The inode is unlocked in "bad_swap_unlock_inode" section, so that the inode is ensured to be unlocked at "bad_swap". Thus, error handling codes after the locking now jumps to "bad_swap_unlock_inode" instead of "bad_swap". ===================================== WARNING: bad unlock balance detected! 5.5.0-rc7+ #176 Not tainted ------------------------------------- swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550 but there are no more locks to release! other info that might help us debug this: no locks held by swapon/4294. stack backtrace: CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176 Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014 Call Trace: dump_stack+0xa1/0xea print_unlock_imbalance_bug.cold+0x114/0x123 lock_release+0x562/0xed0 up_write+0x2d/0x490 __do_sys_swapon+0x94b/0x3550 __x64_sys_swapon+0x54/0x80 do_syscall_64+0xa4/0x4b0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f15da0a0dc7 Fixes: 1638045c3677 ("mm: set S_SWAPFILE on blockdev swap devices") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Qais Youef <qais.yousef@arm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29block: return NULL in blk_alloc_queue() on errorChaitanya Kulkarni
This patch fixes follwoing warning: block/blk-core.c: In function ‘blk_alloc_queue’: block/blk-core.c:558:10: warning: returning ‘int’ from a function with return type ‘struct request_queue *’ makes pointer from integer without a cast [-Wint-conversion] return -EINVAL; Fixes: 3d745ea5b095a ("block: simplify queue allocation") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-29netfilter: nf_queue: prefer nf_queue_entry_freeFlorian Westphal
Instead of dropping refs+kfree, use the helper added in previous patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29netfilter: nf_queue: do not release refcouts until nf_reinject is doneFlorian Westphal
nf_queue is problematic when another NF_QUEUE invocation happens from nf_reinject(). 1. nf_queue is invoked, increments state->sk refcount. 2. skb is queued, waiting for verdict. 3. sk is closed/released. 3. verdict comes back, nf_reinject is called. 4. nf_reinject drops the reference -- refcount can now drop to 0 Instead of get_ref/release_ref pattern, we need to nest the get_ref calls: get_ref get_ref release_ref release_ref So that when we invoke the next processing stage (another netfilter or the okfn()), we hold at least one reference count on the devices/socket. After previous patch, it is now safe to put the entry even after okfn() has potentially free'd the skb. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29netfilter: nf_queue: place bridge physports into queue_entry structFlorian Westphal
The refcount is done via entry->skb, which does work fine. Major problem: When putting the refcount of the bridge ports, we must always put the references while the skb is still around. However, we will need to put the references after okfn() to avoid a possible 1 -> 0 -> 1 refcount transition, so we cannot use the skb pointer anymore. Place the physports in the queue entry structure instead to allow for refcounting changes in the next patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29netfilter: nf_queue: make nf_queue_entry_release_refs staticFlorian Westphal
This is a preparation patch, no logical changes. Move free_entry into core and rename it to something more sensible. Will ease followup patches which will complicate the refcount handling. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29IB/qib: Delete struct qib_ivdev.qp_rndGeorge Spelvin
I was checking the field to see if it needed the full get_random_bytes() and discovered it's unused. Only compile-tested, as I don't have the hardware, but I'm still pretty confident. Link: https://lore.kernel.org/r/202003281643.02SGh6eG002694@sdf.org Signed-off-by: George Spelvin <lkml@sdf.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-29RDMA/hns: Fix uninitialized variable bugGustavo A. R. Silva
There is a potential execution path in which variable *ret* is returned without being properly initialized, previously. Fix this by initializing variable *ret* to 0. Link: https://lore.kernel.org/r/20200328023539.GA32016@embeddedor Addresses-Coverity-ID: 1491917 ("Uninitialized scalar variable") Fixes: 2f49de21f3e9 ("RDMA/hns: Optimize mhop get flow for multi-hop addressing") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Weihang Li <liweihang@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-29RDMA/hns: Modify the mask of QP number for CQE of hip08Lang Cheng
The hip08 supports up to 1M QPs, so the qpn mask of cqe should be modified. Link: https://lore.kernel.org/r/1585194018-4381-4-git-send-email-liweihang@huawei.com Signed-off-by: Lang Cheng <chenglang@huawei.com> Signed-off-by: Weihang Li <liweihang@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-29RDMA/hns: Reduce the maximum number of extend SGE per WQELang Cheng
Just reduce the default number to 64 for backward compatibility, the driver can still get this configuration from the firmware. Link: https://lore.kernel.org/r/1585194018-4381-3-git-send-email-liweihang@huawei.com Signed-off-by: Lang Cheng <chenglang@huawei.com> Signed-off-by: Weihang Li <liweihang@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-29RDMA/hns: Reduce PFC frames in congestion scenariosJihua Tao
The original value means sending 16 packets at a time, and it should be configured to 0 which means sending 1 packet instead. It is modified to reduce the number of PFC frames to make sure the performance meets expectations when flow control is enabled on hip08. Link: https://lore.kernel.org/r/1585194018-4381-2-git-send-email-liweihang@huawei.com Signed-off-by: Jihua Tao <taojihua4@huawei.com> Signed-off-by: Weihang Li <liweihang@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-29kbuild: add outputmakefile to no-dot-config-targetsDavid Engraf
The target outputmakefile is used to generate a Makefile for out-of-tree builds and does not depend on the kernel configuration. Signed-off-by: David Engraf <david.engraf@sysgo.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-03-29kbuild: remove AS variableMasahiro Yamada
As commit 5ef872636ca7 ("kbuild: get rid of misleading $(AS) from documents") noted, we rarely use $(AS) directly in the kernel build. Now that the only/last user of $(AS) in drivers/net/wan/Makefile was converted to $(CC), $(AS) is no longer used in the build process. You can still pass in AS=clang, which is just a switch to turn on the LLVM integrated assembler. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
2020-03-29net: wan: wanxl: refactor the firmware rebuild ruleMasahiro Yamada
Split the big recipe into 3 stages: compile, link, and hexdump. After this commit, the build log with CONFIG_WANXL_BUILD_FIRMWARE will look like this: M68KAS drivers/net/wan/wanxlfw.o M68KLD drivers/net/wan/wanxlfw.bin BLDFW drivers/net/wan/wanxlfw.inc CC [M] drivers/net/wan/wanxl.o Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-03-29net: wan: wanxl: use $(M68KCC) instead of $(M68KAS) for rebuilding firmwareMasahiro Yamada
The firmware source, wanxlfw.S, is currently compiled by the combo of $(CPP) and $(M68KAS). This is not what we usually do for compiling *.S files. In fact, this Makefile is the only user of $(AS) in the kernel build. Instead of combining $(CPP) and (AS) from different tool sets, using $(M68KCC) as an assembler driver is simpler, and saner. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-03-29net: wan: wanxl: use allow to pass CROSS_COMPILE_M68k for rebuilding firmwareMasahiro Yamada
As far as I understood from the Kconfig help text, this build rule is used to rebuild the driver firmware, which runs on an old m68k-based chip. So, you need m68k tools for the firmware rebuild. wanxl.c is a PCI driver, but CONFIG_M68K does not select CONFIG_HAVE_PCI. So, you cannot enable CONFIG_WANXL_BUILD_FIRMWARE for ARCH=m68k. In other words, ifeq ($(ARCH),m68k) is false here. I am keeping the dead code for now, but rebuilding the firmware requires 'as68k' and 'ld68k', which I do not have in hand. Instead, the kernel.org m68k GCC [1] successfully built it. Allowing a user to pass in CROSS_COMPILE_M68K= is handier. [1] https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/9.2.0/x86_64-gcc-9.2.0-nolibc-m68k-linux.tar.xz Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-03-29kbuild: add comment about grouped targetMasahiro Yamada
GNU Make commit 8c888d95f618 ("[SV 8297] Implement "grouped targets" for explicit rules.") added the '&:' syntax. I think '&:' is a perfect fit here, but we cannot use it any time soon. Just add a TODO comment. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-03-29kbuild: add -Wall to KBUILD_HOSTCXXFLAGSMasahiro Yamada
Add -Wall to catch more warnings for C++ host programs. When I submitted the previous version, the 0-day bot reported -Wc++11-compat warnings for old GCC: HOSTCXX -fPIC scripts/gcc-plugins/latent_entropy_plugin.o In file included from /usr/lib/gcc/x86_64-linux-gnu/4.8/plugin/include/tm.h:28:0, from scripts/gcc-plugins/gcc-common.h:15, from scripts/gcc-plugins/latent_entropy_plugin.c:78: /usr/lib/gcc/x86_64-linux-gnu/4.8/plugin/include/config/elfos.h:102:21: warning: C++11 requires a space between string literal and macro [-Wc++11-compat] fprintf ((FILE), "%s"HOST_WIDE_INT_PRINT_UNSIGNED"\n",\ ^ /usr/lib/gcc/x86_64-linux-gnu/4.8/plugin/include/config/elfos.h:170:24: warning: C++11 requires a space between string literal and macro [-Wc++11-compat] fprintf ((FILE), ","HOST_WIDE_INT_PRINT_UNSIGNED",%u\n", \ ^ In file included from /usr/lib/gcc/x86_64-linux-gnu/4.8/plugin/include/tm.h:42:0, from scripts/gcc-plugins/gcc-common.h:15, from scripts/gcc-plugins/latent_entropy_plugin.c:78: /usr/lib/gcc/x86_64-linux-gnu/4.8/plugin/include/defaults.h:126:24: warning: C++11 requires a space between string literal and macro [-Wc++11-compat] fprintf ((FILE), ","HOST_WIDE_INT_PRINT_UNSIGNED",%u\n", \ ^ The source of the warnings is in the plugin headers, so we have no control of it. I just suppressed them by adding -Wno-c++11-compat to scripts/gcc-plugins/Makefile. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Kees Cook <keescook@chromium.org>
2020-03-29kconfig: remove unused variable in qconf.ccMasahiro Yamada
If this file were compiled with -Wall, the following warning would be reported: scripts/kconfig/qconf.cc:312:6: warning: unused variable ‘i’ [-Wunused-variable] int i; ^ The commit prepares to turn on -Wall for C++ host programs. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org>
2020-03-29KEYS: Avoid false positive ENOMEM error on key readWaiman Long
By allocating a kernel buffer with a user-supplied buffer length, it is possible that a false positive ENOMEM error may be returned because the user-supplied length is just too large even if the system do have enough memory to hold the actual key data. Moreover, if the buffer length is larger than the maximum amount of memory that can be returned by kmalloc() (2^(MAX_ORDER-1) number of pages), a warning message will also be printed. To reduce this possibility, we set a threshold (PAGE_SIZE) over which we do check the actual key length first before allocating a buffer of the right size to hold it. The threshold is arbitrary, it is just used to trigger a buffer length check. It does not limit the actual key length as long as there is enough memory to satisfy the memory request. To further avoid large buffer allocation failure due to page fragmentation, kvmalloc() is used to allocate the buffer so that vmapped pages can be used when there is not a large enough contiguous set of pages available for allocation. In the extremely unlikely scenario that the key keeps on being changed and made longer (still <= buflen) in between 2 __keyctl_read_key() calls, the __keyctl_read_key() calling loop in keyctl_read_key() may have to be iterated a large number of times, but definitely not infinite. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-29KEYS: Don't write out to userspace while holding key semaphoreWaiman Long
A lockdep circular locking dependency report was seen when running a keyutils test: [12537.027242] ====================================================== [12537.059309] WARNING: possible circular locking dependency detected [12537.088148] 4.18.0-147.7.1.el8_1.x86_64+debug #1 Tainted: G OE --------- - - [12537.125253] ------------------------------------------------------ [12537.153189] keyctl/25598 is trying to acquire lock: [12537.175087] 000000007c39f96c (&mm->mmap_sem){++++}, at: __might_fault+0xc4/0x1b0 [12537.208365] [12537.208365] but task is already holding lock: [12537.234507] 000000003de5b58d (&type->lock_class){++++}, at: keyctl_read_key+0x15a/0x220 [12537.270476] [12537.270476] which lock already depends on the new lock. [12537.270476] [12537.307209] [12537.307209] the existing dependency chain (in reverse order) is: [12537.340754] [12537.340754] -> #3 (&type->lock_class){++++}: [12537.367434] down_write+0x4d/0x110 [12537.385202] __key_link_begin+0x87/0x280 [12537.405232] request_key_and_link+0x483/0xf70 [12537.427221] request_key+0x3c/0x80 [12537.444839] dns_query+0x1db/0x5a5 [dns_resolver] [12537.468445] dns_resolve_server_name_to_ip+0x1e1/0x4d0 [cifs] [12537.496731] cifs_reconnect+0xe04/0x2500 [cifs] [12537.519418] cifs_readv_from_socket+0x461/0x690 [cifs] [12537.546263] cifs_read_from_socket+0xa0/0xe0 [cifs] [12537.573551] cifs_demultiplex_thread+0x311/0x2db0 [cifs] [12537.601045] kthread+0x30c/0x3d0 [12537.617906] ret_from_fork+0x3a/0x50 [12537.636225] [12537.636225] -> #2 (root_key_user.cons_lock){+.+.}: [12537.664525] __mutex_lock+0x105/0x11f0 [12537.683734] request_key_and_link+0x35a/0xf70 [12537.705640] request_key+0x3c/0x80 [12537.723304] dns_query+0x1db/0x5a5 [dns_resolver] [12537.746773] dns_resolve_server_name_to_ip+0x1e1/0x4d0 [cifs] [12537.775607] cifs_reconnect+0xe04/0x2500 [cifs] [12537.798322] cifs_readv_from_socket+0x461/0x690 [cifs] [12537.823369] cifs_read_from_socket+0xa0/0xe0 [cifs] [12537.847262] cifs_demultiplex_thread+0x311/0x2db0 [cifs] [12537.873477] kthread+0x30c/0x3d0 [12537.890281] ret_from_fork+0x3a/0x50 [12537.908649] [12537.908649] -> #1 (&tcp_ses->srv_mutex){+.+.}: [12537.935225] __mutex_lock+0x105/0x11f0 [12537.954450] cifs_call_async+0x102/0x7f0 [cifs] [12537.977250] smb2_async_readv+0x6c3/0xc90 [cifs] [12538.000659] cifs_readpages+0x120a/0x1e50 [cifs] [12538.023920] read_pages+0xf5/0x560 [12538.041583] __do_page_cache_readahead+0x41d/0x4b0 [12538.067047] ondemand_readahead+0x44c/0xc10 [12538.092069] filemap_fault+0xec1/0x1830 [12538.111637] __do_fault+0x82/0x260 [12538.129216] do_fault+0x419/0xfb0 [12538.146390] __handle_mm_fault+0x862/0xdf0 [12538.167408] handle_mm_fault+0x154/0x550 [12538.187401] __do_page_fault+0x42f/0xa60 [12538.207395] do_page_fault+0x38/0x5e0 [12538.225777] page_fault+0x1e/0x30 [12538.243010] [12538.243010] -> #0 (&mm->mmap_sem){++++}: [12538.267875] lock_acquire+0x14c/0x420 [12538.286848] __might_fault+0x119/0x1b0 [12538.306006] keyring_read_iterator+0x7e/0x170 [12538.327936] assoc_array_subtree_iterate+0x97/0x280 [12538.352154] keyring_read+0xe9/0x110 [12538.370558] keyctl_read_key+0x1b9/0x220 [12538.391470] do_syscall_64+0xa5/0x4b0 [12538.410511] entry_SYSCALL_64_after_hwframe+0x6a/0xdf [12538.435535] [12538.435535] other info that might help us debug this: [12538.435535] [12538.472829] Chain exists of: [12538.472829] &mm->mmap_sem --> root_key_user.cons_lock --> &type->lock_class [12538.472829] [12538.524820] Possible unsafe locking scenario: [12538.524820] [12538.551431] CPU0 CPU1 [12538.572654] ---- ---- [12538.595865] lock(&type->lock_class); [12538.613737] lock(root_key_user.cons_lock); [12538.644234] lock(&type->lock_class); [12538.672410] lock(&mm->mmap_sem); [12538.687758] [12538.687758] *** DEADLOCK *** [12538.687758] [12538.714455] 1 lock held by keyctl/25598: [12538.732097] #0: 000000003de5b58d (&type->lock_class){++++}, at: keyctl_read_key+0x15a/0x220 [12538.770573] [12538.770573] stack backtrace: [12538.790136] CPU: 2 PID: 25598 Comm: keyctl Kdump: loaded Tainted: G [12538.844855] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015 [12538.881963] Call Trace: [12538.892897] dump_stack+0x9a/0xf0 [12538.907908] print_circular_bug.isra.25.cold.50+0x1bc/0x279 [12538.932891] ? save_trace+0xd6/0x250 [12538.948979] check_prev_add.constprop.32+0xc36/0x14f0 [12538.971643] ? keyring_compare_object+0x104/0x190 [12538.992738] ? check_usage+0x550/0x550 [12539.009845] ? sched_clock+0x5/0x10 [12539.025484] ? sched_clock_cpu+0x18/0x1e0 [12539.043555] __lock_acquire+0x1f12/0x38d0 [12539.061551] ? trace_hardirqs_on+0x10/0x10 [12539.080554] lock_acquire+0x14c/0x420 [12539.100330] ? __might_fault+0xc4/0x1b0 [12539.119079] __might_fault+0x119/0x1b0 [12539.135869] ? __might_fault+0xc4/0x1b0 [12539.153234] keyring_read_iterator+0x7e/0x170 [12539.172787] ? keyring_read+0x110/0x110 [12539.190059] assoc_array_subtree_iterate+0x97/0x280 [12539.211526] keyring_read+0xe9/0x110 [12539.227561] ? keyring_gc_check_iterator+0xc0/0xc0 [12539.249076] keyctl_read_key+0x1b9/0x220 [12539.266660] do_syscall_64+0xa5/0x4b0 [12539.283091] entry_SYSCALL_64_after_hwframe+0x6a/0xdf One way to prevent this deadlock scenario from happening is to not allow writing to userspace while holding the key semaphore. Instead, an internal buffer is allocated for getting the keys out from the read method first before copying them out to userspace without holding the lock. That requires taking out the __user modifier from all the relevant read methods as well as additional changes to not use any userspace write helpers. That is, 1) The put_user() call is replaced by a direct copy. 2) The copy_to_user() call is replaced by memcpy(). 3) All the fault handling code is removed. Compiling on a x86-64 system, the size of the rxrpc_read() function is reduced from 3795 bytes to 2384 bytes with this patch. Fixes: ^1da177e4c3f4 ("Linux-2.6.12-rc2") Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-29efi/libstub/arm: Fix spurious message that an initrd was loadedArd Biesheuvel
Commit: ec93fc371f014a6f ("efi/libstub: Add support for loading the initrd from a device path") added a diagnostic print to the ARM version of the EFI stub that reports whether an initrd has been loaded that was passed via the command line using initrd=. However, it failed to take into account that, for historical reasons, the file loading routines return EFI_SUCCESS when no file was found, and the only way to decide whether a file was loaded is to inspect the 'size' argument that is passed by reference. So let's inspect this returned size, to prevent the print from being emitted even if no initrd was loaded at all. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-efi@vger.kernel.org