summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2019-07-12mm: security: introduce init_on_alloc=1 and init_on_free=1 boot optionsAlexander Potapenko
Patch series "add init_on_alloc/init_on_free boot options", v10. Provide init_on_alloc and init_on_free boot options. These are aimed at preventing possible information leaks and making the control-flow bugs that depend on uninitialized values more deterministic. Enabling either of the options guarantees that the memory returned by the page allocator and SL[AU]B is initialized with zeroes. SLOB allocator isn't supported at the moment, as its emulation of kmem caches complicates handling of SLAB_TYPESAFE_BY_RCU caches correctly. Enabling init_on_free also guarantees that pages and heap objects are initialized right after they're freed, so it won't be possible to access stale data by using a dangling pointer. As suggested by Michal Hocko, right now we don't let the heap users to disable initialization for certain allocations. There's not enough evidence that doing so can speed up real-life cases, and introducing ways to opt-out may result in things going out of control. This patch (of 2): The new options are needed to prevent possible information leaks and make control-flow bugs that depend on uninitialized values more deterministic. This is expected to be on-by-default on Android and Chrome OS. And it gives the opportunity for anyone else to use it under distros too via the boot args. (The init_on_free feature is regularly requested by folks where memory forensics is included in their threat models.) init_on_alloc=1 makes the kernel initialize newly allocated pages and heap objects with zeroes. Initialization is done at allocation time at the places where checks for __GFP_ZERO are performed. init_on_free=1 makes the kernel initialize freed pages and heap objects with zeroes upon their deletion. This helps to ensure sensitive data doesn't leak via use-after-free accesses. Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator returns zeroed memory. The two exceptions are slab caches with constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never zero-initialized to preserve their semantics. Both init_on_alloc and init_on_free default to zero, but those defaults can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and CONFIG_INIT_ON_FREE_DEFAULT_ON. If either SLUB poisoning or page poisoning is enabled, those options take precedence over init_on_alloc and init_on_free: initialization is only applied to unpoisoned allocations. Slowdown for the new features compared to init_on_free=0, init_on_alloc=0: hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%) hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%) Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%) Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%) Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%) Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%) The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline is within the standard error. The new features are also going to pave the way for hardware memory tagging (e.g. arm64's MTE), which will require both on_alloc and on_free hooks to set the tags for heap objects. With MTE, tagging will have the same cost as memory initialization. Although init_on_free is rather costly, there are paranoid use-cases where in-memory data lifetime is desired to be minimized. There are various arguments for/against the realism of the associated threat models, but given that we'll need the infrastructure for MTE anyway, and there are people who want wipe-on-free behavior no matter what the performance cost, it seems reasonable to include it in this series. [glider@google.com: v8] Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com [glider@google.com: v9] Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com [glider@google.com: v10] Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts Acked-by: James Morris <jamorris@linux.microsoft.com>] Cc: Christoph Lameter <cl@linux.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Sandeep Patil <sspatil@android.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/pgtable: drop pgtable_t variable from pte_fn_t functionsAnshuman Khandual
Drop the pgtable_t variable from all implementation for pte_fn_t as none of them use it. apply_to_pte_range() should stop computing it as well. Should help us save some cycles. Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12asm-generic, x86: introduce generic pte_{alloc,free}_one[_kernel]Mike Rapoport
Most architectures have identical or very similar implementation of pte_alloc_one_kernel(), pte_alloc_one(), pte_free_kernel() and pte_free(). Add a generic implementation that can be reused across architectures and enable its use on x86. The generic implementation uses GFP_KERNEL | __GFP_ZERO for the kernel page tables and GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT for the user page tables. The "base" functions for PTE allocation, namely __pte_alloc_one_kernel() and __pte_alloc_one() are intended for the architectures that require additional actions after actual memory allocation or must use non-default GFP flags. x86 is switched to use generic pte_alloc_one_kernel(), pte_free_kernel() and pte_free(). x86 still implements pte_alloc_one() to allow run-time control of GFP flags required for "userpte" command line option. Link: http://lkml.kernel.org/r/1557296232-15361-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> Cc: Helge Deller <deller@gmx.de> Cc: Ley Foon Tan <lftan@altera.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Sam Creasey <sammy@sammy.net> Cc: Vincent Chen <deanbo422@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: move the powerpc hugepd code to mm/gup.cChristoph Hellwig
While only powerpc supports the hugepd case, the code is pretty generic and I'd like to keep all GUP internals in one place. Link: http://lkml.kernel.org/r/20190625143715.1689-15-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, memcg: add a memcg_slabinfo debugfs fileWaiman Long
There are concerns about memory leaks from extensive use of memory cgroups as each memory cgroup creates its own set of kmem caches. There is a possiblity that the memcg kmem caches may remain even after the memory cgroups have been offlined. Therefore, it will be useful to show the status of each of memcg kmem caches. This patch introduces a new <debugfs>/memcg_slabinfo file which is somewhat similar to /proc/slabinfo in format, but lists only information about kmem caches that have child memcg kmem caches. Information available in /proc/slabinfo are not repeated in memcg_slabinfo. A portion of a sample output of the file was: # <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs> rpc_inode_cache root 13 51 1 1 rpc_inode_cache 48 0 0 0 0 fat_inode_cache root 1 45 1 1 fat_inode_cache 41 2 45 1 1 xfs_inode root 770 816 24 24 xfs_inode 92 22 34 1 1 xfs_inode 88:dead 1 34 1 1 xfs_inode 89:dead 23 34 1 1 xfs_inode 85 4 34 1 1 xfs_inode 84 9 34 1 1 The css id of the memcg is also listed. If a memcg is not online, the tag ":dead" will be attached as shown above. [longman@redhat.com: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo] Link: http://lkml.kernel.org/r/20190621173005.31514-1-longman@redhat.com [longman@redhat.com: set the flag in the common code as suggested by Roman] Link: http://lkml.kernel.org/r/20190627184324.5875-1-longman@redhat.com Link: http://lkml.kernel.org/r/20190619171621.26209-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: memcg/slab: reparent memcg kmem_caches on cgroup removalRoman Gushchin
Let's reparent non-root kmem_caches on memcg offlining. This allows us to release the memory cgroup without waiting for the last outstanding kernel object (e.g. dentry used by another application). Since the parent cgroup is already charged, everything we need to do is to splice the list of kmem_caches to the parent's kmem_caches list, swap the memcg pointer, drop the css refcounter for each kmem_cache and adjust the parent's css refcounter. Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer anymore. It's safe to read it under rcu_read_lock(), cgroup_mutex held, or any other way that protects the memory cgroup from being released. We can race with the slab allocation and deallocation paths. It's not a big problem: parent's charge and slab global stats are always correct, and we don't care anymore about the child usage and global stats. The child cgroup is already offline, so we don't use or show it anywhere. Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't used anywhere except count_shadow_nodes(). But even there it won't break anything: after reparenting "nodes" will be 0 on child level (because we're already reparenting shrinker lists), and on parent level page stats always were 0, and this patch won't change anything. [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup] Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Waiman Long <longman@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: memcg/slab: rework non-root kmem_cache lifecycle managementRoman Gushchin
Currently each charged slab page holds a reference to the cgroup to which it's charged. Kmem_caches are held by the memcg and are released all together with the memory cgroup. It means that none of kmem_caches are released unless at least one reference to the memcg exists, which is very far from optimal. Let's rework it in a way that allows releasing individual kmem_caches as soon as the cgroup is offline, the kmem_cache is empty and there are no pending allocations. To make it possible, let's introduce a new percpu refcounter for non-root kmem caches. The counter is initialized to the percpu mode, and is switched to the atomic mode during kmem_cache deactivation. The counter is bumped for every charged page and also for every running allocation. So the kmem_cache can't be released unless all allocations complete. To shutdown non-active empty kmem_caches, let's reuse the work queue, previously used for the kmem_cache deactivation. Once the reference counter reaches 0, let's schedule an asynchronous kmem_cache release. * I used the following simple approach to test the performance (stolen from another patchset by T. Harding): time find / -name fname-no-exist echo 2 > /proc/sys/vm/drop_caches repeat 10 times Results: orig patched real 0m1.455s real 0m1.355s user 0m0.206s user 0m0.219s sys 0m0.855s sys 0m0.807s real 0m1.487s real 0m1.699s user 0m0.221s user 0m0.256s sys 0m0.806s sys 0m0.948s real 0m1.515s real 0m1.505s user 0m0.183s user 0m0.215s sys 0m0.876s sys 0m0.858s real 0m1.291s real 0m1.380s user 0m0.193s user 0m0.198s sys 0m0.843s sys 0m0.786s real 0m1.364s real 0m1.374s user 0m0.180s user 0m0.182s sys 0m0.868s sys 0m0.806s real 0m1.352s real 0m1.312s user 0m0.201s user 0m0.212s sys 0m0.820s sys 0m0.761s real 0m1.302s real 0m1.349s user 0m0.205s user 0m0.203s sys 0m0.803s sys 0m0.792s real 0m1.334s real 0m1.301s user 0m0.194s user 0m0.201s sys 0m0.806s sys 0m0.779s real 0m1.426s real 0m1.434s user 0m0.216s user 0m0.181s sys 0m0.824s sys 0m0.864s real 0m1.350s real 0m1.295s user 0m0.200s user 0m0.190s sys 0m0.842s sys 0m0.811s So it looks like the difference is not noticeable in this test. [cai@lca.pw: fix an use-after-free in kmemcg_workfn()] Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Andrei Vagin <avagin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg()Roman Gushchin
Let's separate the page counter modification code out of __memcg_kmem_uncharge() in a way similar to what __memcg_kmem_charge() and __memcg_kmem_charge_memcg() work. This will allow to reuse this code later using a new memcg_kmem_uncharge_memcg() wrapper, which calls __memcg_kmem_uncharge_memcg() if memcg_kmem_enabled() check is passed. Link: http://lkml.kernel.org/r/20190611231813.3148843-5-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Waiman Long <longman@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: memcg/slab: rename slab delayed deactivation functions and fieldsRoman Gushchin
The delayed work/rcu deactivation infrastructure of non-root kmem_caches can be also used for asynchronous release of these objects. Let's get rid of the word "deactivation" in corresponding names to make the code look better after generalization. It's easier to make the renaming first, so that the generalized code will look consistent from scratch. Let's rename struct memcg_cache_params fields: deact_fn -> work_fn deact_rcu_head -> rcu_head deact_work -> work And RCU/delayed work callbacks in slab common code: kmemcg_deactivate_rcufn -> kmemcg_rcufn kmemcg_deactivate_workfn -> kmemcg_workfn This patch contains no functional changes, only renamings. Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Waiman Long <longman@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, memcg: introduce memory.events.localShakeel Butt
The memory controller in cgroup v2 exposes memory.events file for each memcg which shows the number of times events like low, high, max, oom and oom_kill have happened for the whole tree rooted at that memcg. Users can also poll or register notification to monitor the changes in that file. Any event at any level of the tree rooted at memcg will notify all the listeners along the path till root_mem_cgroup. There are existing users which depend on this behavior. However there are users which are only interested in the events happening at a specific level of the memcg tree and not in the events in the underlying tree rooted at that memcg. One such use-case is a centralized resource monitor which can dynamically adjust the limits of the jobs running on a system. The jobs can create their sub-hierarchy for their own sub-tasks. The centralized monitor is only interested in the events at the top level memcgs of the jobs as it can then act and adjust the limits of the jobs. Using the current memory.events for such centralized monitor is very inconvenient. The monitor will keep receiving events which it is not interested and to find if the received event is interesting, it has to read memory.event files of the next level and compare it with the top level one. So, let's introduce memory.events.local to the memcg which shows and notify for the events at the memcg level. Now, does memory.stat and memory.pressure need their local versions. IMHO no due to the no internal process contraint of the cgroup v2. The memory.stat file of the top level memcg of a job shows the stats and vmevents of the whole tree. The local stats or vmevents of the top level memcg will only change if there is a process running in that memcg but v2 does not allow that. Similarly for memory.pressure there will not be any process in the internal nodes and thus no chance of local pressure. Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, swap: use rbtree for swap_extentAaron Lu
swap_extent is used to map swap page offset to backing device's block offset. For a continuous block range, one swap_extent is used and all these swap_extents are managed in a linked list. These swap_extents are used by map_swap_entry() during swap's read and write path. To find out the backing device's block offset for a page offset, the swap_extent list will be traversed linearly, with curr_swap_extent being used as a cache to speed up the search. This works well as long as swap_extents are not huge or when the number of processes that access swap device are few, but when the swap device has many extents and there are a number of processes accessing the swap device concurrently, it can be a problem. On one of our servers, the disk's remaining size is tight: $df -h Filesystem Size Used Avail Use% Mounted on ... ... /dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4 When creating a 80G swapfile there, there are as many as 84656 swap extents. The end result is, kernel spends abou 30% time in map_swap_entry() and swap throughput is only 70MB/s. As a comparison, when I used smaller sized swapfile, like 4G whose swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and map_swap_entry() is about 3%. One downside of using rbtree for swap_extent is, 'struct rbtree' takes 24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more for each swap_extent. For a swapfile that has 80k swap_extents, that means 625KiB more memory consumed. Test: Since it's not possible to reboot that server, I can not test this patch diretly there. Instead, I tested it on another server with NVMe disk. I created a 20G swapfile on an NVMe backed XFS fs. By default, the filesystem is quite clean and the created swapfile has only 2 extents. Testing vanilla and this patch shows no obvious performance difference when swapfile is not fragmented. To see the patch's effects, I used some tweaks to manually fragment the swapfile by breaking the extent at 1M boundary. This made the swapfile have 20K extents. nr_task=4 kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf) vanilla 165191 90.77% 171798 90.21% patched 858993 +420% 2.16% 715827 +317% 0.77% nr_task=8 kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf) vanilla 306783 92.19% 318145 87.76% patched 954437 +211% 2.35% 1073741 +237% 1.57% swapout: the throughput of swap out, in KB/s, higher is better 1st map_swap_entry: cpu cycles percent sampled by perf swapin: the throughput of swap in, in KB/s, higher is better. 2nd map_swap_entry: cpu cycles percent sampled by perf nr_task=1 doesn't show any difference, this is due to the curr_swap_extent can be effectively used to cache the correct swap extent for single task workload. [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/] Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, swap: fix race between swapoff and some swap operationsHuang Ying
When swapin is performed, after getting the swap entry information from the page table, system will swap in the swap entry, without any lock held to prevent the swap device from being swapoff. This may cause the race like below, CPU 1 CPU 2 ----- ----- do_swap_page swapin_readahead __read_swap_cache_async swapoff swapcache_prepare p->swap_map = NULL __swap_duplicate p->swap_map[?] /* !!! NULL pointer access */ Because swapoff is usually done when system shutdown only, the race may not hit many people in practice. But it is still a race need to be fixed. To fix the race, get_swap_device() is added to check whether the specified swap entry is valid in its swap device. If so, it will keep the swap entry valid via preventing the swap device from being swapoff, until put_swap_device() is called. Because swapoff() is very rare code path, to make the normal path runs as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of reference count is used to implement get/put_swap_device(). >From get_swap_device() to put_swap_device(), RCU reader side is locked, so synchronize_rcu() in swapoff() will wait until put_swap_device() is called. In addition to swap_map, cluster_info, etc. data structure in the struct swap_info_struct, the swap cache radix tree will be freed after swapoff, so this patch fixes the race between swap cache looking up and swapoff too. Races between some other swap cache usages and swapoff are fixed too via calling synchronize_rcu() between clearing PageSwapCache() and freeing swap cache data structure. Another possible method to fix this is to use preempt_off() + stop_machine() to prevent the swap device from being swapoff when its data structure is being accessed. The overhead in hot-path of both methods is similar. The advantages of RCU based method are, 1. stop_machine() may disturb the normal execution code path on other CPUs. 2. File cache uses RCU to protect its radix tree. If the similar mechanism is used for swap cache too, it is easier to share code between them. 3. RCU is used to protect swap cache in total_swapcache_pages() and exit_swap_address_space() already. The two mechanisms can be merged to simplify the logic. Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com Fixes: 235b62176712 ("mm/swap: add cluster lock") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Not-nacked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/filemap: don't cast ->readpage to filler_t for do_read_cache_pageChristoph Hellwig
We can just pass a NULL filler and do the right thing inside of do_read_cache_page based on the NULL parameter. Link: http://lkml.kernel.org/r/20190520055731.24538-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, debug_pagealloc: use a page type instead of page_ext flagVlastimil Babka
When debug_pagealloc is enabled, we currently allocate the page_ext array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag. Now that we have the page_type field in struct page, we can use that instead, as guard pages are neither PageSlab nor mapped to userspace. This reduces memory overhead when debug_pagealloc is enabled and there are no other features requiring the page_ext array. Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, debug_pagelloc: use static keys to enable debuggingVlastimil Babka
Patch series "debug_pagealloc improvements". I have been recently debugging some pcplist corruptions, where it would be useful to perform struct page checks immediately as pages are allocated from and freed to pcplists, which is now only possible by rebuilding the kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog). To make this kind of debugging simpler in future on a distro kernel, I have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead when not enabled at boot time (Patch 1) and also when enabled (Patch 3), and extended it to perform the struct page checks more often when enabled (Patch 2). Now it can be configured in when building a distro kernel without extra overhead, and debugging page use after free or double free can be enabled simply by rebooting with debug_pagealloc=on. This patch (of 3): CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to allow being always enabled in a distro kernel, but only perform its expensive functionality when booted with debug_pagelloc=on. We can further reduce the overhead when not boot-enabled (including page allocator fast paths) using static keys. This patch introduces one for debug_pagealloc core functionality, and another for the optional guard page functionality (enabled by booting with debug_guardpage_minorder=X). Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12include/linux/pagemap.h: document trylock_page() return valueAndrew Morton
Cc: Henry Burns <henryburns@google.com> Cc: Jonathan Adams <jwadams@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Xidong Wang <wangxidong_97@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12include/linux/vmpressure.h: use spinlock_t instead of struct spinlockSebastian Andrzej Siewior
For spinlocks the type spinlock_t should be used instead of "struct spinlock". Use spinlock_t for spinlock's definition. Link: http://lkml.kernel.org/r/20190704153803.12739-3-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/page_isolation.c: change the prototype of undo_isolate_page_range()Pingfan Liu
undo_isolate_page_range() never fails, so no need to return value. Link: http://lkml.kernel.org/r/1562075604-8979-1-git-send-email-kernelfans@gmail.com Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12include/linux/mm_types.h: ifdef struct vm_area_struct::swap_readahead_infoAlexey Dobriyan
The field is only used in swap code. Link: http://lkml.kernel.org/r/20190503190500.GA30589@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: make !CONFIG_HUGE_PAGE wrappers into static inlinesJason Gunthorpe
Instead of using defines, which loses type safety and provokes unused variable warnings from gcc, put the constants into static inlines. Link: http://lkml.kernel.org/r/20190522235102.GA15370@mellanox.com Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12include/linux/pfn_t.h: remove pfn_t_to_virt()Andrew Morton
It has no callers and there is no virt_to_pfn_t(). Reported-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/kasan: add object validation in ksize()Marco Elver
ksize() has been unconditionally unpoisoning the whole shadow memory region associated with an allocation. This can lead to various undetected bugs, for example, double-kzfree(). Specifically, kzfree() uses ksize() to determine the actual allocation size, and subsequently zeroes the memory. Since ksize() used to just unpoison the whole shadow memory region, no invalid free was detected. This patch addresses this as follows: 1. Add a check in ksize(), and only then unpoison the memory region. 2. Preserve kasan_unpoison_slab() semantics by explicitly unpoisoning the shadow memory region using the size obtained from __ksize(). Tested: 1. With SLAB allocator: a) normal boot without warnings; b) verified the added double-kzfree() is detected. 2. With SLUB allocator: a) normal boot without warnings; b) verified the added double-kzfree() is detected. [elver@google.com: s/BUG_ON/WARN_ON_ONCE/, per Kees] Link: http://lkml.kernel.org/r/20190627094445.216365-6-elver@google.com Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199359 Link: http://lkml.kernel.org/r/20190626142014.141844-6-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/slab: refactor common ksize KASAN logic into slab_common.cMarco Elver
This refactors common code of ksize() between the various allocators into slab_common.c: __ksize() is the allocator-specific implementation without instrumentation, whereas ksize() includes the required KASAN logic. Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/kasan: change kasan_check_{read,write} to return booleanMarco Elver
This changes {,__}kasan_check_{read,write} functions to return a boolean denoting if the access was valid or not. [sfr@canb.auug.org.au: include types.h for "bool"] Link: http://lkml.kernel.org/r/20190705184949.13cdd021@canb.auug.org.au Link: http://lkml.kernel.org/r/20190626142014.141844-3-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/kasan: introduce __kasan_check_{read,write}Marco Elver
Patch series "mm/kasan: Add object validation in ksize()", v3. This patch (of 5): This introduces __kasan_check_{read,write}. __kasan_check functions may be used from anywhere, even compilation units that disable instrumentation selectively. This change eliminates the need for the __KASAN_INTERNAL definition. [elver@google.com: v5] Link: http://lkml.kernel.org/r/20190708170706.174189-2-elver@google.com Link: http://lkml.kernel.org/r/20190626142014.141844-2-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12asm-generic, x86: add bitops instrumentation for KASANMarco Elver
This adds a new header to asm-generic to allow optionally instrumenting architecture-specific asm implementations of bitops. This change includes the required change for x86 as reference and changes the kernel API doc to point to bitops-instrumented.h instead. Rationale: the functions in x86's bitops.h are no longer the kernel API functions, but instead the arch_ prefixed functions, which are then instrumented via bitops-instrumented.h. Other architectures can similarly add support for asm implementations of bitops. The documentation text was derived from x86 and existing bitops asm-generic versions: 1) references to x86 have been removed; 2) as a result, some of the text had to be reworded for clarity and consistency. Tested using lib/test_kasan with bitops tests (pre-requisite patch). Bugzilla ref: https://bugzilla.kernel.org/show_bug.cgi?id=198439 Link: http://lkml.kernel.org/r/20190613125950.197667-4-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12include/linux/dmar.h: replace single-char identifiers in macrosQian Cai
There are a few macros in IOMMU have single-char identifiers make the code hard to read and debug. Replace them with meaningful names. Link: http://lkml.kernel.org/r/1559566783-13627-1-git-send-email-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joerg Roedel <jroedel@suse.de> Cc: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12nilfs2: do not use unexported cpu_to_le32()/le32_to_cpu() in uapi headerMasahiro Yamada
cpu_to_le32/le32_to_cpu is defined in include/linux/byteorder/generic.h, which is not exported to user-space. UAPI headers must use the ones prefixed with double-underscore. Detected by compile-testing exported headers: include/linux/nilfs2_ondisk.h: In function `nilfs_checkpoint_set_snapshot': include/linux/nilfs2_ondisk.h:536:17: error: implicit declaration of function `cpu_to_le32' [-Werror=implicit-function-declaration] cp->cp_flags = cpu_to_le32(le32_to_cpu(cp->cp_flags) | \ ^ include/linux/nilfs2_ondisk.h:552:1: note: in expansion of macro `NILFS_CHECKPOINT_FNS' NILFS_CHECKPOINT_FNS(SNAPSHOT, snapshot) ^~~~~~~~~~~~~~~~~~~~ include/linux/nilfs2_ondisk.h:536:29: error: implicit declaration of function `le32_to_cpu' [-Werror=implicit-function-declaration] cp->cp_flags = cpu_to_le32(le32_to_cpu(cp->cp_flags) | \ ^ include/linux/nilfs2_ondisk.h:552:1: note: in expansion of macro `NILFS_CHECKPOINT_FNS' NILFS_CHECKPOINT_FNS(SNAPSHOT, snapshot) ^~~~~~~~~~~~~~~~~~~~ include/linux/nilfs2_ondisk.h: In function `nilfs_segment_usage_set_clean': include/linux/nilfs2_ondisk.h:622:19: error: implicit declaration of function `cpu_to_le64' [-Werror=implicit-function-declaration] su->su_lastmod = cpu_to_le64(0); ^~~~~~~~~~~ Link: http://lkml.kernel.org/r/20190605053006.14332-1-yamada.masahiro@socionext.com Fixes: e63e88bc53ba ("nilfs2: move ioctl interface and disk layout to uapi separately") Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Joe Perches <joe@perches.com> Cc: <stable@vger.kernel.org> [4.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/nvdimm: add is_ioremap_addr and use that to check ioremap addressAneesh Kumar K.V
Architectures like powerpc use different address range to map ioremap and vmalloc range. The memunmap() check used by the nvdimm layer was wrongly using is_vmalloc_addr() to check for ioremap range which fails for ppc64. This result in ppc64 not freeing the ioremap mapping. The side effect of this is an unbind failure during module unload with papr_scm nvdimm driver Link: http://lkml.kernel.org/r/20190701134038.14165-1-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Fixes: b5beae5e224f ("powerpc/pseries: Add driver for PAPR SCM regions") Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12clk: consoldiate the __clk_get_hw() declarationsStephen Rothwell
Without this we were getting errors like: In file included from drivers/clk/clkdev.c:22:0: drivers/clk/clk.h:36:23: error: static declaration of '__clk_get_hw' follows non-static declaration include/linux/clk-provider.h:808:16: note: previous declaration of '__clk_get_hw' was here Fixes: 59fcdce425b7 ("clk: Remove ifdef for COMMON_CLK in clk-provider.h") fixes: 73e0e496afda ("clkdev: Always allocate a struct clk and call __clk_get() w/ CCF") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2019-07-12SUNRPC: Fix transport accounting when caller specifies an rpc_xprtTrond Myklebust
Ensure that we do the required accounting for the round robin queue when the caller to rpc_init_task() has passed in a transport to be used. Reported-by: Olga Kornievskaia <aglo@umich.edu> Reported-by: Neil Brown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-12Merge tag 'nfs-rdma-for-5.3-1' of ↵Trond Myklebust
git://git.linux-nfs.org/projects/anna/linux-nfs NFSoRDMA client updates for 5.3 New features: - Add a way to place MRs back on the free list - Reduce context switching - Add new trace events Bugfixes and cleanups: - Fix a BUG when tracing is enabled with NFSv4.1 - Fix a use-after-free in rpcrdma_post_recvs - Replace use of xdr_stream_pos in rpcrdma_marshal_req - Fix occasional transport deadlock - Fix show_nfs_errors macros, other tracing improvements - Remove RPCRDMA_REQ_F_PENDING and fr_state - Various simplifications and refactors
2019-07-11block: Limit zone array allocation sizeDamien Le Moal
Limit the size of the struct blk_zone array used in blk_revalidate_disk_zones() to avoid memory allocation failures leading to disk revalidation failure. Also further reduce the likelyhood of such failures by using kvcalloc() (that is vmalloc()) instead of allocating contiguous pages with alloc_pages(). Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: e76239a3748c ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-11block: Kill gfp_t argument of blkdev_report_zones()Damien Le Moal
Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In preparation of using vmalloc() for large report buffer and zone array allocations used by this function, remove its "gfp_t gfp_mask" argument and rely on the caller context to use memalloc_noio_save/restore() where necessary (block layer zone revalidation and dm-zoned I/O error path). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-11Merge tag 'tag-chrome-platform-for-v5.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux Pull chrome platform updates from Benson Leung "CrOS EC: - Add new CrOS ISHTP transport protocol - Add proper documentation for debugfs entries and expose resume and uptime files - Select LPC transport protocol variant at runtime. - Add lid angle sensor driver - Fix oops on suspend/resume for lightbar driver - Set CrOS SPI transport protol in realtime Wilco EC: - Add telemetry char device interface - Add support for event handling - Add new sysfs attributes Misc: - Contains ib-mfd-cros-v5.3 immutable branch from mfd, with cros_ec_commands.h header freshly synced with Chrome OS's EC project" * tag 'tag-chrome-platform-for-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux: (54 commits) mfd / platform: cros_ec_debugfs: Expose resume result via debugfs platform/chrome: lightbar: Get drvdata from parent in suspend/resume iio: cros_ec: Add lid angle driver platform/chrome: wilco_ec: Add circular buffer as event queue platform/chrome: cros_ec_lpc_mec: Fix kernel-doc comment first line platform/chrome: cros_ec_lpc: Choose Microchip EC at runtime platform/chrome: cros_ec_lpc: Merge cros_ec_lpc and cros_ec_lpc_reg Input: cros_ec_keyb: mask out extra flags in event_type platform/chrome: wilco_ec: Fix unreleased lock in event_read() platform/chrome: cros_ec_debugfs: cros_ec_uptime_fops can be static platform/chrome: cros_ec_debugfs: Add debugfs ABI documentation platform/chrome: cros_ec_debugfs: Fix kernel-doc comment first line platform/chrome: cros_ec_debugfs: Add debugfs entry to retrieve EC uptime mfd: cros_ec: Update I2S API mfd: cros_ec: Add Management API entry points mfd: cros_ec: Add SKU ID and Secure storage API mfd: cros_ec: Add API for rwsig mfd: cros_ec: Add API for Fingerprint support mfd: cros_ec: Add API for Touchpad support mfd: cros_ec: Add API for EC-EC communication ...
2019-07-11Merge tag 'devicetree-for-5.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull Devicetree updates from Rob Herring: - DT binding schema examples are now validated against the schemas. Various examples are fixed due to that. - Sync dtc with upstream version v1.5.0-30-g702c1b6c0e73 - Initial schemas for networking bindings. This includes ethernet, phy and mdio common bindings with several Allwinner and stmmac converted to the schema. - Conversion of more Arm top-level SoC/board bindings to DT schema - Conversion of PSCI binding to DT schema - Rework Arm CPU schema to coexist with other CPU schemas - Add a bunch of missing vendor prefixes and new ones for SoChip, Sipeed, Kontron, B&R Industrial Automation GmbH, and Espressif - Add Mediatek UART RX wakeup support to binding - Add reset to ST UART binding - Remove some Linuxisms from the endianness common-properties.txt binding - Make the flattened DT read-only after init - Ignore disabled reserved memory nodes - Clean-up some dead code in FDT functions * tag 'devicetree-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (56 commits) dt-bindings: vendor-prefixes: add Sipeed dt-bindings: vendor-prefixes: add SoChip dt-bindings: 83xx-512x-pci: Drop cell-index property dt-bindings: serial: add documentation for Rx in-band wakeup support dt-bindings: arm: Convert RDA Micro board/soc bindings to json-schema of: unittest: simplify getting the adapter of a client of/fdt: pass early_init_dt_reserve_memory_arch() with bool type nomap of/platform: Drop superfluous cast in of_device_make_bus_id() dt-bindings: usb: ehci: Fix example warnings dt-bindings: net: Use phy-mode instead of phy-connection-type dt-bindings: simple-framebuffer: Add requirement for pipelines dt-bindings: display: Fix simple-framebuffer example dt-bindings: net: mdio: Add child nodes dt-bindings: net: mdio: Add address and size cells dt-bindings: net: mdio: Add a nodename pattern dt-bindings: mtd: sunxi-nand: Drop 'maxItems' from child 'reg' property dt-bindings: arm: Limit cpus schema to only check Arm 'cpu' nodes dt-bindings: backlight: lm3630a: correct schema validation dt-bindings: net: dwmac: Deprecate the PHY reset properties dt-bindings: net: sun8i-emac: Convert the binding to a schemas ...
2019-07-11Merge tag 'mmc-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmcLinus Torvalds
Pull MMC updates from Ulf Hansson: "MMC core: - Let the dma map ops deal with bouncing and drop dma_max_pfn() from the dma-mapping interface for ARM - Convert the generic MMC DT doc to YAML schemas - Drop questionable support for powered-on re-init of SDIO cards at runtime resume and for SDIO HW reset - Prevent questionable re-init of powered-on removable SDIO cards at system resume - Cleanup and clarify some SDIO core code MMC host: - tmio: Make runtime PM enablement more flexible for variants - tmio/renesas_sdhi: Rename DT doc tmio_mmc.txt to renesas,sdhi.txt to clarify - sdhci-pci: Add support for Intel EHL - sdhci-pci-o2micro: Enable support for 8-bit bus - sdhci-msm: Prevent acquiring a mutex while holding a spin_lock - sdhci-of-esdhc: Improve clock management and tuning - sdhci_am654: Enable support for 4 and 8-bit bus on J721E - sdhci-sprd: Use pinctrl for a proper signal voltage switch - sdhci-sprd: Add support for HS400 enhanced strobe mode - sdhci-sprd: Enable PHY DLL and allow delay config to stabilize the clock - sdhci-sprd: Add support for optional gate clock - sunxi-mmc: Convert DT doc to YAML schemas - meson-gx: Add support for broken DRAM access for DMA MEMSTICK core: - Fixup error path of memstick_init()" * tag 'mmc-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (52 commits) mmc: sdhci_am654: Add dependency on MMC_SDHCI_AM654 mmc: alcor: remove a redundant greater or equal to zero comparison mmc: sdhci-msm: fix mutex while in spinlock mmc: sdhci_am654: Make some symbols static dma-mapping: remove dma_max_pfn mmc: core: let the dma map ops handle bouncing dt-binding: mmc: rename tmio_mmc.txt to renesas,sdhi.txt mmc: sdhci-sprd: Add pin control support for voltage switch dt-bindings: mmc: sprd: Add pinctrl support mmc: sdhci-sprd: Add start_signal_voltage_switch ops mmc: sdhci-pci: Add support for Intel EHL mmc: tmio: Use dma_max_mapping_size() instead of a workaround mmc: sdio: Drop unused in-parameter from mmc_sdio_init_card() mmc: sdio: Drop unused in-parameter to mmc_sdio_reinit_card() mmc: sdio: Don't re-initialize powered-on removable SDIO cards at resume mmc: sdio: Drop powered-on re-init at runtime resume and HW reset mmc: sdio: Move comment about re-initialization to mmc_sdio_reinit_card() mmc: sdio: Drop mmc_claim|release_host() in mmc_sdio_power_restore() mmc: sdio: Turn sdio_run_irqs() into static mmc: sdhci: Fix indenting on SDHCI_CTRL_8BITBUS ...
2019-07-11Merge tag 'usb-5.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB / PHY updates from Greg KH: "Here is the big USB and PHY driver pull request for 5.3-rc1. Lots of stuff here, all of which has been in linux-next for a while with no reported issues. Nothing is earth-shattering, just constant forward progress for more devices supported and cleanups and small fixes: - USB gadget driver updates and fixes - new USB gadget driver for some hardware, followed by a quick revert of those patches as they were not ready to be merged... - PHY driver updates - Lots of new driver additions and cleanups with a few fixes mixed in" * tag 'usb-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (145 commits) Revert "usb: gadget: storage: Remove warning message" Revert "dt-bindings: add binding for USBSS-DRD controller." Revert "usb:gadget Separated decoding functions from dwc3 driver." Revert "usb:gadget Patch simplify usb_decode_set_clear_feature function." Revert "usb:gadget Simplify usb_decode_get_set_descriptor function." Revert "usb:cdns3 Add Cadence USB3 DRD Driver" Revert "usb:cdns3 Fix for stuck packets in on-chip OUT buffer." usb :fsl: Change string format for errata property usb: host: Stops USB controller init if PLL fails to lock usb: linux/fsl_device: Add platform member has_fsl_erratum_a006918 usb: phy: Workaround for USB erratum-A005728 usb: fsl: Set USB_EN bit to select ULPI phy usb: Handle USB3 remote wakeup for LPM enabled devices correctly drivers/usb/typec/tps6598x.c: fix 4CC cmd write drivers/usb/typec/tps6598x.c: fix portinfo width usb: storage: scsiglue: Do not skip VPD if try_vpd_pages is set usb: renesas_usbhs: add a workaround for a race condition of workqueue usb: gadget: udc: renesas_usb3: remove redundant assignment to ret usb: dwc2: use a longer AHB idle timeout in dwc2_core_reset() USB: gadget: function: fix issue Unneeded variable: "value" ...
2019-07-11Merge tag 'tty-5.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty / serial driver updates from Greg KH: "Here is the "large" TTY and Serial driver update for 5.3-rc1. It's in the negative number of lines overall as we removed an obsolete serial driver that was causing problems for some people who were trying to clean up some apis (the mpsc.c driver, which only worked for some pre-production hardware that no one has anymore.) Other than that, lots of tiny changes, cleaning up small things along with some platform-specific serial driver updates. All of these have been in linux-next for a while now with no reported issues" * tag 'tty-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (68 commits) tty: serial: fsl_lpuart: add imx8qxp support serial: imx: set_termios(): preserve RTS state serial: imx: set_termios(): clarify RTS/CTS bits calculation serial: imx: set_termios(): factor-out 'ucr2' initial value serial: sh-sci: Terminate TX DMA during buffer flushing serial: sh-sci: Fix TX DMA buffer flushing and workqueue races serial: mpsc: Remove obsolete MPSC driver serial: 8250: 8250_core: Fix missing unlock on error in serial8250_register_8250_port() serial: stm32: add RX and TX FIFO flush serial: stm32: add support of RX FIFO threshold serial: stm32: add support of TX FIFO threshold serial: stm32: update PIO transmission serial: stm32: add support of timeout interrupt for RX Revert "serial: 8250: Don't service RX FIFO if interrupts are disabled" tty/serial/8250: use mctrl_gpio helpers serial: mctrl_gpio: Check if GPIO property exisits before requesting it serial: 8250: pericom_do_set_divisor can be static tty: serial_core: Set port active bit in uart_port_activate serial: 8250: Add MSR/MCR TIOCM conversion wrapper functions serial: 8250: factor out serial8250_{set,clear}_THRI() helpers ...
2019-07-11Merge tag 'char-misc-5.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char / misc driver updates from Greg KH: "Here is the "large" pull request for char and misc and other assorted smaller driver subsystems for 5.3-rc1. It seems that this tree is becoming the funnel point of lots of smaller driver subsystems, which is fine for me, but that's why it is getting larger over time and does not just contain stuff under drivers/char/ and drivers/misc. Lots of small updates all over the place here from different driver subsystems: - habana driver updates - coresight driver updates - documentation file movements and updates - Android binder fixes and updates - extcon driver updates - google firmware driver updates - fsi driver updates - smaller misc and char driver updates - soundwire driver updates - nvmem driver updates - w1 driver fixes All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (188 commits) coresight: Do not default to CPU0 for missing CPU phandle dt-bindings: coresight: Change CPU phandle to required property ocxl: Allow contexts to be attached with a NULL mm fsi: sbefifo: Don't fail operations when in SBE IPL state coresight: tmc: Smatch: Fix potential NULL pointer dereference coresight: etm3x: Smatch: Fix potential NULL pointer dereference coresight: Potential uninitialized variable in probe() coresight: etb10: Do not call smp_processor_id from preemptible coresight: tmc-etf: Do not call smp_processor_id from preemptible coresight: tmc-etr: alloc_perf_buf: Do not call smp_processor_id from preemptible coresight: tmc-etr: Do not call smp_processor_id() from preemptible docs: misc-devices: convert files without extension to ReST fpga: dfl: fme: align PR buffer size per PR datawidth fpga: dfl: fme: remove copy_to_user() in ioctl for PR fpga: dfl-fme-mgr: fix FME_PR_INTFC_ID register address. intel_th: msu: Start read iterator from a non-empty window intel_th: msu: Split sgt array and pointer in multiwindow mode intel_th: msu: Support multipage blocks intel_th: pci: Add Ice Lake NNPI support intel_th: msu: Fix single mode with disabled IOMMU ...
2019-07-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching Pull livepatching updates from Jiri Kosina: - stacktrace handling improvements from Miroslav benes - debug output improvements from Petr Mladek * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching: livepatch: Remove duplicate warning about missing reliable stacktrace support Revert "livepatch: Remove reliable stacktrace check in klp_try_switch_task()" stacktrace: Remove weak version of save_stack_trace_tsk_reliable() livepatch: Use static buffer for debugging messages under rq lock livepatch: Remove stale kobj_added entries from kernel-doc descriptions
2019-07-11Merge tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI scatter-gather list updates from James Bottomley: "This topic branch covers a fundamental change in how our sg lists are allocated to make mq more efficient by reducing the size of the preallocated sg list. This necessitates a large number of driver changes because the previous guarantee that if a driver specified SG_ALL as the size of its scatter list, it would get a non-chained list and didn't need to bother with scatterlist iterators is now broken and every driver *must* use scatterlist iterators. This was broken out as a separate topic because we need to convert all the drivers before pulling the trigger and unconverted drivers kept being found, necessitating a rebase" * tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (21 commits) scsi: core: don't preallocate small SGL in case of NO_SG_CHAIN scsi: lib/sg_pool.c: clear 'first_chunk' in case of no preallocation scsi: core: avoid preallocating big SGL for data scsi: core: avoid preallocating big SGL for protection information scsi: lib/sg_pool.c: improve APIs for allocating sg pool scsi: esp: use sg helper to iterate over scatterlist scsi: NCR5380: use sg helper to iterate over scatterlist scsi: wd33c93: use sg helper to iterate over scatterlist scsi: ppa: use sg helper to iterate over scatterlist scsi: pcmcia: nsp_cs: use sg helper to iterate over scatterlist scsi: imm: use sg helper to iterate over scatterlist scsi: aha152x: use sg helper to iterate over scatterlist scsi: s390: zfcp_fc: use sg helper to iterate over scatterlist scsi: staging: unisys: visorhba: use sg helper to iterate over scatterlist scsi: usb: image: microtek: use sg helper to iterate over scatterlist scsi: pmcraid: use sg helper to iterate over scatterlist scsi: ipr: use sg helper to iterate over scatterlist scsi: mvumi: use sg helper to iterate over scatterlist scsi: lpfc: use sg helper to iterate over scatterlist scsi: advansys: use sg helper to iterate over scatterlist ...
2019-07-11Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "This is mostly update of the usual drivers: qla2xxx, hpsa, lpfc, ufs, mpt3sas, ibmvscsi, megaraid_sas, bnx2fc and hisi_sas as well as the removal of the osst driver (I heard from Willem privately that he would like the driver removed because all his test hardware has failed). Plus number of minor changes, spelling fixes and other trivia. The big merge conflict this time around is the SPDX licence tags. Following discussion on linux-next, we believe our version to be more accurate than the one in the tree, so the resolution is to take our version for all the SPDX conflicts" Note on the SPDX license tag conversion conflicts: the SCSI tree had done its own SPDX conversion, which in some cases conflicted with the treewide ones done by Thomas & co. In almost all cases, the conflicts were purely syntactic: the SCSI tree used the old-style SPDX tags ("GPL-2.0" and "GPL-2.0+") while the treewide conversion had used the new-style ones ("GPL-2.0-only" and "GPL-2.0-or-later"). In these cases I picked the new-style one. In a few cases, the SPDX conversion was actually different, though. As explained by James above, and in more detail in a pre-pull-request thread: "The other problem is actually substantive: In the libsas code Luben Tuikov originally specified gpl 2.0 only by dint of stating: * This file is licensed under GPLv2. In all the libsas files, but then muddied the water by quoting GPLv2 verbatim (which includes the or later than language). So for these files Christoph did the conversion to v2 only SPDX tags and Thomas converted to v2 or later tags" So in those cases, where the spdx tag substantially mattered, I took the SCSI tree conversion of it, but then also took the opportunity to turn the old-style "GPL-2.0" into a new-style "GPL-2.0-only" tag. Similarly, when there were whitespace differences or other differences to the comments around the copyright notices, I took the version from the SCSI tree as being the more specific conversion. Finally, in the spdx conversions that had no conflicts (because the treewide ones hadn't been done for those files), I just took the SCSI tree version as-is, even if it was old-style. The old-style conversions are perfectly valid, even if the "-only" and "-or-later" versions are perhaps more descriptive. * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (185 commits) scsi: qla2xxx: move IO flush to the front of NVME rport unregistration scsi: qla2xxx: Fix NVME cmd and LS cmd timeout race condition scsi: qla2xxx: on session delete, return nvme cmd scsi: qla2xxx: Fix kernel crash after disconnecting NVMe devices scsi: megaraid_sas: Update driver version to 07.710.06.00-rc1 scsi: megaraid_sas: Introduce various Aero performance modes scsi: megaraid_sas: Use high IOPS queues based on IO workload scsi: megaraid_sas: Set affinity for high IOPS reply queues scsi: megaraid_sas: Enable coalescing for high IOPS queues scsi: megaraid_sas: Add support for High IOPS queues scsi: megaraid_sas: Add support for MPI toolbox commands scsi: megaraid_sas: Offload Aero RAID5/6 division calculations to driver scsi: megaraid_sas: RAID1 PCI bandwidth limit algorithm is applicable for only Ventura scsi: megaraid_sas: megaraid_sas: Add check for count returned by HOST_DEVICE_LIST DCMD scsi: megaraid_sas: Handle sequence JBOD map failure at driver level scsi: megaraid_sas: Don't send FPIO to RL Bypass queue scsi: megaraid_sas: In probe context, retry IOC INIT once if firmware is in fault scsi: megaraid_sas: Release Mutex lock before OCR in case of DCMD timeout scsi: megaraid_sas: Call disable_irq from process IRQ poll scsi: megaraid_sas: Remove few debug counters from IO path ...
2019-07-11Merge tag 'mlx5-fixes-2019-07-11' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== Mellanox, mlx5 fixes 2019-07-11 This series introduces some fixes to mlx5 driver. Please pull and let me know if there is any problem. For -stable v4.15 ('net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn') For -stable v5.1 ('net/mlx5e: Fix port tunnel GRE entropy control') ('net/mlx5e: Rx, Fix checksum calculation for new hardware') ('net/mlx5e: Fix return value from timeout recover function') ('net/mlx5e: Fix error flow in tx reporter diagnose') For -stable v5.2 ('net/mlx5: E-Switch, Fix default encap mode') Conflict note: This pull request will produce a small conflict when merged with net-next. In drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c Take the hunk from net and replace: esw_offloads_steering_init(esw, vf_nvports, total_nvports); with: esw_offloads_steering_init(esw); ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-12Merge tag 'imx-drm-next-2019-07-05' of ↵Dave Airlie
git://git.pengutronix.de/git/pza/linux into drm-next drm/imx: IPUv3 image converter improvements, enable scanout FIFO watermark - Fix a saturation bit position in the colorspace converter configuration memory. - Fully describe colorspace conversions in the API to the imx-media driver. - Add support for limited range and Rec.709 YUV encoding. - Enable colorimetry configuration via the media-controller API. - Enable the double write reduction feature for memory bandwidth savings when the image converter writes YUV 4:2:0 output. - Enable a scanout FIFO watermark feature that can increase priority of scanout read transfers at the memory controller whenever the FIFO runs low. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Philipp Zabel <p.zabel@pengutronix.de> Link: https://patchwork.freedesktop.org/patch/msgid/1562326831.4291.8.camel@pengutronix.de
2019-07-11net: fib_rules: do not flow dissect local packetsPetar Penkov
Rules matching on loopback iif do not need early flow dissection as the packet originates from the host. Stop counting such rules in fib_rule_requires_fldissect Signed-off-by: Petar Penkov <ppenkov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-11net/mlx5e: Rx, Fix checksum calculation for new hardwareSaeed Mahameed
CQE checksum full mode in new HW, provides a full checksum of rx frame. Covering bytes starting from eth protocol up to last byte in the received frame (frame_size - ETH_HLEN), as expected by the stack. Fixing up skb->csum by the driver is not required in such case. This fix is to avoid wrong checksum calculation in drivers which already support the new hardware with the new checksum mode. Fixes: 85327a9c4150 ("net/mlx5: Update the list of the PCI supported devices") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: "Some highlights from this development cycle: 1) Big refactoring of ipv6 route and neigh handling to support nexthop objects configurable as units from userspace. From David Ahern. 2) Convert explored_states in BPF verifier into a hash table, significantly decreased state held for programs with bpf2bpf calls, from Alexei Starovoitov. 3) Implement bpf_send_signal() helper, from Yonghong Song. 4) Various classifier enhancements to mvpp2 driver, from Maxime Chevallier. 5) Add aRFS support to hns3 driver, from Jian Shen. 6) Fix use after free in inet frags by allocating fqdirs dynamically and reworking how rhashtable dismantle occurs, from Eric Dumazet. 7) Add act_ctinfo packet classifier action, from Kevin Darbyshire-Bryant. 8) Add TFO key backup infrastructure, from Jason Baron. 9) Remove several old and unused ISDN drivers, from Arnd Bergmann. 10) Add devlink notifications for flash update status to mlxsw driver, from Jiri Pirko. 11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski. 12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes. 13) Various enhancements to ipv6 flow label handling, from Eric Dumazet and Willem de Bruijn. 14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van der Merwe, and others. 15) Various improvements to axienet driver including converting it to phylink, from Robert Hancock. 16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean. 17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana Radulescu. 18) Add devlink health reporting to mlx5, from Moshe Shemesh. 19) Convert stmmac over to phylink, from Jose Abreu. 20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from Shalom Toledo. 21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera. 22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel. 23) Track spill/fill of constants in BPF verifier, from Alexei Starovoitov. 24) Support bounded loops in BPF, from Alexei Starovoitov. 25) Various page_pool API fixes and improvements, from Jesper Dangaard Brouer. 26) Just like ipv4, support ref-countless ipv6 route handling. From Wei Wang. 27) Support VLAN offloading in aquantia driver, from Igor Russkikh. 28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy. 29) Add flower GRE encap/decap support to nfp driver, from Pieter Jansen van Vuuren. 30) Protect against stack overflow when using act_mirred, from John Hurley. 31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen. 32) Use page_pool API in netsec driver, Ilias Apalodimas. 33) Add Google gve network driver, from Catherine Sullivan. 34) More indirect call avoidance, from Paolo Abeni. 35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan. 36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek. 37) Add MPLS manipulation actions to TC, from John Hurley. 38) Add sending a packet to connection tracking from TC actions, and then allow flower classifier matching on conntrack state. From Paul Blakey. 39) Netfilter hw offload support, from Pablo Neira Ayuso" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits) net/mlx5e: Return in default case statement in tx_post_resync_params mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync(). net: dsa: add support for BRIDGE_MROUTER attribute pkt_sched: Include const.h net: netsec: remove static declaration for netsec_set_tx_de() net: netsec: remove superfluous if statement netfilter: nf_tables: add hardware offload support net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload net: flow_offload: add flow_block_cb_is_busy() and use it net: sched: remove tcf block API drivers: net: use flow block API net: sched: use flow block API net: flow_offload: add flow_block_cb_{priv, incref, decref}() net: flow_offload: add list handling functions net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free() net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_* net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND net: flow_offload: add flow_block_cb_setup_simple() net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC ...
2019-07-11Merge tag 'clone3-v5.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull clone3 system call from Christian Brauner: "This adds the clone3 syscall which is an extensible successor to clone after we snagged the last flag with CLONE_PIDFD during the 5.2 merge window for clone(). It cleanly supports all of the flags from clone() and thus all legacy workloads. There are few user visible differences between clone3 and clone. First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse this flag. Second, the CSIGNAL flag is deprecated and will cause EINVAL to be reported. It is superseeded by a dedicated "exit_signal" argument in struct clone_args thus freeing up even more flags. And third, clone3 gives CLONE_PIDFD a dedicated return argument in struct clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr argument. The clone3 uapi is designed to be easy to handle on 32- and 64 bit: /* uapi */ struct clone_args { __aligned_u64 flags; __aligned_u64 pidfd; __aligned_u64 child_tid; __aligned_u64 parent_tid; __aligned_u64 exit_signal; __aligned_u64 stack; __aligned_u64 stack_size; __aligned_u64 tls; }; and a separate kernel struct is used that uses proper kernel typing: /* kernel internal */ struct kernel_clone_args { u64 flags; int __user *pidfd; int __user *child_tid; int __user *parent_tid; int exit_signal; unsigned long stack; unsigned long stack_size; unsigned long tls; }; The system call comes with a size argument which enables the kernel to detect what version of clone_args userspace is passing in. clone3 validates that any additional bytes a given kernel does not know about are set to zero and that the size never exceeds a page. A nice feature is that this patchset allowed us to cleanup and simplify various core kernel codepaths in kernel/fork.c by making the internal _do_fork() function take struct kernel_clone_args even for legacy clone(). This patch also unblocks the time namespace patchset which wants to introduce a new CLONE_TIMENS flag. Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and xtensa. These were the architectures that did not require special massaging. Other architectures treat fork-like system calls individually and after some back and forth neither Arnd nor I felt confident that we dared to add clone3 unconditionally to all architectures. We agreed to leave this up to individual architecture maintainers. This is why there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3 which any architecture can set once it has implemented support for clone3. The patch also adds a cond_syscall(clone3) for architectures such as nios2 or h8300 that generate their syscall table by simply including asm-generic/unistd.h. The hope is to get rid of __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon" * tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: arch: handle arches who do not yet define clone3 arch: wire-up clone3() syscall fork: add clone3
2019-07-11PCI: Enable NVIDIA HDA controllersLukas Wunner
Many NVIDIA GPUs can be configured as either a single-function video device or a multi-function device with video at function 0 and an HDA audio controller at function 1. The HDA controller can be enabled or disabled by a bit in the function 0 config space. Some BIOSes leave the HDA disabled, which means the HDMI connector from the NVIDIA GPU may not work. Sometimes the BIOS enables the HDA if an HDMI cable is connected at boot time, but that doesn't handle hotplug cases. Enable the HDA controller on device enumeration and resume and re-read the header type, which tells us whether the GPU is a multi-function device. This quirk is limited to NVIDIA PCI devices with the VGA Controller device class. This is expected to correspond to product configurations where the NVIDIA GPU has connectors attached. Other products where the device class is 3D Controller are expected to correspond to configurations where the NVIDIA GPU is dedicated (dGPU) and has no connectors. See original post (URL below) for more details. This commit takes inspiration from an earlier patch by Daniel Drake. Link: https://lore.kernel.org/r/20190708051744.24039-1-drake@endlessm.com v2 Link: https://lore.kernel.org/r/20190613063514.15317-1-drake@endlessm.com v1 Link: https://devtalk.nvidia.com/default/topic/1024022 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=75985 Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Daniel Drake <drake@endlessm.com> [bhelgaas: commit log, log message, return early if already enabled] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Aaron Plattner <aplattner@nvidia.com> Cc: Peter Wu <peter@lekensteyn.nl> Cc: Ilia Mirkin <imirkin@alum.mit.edu> Cc: Karol Herbst <kherbst@redhat.com> Cc: Maik Freudenberg <hhfeuer@gmx.de>