summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-10-25locking/qrwlock: Prevent slowpath writers getting held up by fastpathWill Deacon
When a prospective writer takes the qrwlock locking slowpath due to the lock being held, it attempts to cmpxchg the wmode field from 0 to _QW_WAITING so that concurrent lockers also take the slowpath and queue on the spinlock accordingly, allowing the lockers to drain. Unfortunately, this isn't fair, because a fastpath writer that comes in after the lock is made available but before the _QW_WAITING flag is set can effectively jump the queue. If there is a steady stream of prospective writers, then the waiter will be held off indefinitely. This patch restores fairness by separating _QW_WAITING and _QW_LOCKED into two distinct fields: _QW_LOCKED continues to occupy the bottom byte of the lockword so that it can be cleared unconditionally when unlocking, but _QW_WAITING now occupies what used to be the bottom bit of the reader count. This then forces the slow-path for concurrent lockers. Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jeremy Linton <jeremy.linton@arm.com> Tested-by: Adam Wallis <awallis@codeaurora.org> Tested-by: Jan Glauber <jglauber@cavium.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Jeremy.Linton@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1507810851-306-6-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlockWill Deacon
The qrwlock slowpaths involve spinning when either a prospective reader is waiting for a concurrent writer to drain, or a prospective writer is waiting for concurrent readers to drain. In both of these situations, atomic_cond_read_acquire() can be used to avoid busy-waiting and make use of any backoff functionality provided by the architecture. This patch replaces the open-code loops and rspin_until_writer_unlock() implementation with atomic_cond_read_acquire(). The write mode transition zero to _QW_WAITING is left alone, since (a) this doesn't need acquire semantics and (b) should be fast. Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jeremy Linton <jeremy.linton@arm.com> Tested-by: Adam Wallis <awallis@codeaurora.org> Tested-by: Jan Glauber <jglauber@cavium.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Jeremy.Linton@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1507810851-306-4-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25locking/qrwlock: Use 'struct qrwlock' instead of 'struct __qrwlock'Will Deacon
There's no good reason to keep the internal structure of struct qrwlock hidden from qrwlock.h, particularly as it's actually needed for unlock and ends up being abstracted independently behind the __qrwlock_write_byte() function. Stop pretending we can hide this stuff, and move the __qrwlock definition into qrwlock, removing the __qrwlock_write_byte() nastiness and using the same struct definition everywhere instead. Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Jeremy.Linton@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1507810851-306-2-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24locking/barriers: Convert users of lockless_dereference() to READ_ONCE()Will Deacon
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it can be used instead of lockless_dereference() without any change in semantics. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24Merge tag 'v4.14-rc6' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-22Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp/hotplug fix from Thomas Gleixner: "The recent rework of the callback invocation missed to cleanup the leftovers of the operation, so under certain circumstances a subsequent CPU hotplug operation accesses stale data and crashes. Clean it up." * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Reset node state after operation
2017-10-22Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A set of small fixes mostly in the irq drivers area: - Make the tango irq chip work correctly, which requires a new function in the generiq irq chip implementation - A set of updates to the GIC-V3 ITS driver removing a bogus BUG_ON() and parsing the VCPU table size correctly" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: generic chip: remove irq_gc_mask_disable_reg_and_ack() irqchip/tango: Use irq_gc_mask_disable_and_ack_set genirq: generic chip: Add irq_gc_mask_disable_and_ack_set() irqchip/gic-v3-its: Add missing changes to support 52bit physical address irqchip/gic-v3-its: Fix the incorrect parsing of VCPU table size irqchip/gic-v3-its: Fix the incorrect BUG_ON in its_init_vpe_domain() DT: arm,gic-v3: Update the ITS size in the examples
2017-10-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "A little more than usual this time around. Been travelling, so that is part of it. Anyways, here are the highlights: 1) Deal with memcontrol races wrt. listener dismantle, from Eric Dumazet. 2) Handle page allocation failures properly in nfp driver, from Jaku Kicinski. 3) Fix memory leaks in macsec, from Sabrina Dubroca. 4) Fix crashes in pppol2tp_session_ioctl(), from Guillaume Nault. 5) Several fixes in bnxt_en driver, including preventing potential NVRAM parameter corruption from Michael Chan. 6) Fix for KRACK attacks in wireless, from Johannes Berg. 7) rtnetlink event generation fixes from Xin Long. 8) Deadlock in mlxsw driver, from Ido Schimmel. 9) Disallow arithmetic operations on context pointers in bpf, from Jakub Kicinski. 10) Missing sock_owned_by_user() check in sctp_icmp_redirect(), from Xin Long. 11) Only TCP is supported for sockmap, make that explicit with a check, from John Fastabend. 12) Fix IP options state races in DCCP and TCP, from Eric Dumazet. 13) Fix panic in packet_getsockopt(), also from Eric Dumazet. 14) Add missing locked in hv_sock layer, from Dexuan Cui. 15) Various aquantia bug fixes, including several statistics handling cures. From Igor Russkikh et al. 16) Fix arithmetic overflow in devmap code, from John Fastabend. 17) Fix busted socket memory accounting when we get a fault in the tcp zero copy paths. From Willem de Bruijn. 18) Don't leave opt->tot_len uninitialized in ipv6, from Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) stmmac: Don't access tx_q->dirty_tx before netif_tx_lock ipv6: flowlabel: do not leave opt->tot_len with garbage of_mdio: Fix broken PHY IRQ in case of probe deferral textsearch: fix typos in library helpers rxrpc: Don't release call mutex on error pointer net: stmmac: Prevent infinite loop in get_rx_timestamp_status() net: stmmac: Fix stmmac_get_rx_hwtstamp() net: stmmac: Add missing call to dev_kfree_skb() mlxsw: spectrum_router: Configure TIGCR on init mlxsw: reg: Add Tunneling IPinIP General Configuration Register net: ethtool: remove error check for legacy setting transceiver type soreuseport: fix initialization race net: bridge: fix returning of vlan range op errors sock: correct sk_wmem_queued accounting on efault in tcp zerocopy bpf: add test cases to bpf selftests to cover all access tests bpf: fix pattern matches for direct packet access bpf: fix off by one for range markings with L{T, E} patterns bpf: devmap fix arithmetic overflow in bitmap_size calculation net: aquantia: Bad udp rate on default interrupt coalescing net: aquantia: Enable coalescing management via ethtool interface ...
2017-10-22bpf: fix pattern matches for direct packet accessDaniel Borkmann
Alexander had a test program with direct packet access, where the access test was in the form of data + X > data_end. In an unrelated change to the program LLVM decided to swap the branches and emitted code for the test in form of data + X <= data_end. We hadn't seen these being generated previously, thus verifier would reject the program. Therefore, fix up the verifier to detect all test cases, so we don't run into such issues in the future. Fixes: b4e432f1000a ("bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier") Reported-by: Alexander Alemayhu <alexander@alemayhu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22bpf: fix off by one for range markings with L{T, E} patternsDaniel Borkmann
During review I noticed that the current logic for direct packet access marking in check_cond_jmp_op() has an off by one for the upper right range border when marking in find_good_pkt_pointers() with BPF_JLT and BPF_JLE. It's not really harmful given access up to pkt_end is always safe, but we should nevertheless correct the range marking before it becomes ABI. If pkt_data' denotes a pkt_data derived pointer (pkt_data + X), then for pkt_data' < pkt_end in the true branch as well as for pkt_end <= pkt_data' in the false branch we mark the range with X although it should really be X - 1 in these cases. For example, X could be pkt_end - pkt_data, then when testing for pkt_data' < pkt_end the verifier simulation cannot deduce that a byte load of pkt_data' - 1 would succeed in this branch. Fixes: b4e432f1000a ("bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22bpf: devmap fix arithmetic overflow in bitmap_size calculationJohn Fastabend
An integer overflow is possible in dev_map_bitmap_size() when calculating the BITS_TO_LONG logic which becomes, after macro replacement, (((n) + (d) - 1)/ (d)) where 'n' is a __u32 and 'd' is (8 * sizeof(long)). To avoid overflow cast to u64 before arithmetic. Reported-by: Richard Weinberger <richard@nod.at> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21cpu/hotplug: Reset node state after operationThomas Gleixner
The recent rework of the cpu hotplug internals changed the usage of the per cpu state->node field, but missed to clean it up after usage. So subsequent hotplug operations use the stale pointer from a previous operation and hand it into the callback functions. The callbacks then dereference a pointer which either belongs to a different facility or points to freed and potentially reused memory. In either case data corruption and crashes are the obvious consequence. Reset the node and the last pointers in the per cpu state to NULL after the operation which set them has completed. Fixes: 96abb968549c ("smp/hotplug: Allow external multi-instance rollback") Reported-by: Tvrtko Ursulin <tursulin@ursulin.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710211606130.3213@nanos
2017-10-20waitid(): Avoid unbalanced user_access_end() on access_ok() errorKees Cook
As pointed out by Linus and David, the earlier waitid() fix resulted in a (currently harmless) unbalanced user_access_end() call. This fixes it to just directly return EFAULT on access_ok() failure. Fixes: 96ca579a1ecc ("waitid(): Add missing access_ok() checks") Acked-by: David Daney <david.daney@cavium.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-20bpf: require CAP_NET_ADMIN when using devmapJohn Fastabend
Devmap is used with XDP which requires CAP_NET_ADMIN so lets also make CAP_NET_ADMIN required to use the map. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20bpf: require CAP_NET_ADMIN when using sockmap mapsJohn Fastabend
Restrict sockmap to CAP_NET_ADMIN. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20bpf: avoid preempt enable/disable in sockmap using tcp_skb_cb regionJohn Fastabend
SK_SKB BPF programs are run from the socket/tcp context but early in the stack before much of the TCP metadata is needed in tcp_skb_cb. So we can use some unused fields to place BPF metadata needed for SK_SKB programs when implementing the redirect function. This allows us to drop the preempt disable logic. It does however require an API change so sk_redirect_map() has been updated to additionally provide ctx_ptr to skb. Note, we do however continue to disable/enable preemption around actual BPF program running to account for map updates. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20bpf: enforce TCP only support for sockmapJohn Fastabend
Only TCP sockets have been tested and at the moment the state change callback only handles TCP sockets. This adds a check to ensure that sockets actually being added are TCP sockets. For net-next we can consider UDP support. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19doc: Fix various RCU docbook comment-header problemsPaul E. McKenney
Because many of RCU's files have not been included into docbook, a number of errors have accumulated. This commit fixes them. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-19membarrier: Provide register expedited private commandMathieu Desnoyers
This introduces a "register private expedited" membarrier command which allows eventual removal of important memory barrier constraints on the scheduler fast-paths. It changes how the "private expedited" membarrier command (new to 4.14) is used from user-space. This new command allows processes to register their intent to use the private expedited command. This affects how the expedited private command introduced in 4.14-rc is meant to be used, and should be merged before 4.14 final. Processes are now required to register before using MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM. This fixes a problem that arose when designing requested extensions to sys_membarrier() to allow JITs to efficiently flush old code from instruction caches. Several potential algorithms are much less painful if the user register intent to use this functionality early on, for example, before the process spawns the second thread. Registering at this time removes the need to interrupt each and every thread in that process at the first expedited sys_membarrier() system call. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-19bpf: do not test for PCPU_MIN_UNIT_SIZE before percpu allocationsDaniel Borkmann
PCPU_MIN_UNIT_SIZE is an implementation detail of the percpu allocator. Given we support __GFP_NOWARN now, lets just let the allocation request fail naturally instead. The two call sites from BPF mistakenly assumed __GFP_NOWARN would work, so no changes needed to their actual __alloc_percpu_gfp() calls which use the flag already. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19bpf: fix splat for illegal devmap percpu allocationDaniel Borkmann
It was reported that syzkaller was able to trigger a splat on devmap percpu allocation due to illegal/unsupported allocation request size passed to __alloc_percpu(): [ 70.094249] illegal size (32776) or align (8) for percpu allocation [ 70.094256] ------------[ cut here ]------------ [ 70.094259] WARNING: CPU: 3 PID: 3451 at mm/percpu.c:1365 pcpu_alloc+0x96/0x630 [...] [ 70.094325] Call Trace: [ 70.094328] __alloc_percpu_gfp+0x12/0x20 [ 70.094330] dev_map_alloc+0x134/0x1e0 [ 70.094331] SyS_bpf+0x9bc/0x1610 [ 70.094333] ? selinux_task_setrlimit+0x5a/0x60 [ 70.094334] ? security_task_setrlimit+0x43/0x60 [ 70.094336] entry_SYSCALL_64_fastpath+0x1a/0xa5 This was due to too large max_entries for the map such that we surpassed the upper limit of PCPU_MIN_UNIT_SIZE. It's fine to fail naturally here, so switch to __alloc_percpu_gfp() and pass __GFP_NOWARN instead. Fixes: 11393cc9b9be ("xdp: Add batching support to redirect map") Reported-by: Mark Rutland <mark.rutland@arm.com> Reported-by: Shankara Pailoor <sp3485@columbia.edu> Reported-by: Richard Weinberger <richard@nod.at> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19locking/static_keys: Improve uninitialized key warningBorislav Petkov
Right now it says: static_key_disable_cpuslocked used before call to jump_label_init ------------[ cut here ]------------ WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:161 static_key_disable_cpuslocked+0x68/0x70 Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.0-rc5+ #1 Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016 task: ffffffff81c0e480 task.stack: ffffffff81c00000 RIP: 0010:static_key_disable_cpuslocked+0x68/0x70 RSP: 0000:ffffffff81c03ef0 EFLAGS: 00010096 ORIG_RAX: 0000000000000000 RAX: 0000000000000041 RBX: ffffffff81c32680 RCX: ffffffff81c5cbf8 RDX: 0000000000000001 RSI: 0000000000000092 RDI: 0000000000000002 RBP: ffff88807fffd240 R08: 726f666562206465 R09: 0000000000000136 R10: 0000000000000000 R11: 696e695f6c656261 R12: ffffffff82158900 R13: ffffffff8215f760 R14: 0000000000000001 R15: 0000000000000008 FS: 0000000000000000(0000) GS:ffff883f7f400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88807ffff000 CR3: 0000000001c09000 CR4: 00000000000606b0 Call Trace: static_key_disable+0x16/0x20 start_kernel+0x15a/0x45d ? load_ucode_intel_bsp+0x11/0x2d secondary_startup_64+0xa5/0xb0 Code: 48 c7 c7 a0 15 cf 81 e9 47 53 4b 00 48 89 df e8 5f fc ff ff eb e8 48 c7 c6 \ c0 97 83 81 48 c7 c7 d0 ff a2 81 31 c0 e8 c5 9d f5 ff <0f> ff eb a7 0f ff eb \ b0 e8 eb a2 4b 00 53 48 89 fb e8 42 0e f0 but it doesn't tell me which key it is. So dump the key's name too: static_key_disable_cpuslocked(): static key 'virt_spin_lock_key' used before call to jump_label_init() And that makes pinpointing which key is causing that a lot easier. include/linux/jump_label.h | 14 +++++++------- include/linux/jump_label_ratelimit.h | 6 +++--- kernel/jump_label.c | 14 +++++++------- 3 files changed, 17 insertions(+), 17 deletions(-) Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171018152428.ffjgak4o25f7ept6@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-18bpf: disallow arithmetic operations on context pointerJakub Kicinski
Commit f1174f77b50c ("bpf/verifier: rework value tracking") removed the crafty selection of which pointer types are allowed to be modified. This is OK for most pointer types since adjust_ptr_min_max_vals() will catch operations on immutable pointers. One exception is PTR_TO_CTX which is now allowed to be offseted freely. The intent of aforementioned commit was to allow context access via modified registers. The offset passed to ->is_valid_access() verifier callback has been adjusted by the value of the variable offset. What is missing, however, is taking the variable offset into account when the context register is used. Or in terms of the code adding the offset to the value passed to the ->convert_ctx_access() callback. This leads to the following eBPF user code: r1 += 68 r0 = *(u32 *)(r1 + 8) exit being translated to this in kernel space: 0: (07) r1 += 68 1: (61) r0 = *(u32 *)(r1 +180) 2: (95) exit Offset 8 is corresponding to 180 in the kernel, but offset 76 is valid too. Verifier will "accept" access to offset 68+8=76 but then "convert" access to offset 8 as 180. Effective access to offset 248 is beyond the kernel context. (This is a __sk_buff example on a debug-heavy kernel - packet mark is 8 -> 180, 76 would be data.) Dereferencing the modified context pointer is not as easy as dereferencing other types, because we have to translate the access to reading a field in kernel structures which is usually at a different offset and often of a different size. To allow modifying the pointer we would have to make sure that given eBPF instruction will always access the same field or the fields accessed are "compatible" in terms of offset and size... Disallow dereferencing modified context pointers and add to selftests the test case described here. Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16Merge tag 'irqchip-4.14-3' of ↵Thomas Gleixner
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent Pull irqchip updates for 4.14-rc5 from Marc Zyngier: - Fix unfortunate mistake in the GICv3 ITS binding example - Two fixes for the recently merged GICv4 support - GICv3 ITS 52bit PA fixes - Generic irqchip mask-ack fix, and its application to the tango irqchip
2017-10-14Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Three fixes that address an SMP balancing performance regression" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Ensure load_balance() respects the active_mask sched/core: Address more wake_affine() regressions sched/core: Fix wake_affine() performance regression
2017-10-14Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Some tooling fixes plus three kernel fixes: a memory leak fix, a statistics fix and a crash fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Fix memory leaks on allocation failures perf/core: Fix cgroup time when scheduling descendants perf/core: Avoid freeing static PMU contexts when PMU is unregistered tools include uapi bpf.h: Sync kernel ABI header with tooling header perf pmu: Unbreak perf record for arm/arm64 with events with explicit PMU perf script: Add missing separator for "-F ip,brstack" (and brstackoff) perf callchain: Compare dsos (as well) for CCKEY_FUNCTION
2017-10-14Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "Two lockdep fixes for bugs introduced by the cross-release dependency tracking feature - plus a commit that disables it because performance regressed in an absymal fashion on some systems" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/lockdep: Disable cross-release features for now locking/selftest: Avoid false BUG report locking/lockdep: Fix stacktrace mess
2017-10-14Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "A CPU hotplug related fix, plus two related sanity checks" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/cpuhotplug: Enforce affinity setting on startup of managed irqs genirq/cpuhotplug: Add sanity check for effective affinity mask genirq: Warn when effective affinity is not updated
2017-10-13kmemleak: clear stale pointers from task stacksKonstantin Khlebnikov
Kmemleak considers any pointers on task stacks as references. This patch clears newly allocated and reused vmap stacks. Link: http://lkml.kernel.org/r/150728990124.744199.8403409836394318684.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-13genirq: generic chip: remove irq_gc_mask_disable_reg_and_ack()Doug Berger
Any usage of the irq_gc_mask_disable_reg_and_ack() function has been replaced with the desired functionality. The incorrect and ambiguously named function is removed here to prevent accidental misuse. Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2017-10-13genirq: generic chip: Add irq_gc_mask_disable_and_ack_set()Doug Berger
The irq_gc_mask_disable_reg_and_ack() function name implies that it provides the combined functions of irq_gc_mask_disable_reg() and irq_gc_ack(). However, the implementation does not actually do that since it writes the mask instead of the disable register. It also does not maintain the mask cache which makes it inappropriate to use with other masking functions. In addition, commit 659fb32d1b67 ("genirq: replace irq_gc_ack() with {set,clr}_bit variants (fwd)") effectively renamed irq_gc_ack() to irq_gc_ack_set_bit() so this function probably should have also been renamed at that time. The generic chip code currently provides three functions for use with the irq_mask member of the irq_chip structure and two functions for use with the irq_ack member of the irq_chip structure. These functions could be combined into six functions for use with the irq_mask_ack member of the irq_chip structure. However, since only one of the combinations is currently used, only the function irq_gc_mask_disable_and_ack_set() is added by this commit. The '_reg' and '_bit' portions of the base function name were left out of the new combined function name in an attempt to keep the function name length manageable with the 80 character source code line length while still allowing the distinct aspects of each combination to be captured by the name. If other combinations are desired in the future please add them to the irq generic chip library at that time. Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2017-10-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull livepatching fix from Jiri Kosina: - bugfix for handling of coming modules (incorrect handling of failure) from Joe Lawrence * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: unpatch all klp_objects if klp_module_coming fails
2017-10-12Merge branch 'waitid-fix'Linus Torvalds
Merge waitid() fix from Kees Cook. I'd have hoped that the unsafe_{get|put}_user() naming would have avoided these kinds of stupid bugs, but no such luck. * waitid-fix: waitid(): Add missing access_ok() checks
2017-10-11livepatch: unpatch all klp_objects if klp_module_coming failsJoe Lawrence
When an incoming module is considered for livepatching by klp_module_coming(), it iterates over multiple patches and multiple kernel objects in this order: list_for_each_entry(patch, &klp_patches, list) { klp_for_each_object(patch, obj) { which means that if one of the kernel objects fails to patch, klp_module_coming()'s error path needs to unpatch and cleanup any kernel objects that were already patched by a previous patch. Reported-by: Miroslav Benes <mbenes@suse.cz> Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-10-10Revert "net: defer call to cgroup_sk_alloc()"Eric Dumazet
This reverts commit fbb1fb4ad415cb31ce944f65a5ca700aaf73a227. This was not the proper fix, lets cleanly revert it, so that following patch can be carried to stable versions. sock_cgroup_ptr() callers do not expect a NULL return value. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10Merge tag 'seccomp-v4.14-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp fixlet from Kees Cook: "Minor seccomp fix for v4.14-rc5. I debated sending this at all for v4.14, but since it fixes a minor issue in the prior fix, which also went to -stable, it seemed better to just get all of it cleaned up right now. - fix missed "static" to avoid Sparse warning (Colin King)" * tag 'seccomp-v4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: make function __get_seccomp_filter static
2017-10-10seccomp: make function __get_seccomp_filter staticColin Ian King
The function __get_seccomp_filter is local to the source and does not need to be in global scope, so make it static. Cleans up sparse warning: symbol '__get_seccomp_filter' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.king@canonical.com> Fixes: 66a733ea6b61 ("seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter()") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2017-10-10locking/core: Remove {read,spin,write}_can_lock()Will Deacon
Outside of the locking code itself, {read,spin,write}_can_lock() have no users in tree. Apparmor (the last remaining user of write_can_lock()) got moved over to lockdep by the previous patch. This patch removes the use of {read,spin,write}_can_lock() from the BUILD_LOCK_OPS macro, deferring to the trylock operation for testing the lock status, and subsequently removes the unused macros altogether. They aren't guaranteed to work in a concurrent environment and can give incorrect results in the case of qrwlock. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1507055129-12300-2-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/rwsem: Add down_read_killable()Kirill Tkhai
Similar to down_read() and down_write_killable(), add killable version of down_read(), based on __down_read_killable() function, added in previous patches. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rientjes@google.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Cc: viro@zeniv.linux.org.uk Link: http://lkml.kernel.org/r/150670119884.23930.2585570605960763239.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10sched/core: Ensure load_balance() respects the active_maskPeter Zijlstra
While load_balance() masks the source CPUs against active_mask, it had a hole against the destination CPU. Ensure the destination CPU is also part of the 'domain-mask & active-mask' set. Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10sched/core: Address more wake_affine() regressionsPeter Zijlstra
The trivial wake_affine_idle() implementation is very good for a number of workloads, but it comes apart at the moment there are no idle CPUs left, IOW. the overloaded case. hackbench: NO_WA_WEIGHT WA_WEIGHT hackbench-20 : 7.362717561 seconds 6.450509391 seconds (win) netperf: NO_WA_WEIGHT WA_WEIGHT TCP_SENDFILE-1 : Avg: 54524.6 Avg: 52224.3 TCP_SENDFILE-10 : Avg: 48185.2 Avg: 46504.3 TCP_SENDFILE-20 : Avg: 29031.2 Avg: 28610.3 TCP_SENDFILE-40 : Avg: 9819.72 Avg: 9253.12 TCP_SENDFILE-80 : Avg: 5355.3 Avg: 4687.4 TCP_STREAM-1 : Avg: 41448.3 Avg: 42254 TCP_STREAM-10 : Avg: 24123.2 Avg: 25847.9 TCP_STREAM-20 : Avg: 15834.5 Avg: 18374.4 TCP_STREAM-40 : Avg: 5583.91 Avg: 5599.57 TCP_STREAM-80 : Avg: 2329.66 Avg: 2726.41 TCP_RR-1 : Avg: 80473.5 Avg: 82638.8 TCP_RR-10 : Avg: 72660.5 Avg: 73265.1 TCP_RR-20 : Avg: 52607.1 Avg: 52634.5 TCP_RR-40 : Avg: 57199.2 Avg: 56302.3 TCP_RR-80 : Avg: 25330.3 Avg: 26867.9 UDP_RR-1 : Avg: 108266 Avg: 107844 UDP_RR-10 : Avg: 95480 Avg: 95245.2 UDP_RR-20 : Avg: 68770.8 Avg: 68673.7 UDP_RR-40 : Avg: 76231 Avg: 75419.1 UDP_RR-80 : Avg: 34578.3 Avg: 35639.1 UDP_STREAM-1 : Avg: 64684.3 Avg: 66606 UDP_STREAM-10 : Avg: 52701.2 Avg: 52959.5 UDP_STREAM-20 : Avg: 30376.4 Avg: 29704 UDP_STREAM-40 : Avg: 15685.8 Avg: 15266.5 UDP_STREAM-80 : Avg: 8415.13 Avg: 7388.97 (wins and losses) sysbench: NO_WA_WEIGHT WA_WEIGHT sysbench-mysql-2 : 2135.17 per sec. 2142.51 per sec. sysbench-mysql-5 : 4809.68 per sec. 4800.19 per sec. sysbench-mysql-10 : 9158.59 per sec. 9157.05 per sec. sysbench-mysql-20 : 14570.70 per sec. 14543.55 per sec. sysbench-mysql-40 : 22130.56 per sec. 22184.82 per sec. sysbench-mysql-80 : 20995.56 per sec. 21904.18 per sec. sysbench-psql-2 : 1679.58 per sec. 1705.06 per sec. sysbench-psql-5 : 3797.69 per sec. 3879.93 per sec. sysbench-psql-10 : 7253.22 per sec. 7258.06 per sec. sysbench-psql-20 : 11166.75 per sec. 11220.00 per sec. sysbench-psql-40 : 17277.28 per sec. 17359.78 per sec. sysbench-psql-80 : 17112.44 per sec. 17221.16 per sec. (increase on the top end) tbench: NO_WA_WEIGHT Throughput 685.211 MB/sec 2 clients 2 procs max_latency=0.123 ms Throughput 1596.64 MB/sec 5 clients 5 procs max_latency=0.119 ms Throughput 2985.47 MB/sec 10 clients 10 procs max_latency=0.262 ms Throughput 4521.15 MB/sec 20 clients 20 procs max_latency=0.506 ms Throughput 9438.1 MB/sec 40 clients 40 procs max_latency=2.052 ms Throughput 8210.5 MB/sec 80 clients 80 procs max_latency=8.310 ms WA_WEIGHT Throughput 697.292 MB/sec 2 clients 2 procs max_latency=0.127 ms Throughput 1596.48 MB/sec 5 clients 5 procs max_latency=0.080 ms Throughput 2975.22 MB/sec 10 clients 10 procs max_latency=0.254 ms Throughput 4575.14 MB/sec 20 clients 20 procs max_latency=0.502 ms Throughput 9468.65 MB/sec 40 clients 40 procs max_latency=2.069 ms Throughput 8631.73 MB/sec 80 clients 80 procs max_latency=8.605 ms (increase on the top end) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rik van Riel <riel@redhat.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10sched/core: Fix wake_affine() performance regressionPeter Zijlstra
Eric reported a sysbench regression against commit: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()") Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed against his v3.10 enterprise kernel. PRE (current tip/master): ivb-ep sysbench: 2: [30 secs] transactions: 64110 (2136.94 per sec.) 5: [30 secs] transactions: 143644 (4787.99 per sec.) 10: [30 secs] transactions: 274298 (9142.93 per sec.) 20: [30 secs] transactions: 418683 (13955.45 per sec.) 40: [30 secs] transactions: 320731 (10690.15 per sec.) 80: [30 secs] transactions: 355096 (11834.28 per sec.) hsw-ex NAS: OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01 OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89 OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93 lu.C.x_threads_144_run_1.log: Time in seconds = 434.68 lu.C.x_threads_144_run_2.log: Time in seconds = 405.36 lu.C.x_threads_144_run_3.log: Time in seconds = 433.83 POST (+patch): ivb-ep sysbench: 2: [30 secs] transactions: 64494 (2149.75 per sec.) 5: [30 secs] transactions: 145114 (4836.99 per sec.) 10: [30 secs] transactions: 278311 (9276.69 per sec.) 20: [30 secs] transactions: 437169 (14571.60 per sec.) 40: [30 secs] transactions: 669837 (22326.73 per sec.) 80: [30 secs] transactions: 631739 (21055.88 per sec.) hsw-ex NAS: lu.C.x_threads_144_run_1.log: Time in seconds = 23.36 lu.C.x_threads_144_run_2.log: Time in seconds = 22.96 lu.C.x_threads_144_run_3.log: Time in seconds = 22.52 This patch takes out all the shiny wake_affine() stuff and goes back to utter basics. Between the two CPUs involved with the wakeup (the CPU doing the wakeup and the CPU we ran on previously) pick the CPU we can run on _now_. This restores much of the regressions against the older kernels, but leaves some ground in the overloaded case. The default-enabled WA_WEIGHT (which will be introduced in the next patch) is an attempt to address the overloaded situation. Reported-by: Eric Farman <farman@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: jinpuwang@gmail.com Cc: vcaputo@pengaru.com Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10perf/core: Fix cgroup time when scheduling descendantsleilei.lin
Update cgroup time when an event is scheduled in by descendants. Reviewed-and-tested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: alexander.shishkin@linux.intel.com Cc: brendan.d.gregg@gmail.com Cc: yang_oliver@hotmail.com Link: http://lkml.kernel.org/r/CALPjY3mkHiekRkRECzMi9G-bjUQOvOjVBAqxmWkTzc-g+0LwMg@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10perf/core: Avoid freeing static PMU contexts when PMU is unregisteredWill Deacon
Since commit: 1fd7e4169954 ("perf/core: Remove perf_cpu_context::unique_pmu") ... when a PMU is unregistered then its associated ->pmu_cpu_context is unconditionally freed. Whilst this is fine for dynamically allocated context types (i.e. those registered using perf_invalid_context), this causes a problem for sharing of static contexts such as perf_{sw,hw}_context, which are used by multiple built-in PMUs and effectively have a global lifetime. Whilst testing the ARM SPE driver, which must use perf_sw_context to support per-task AUX tracing, unregistering the driver as a result of a module unload resulted in: Unable to handle kernel NULL pointer dereference at virtual address 00000038 Internal error: Oops: 96000004 [#1] PREEMPT SMP Modules linked in: [last unloaded: arm_spe_pmu] PC is at ctx_resched+0x38/0xe8 LR is at perf_event_exec+0x20c/0x278 [...] ctx_resched+0x38/0xe8 perf_event_exec+0x20c/0x278 setup_new_exec+0x88/0x118 load_elf_binary+0x26c/0x109c search_binary_handler+0x90/0x298 do_execveat_common.isra.14+0x540/0x618 SyS_execve+0x38/0x48 since the software context has been freed and the ctx.pmu->pmu_disable_count field has been set to NULL. This patch fixes the problem by avoiding the freeing of static PMU contexts altogether. Whilst the sharing of dynamic contexts is questionable, this actually requires the caller to share their context pointer explicitly and so the burden is on them to manage the object lifetime. Reported-by: Kim Phillips <kim.phillips@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 1fd7e4169954 ("perf/core: Remove perf_cpu_context::unique_pmu") Link: http://lkml.kernel.org/r/1507040450-7730-1-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/lockdep: Fix stacktrace messPeter Zijlstra
There is some complication between check_prevs_add() and check_prev_add() wrt. saving stack traces. The problem is that we want to be frugal with saving stack traces, since it consumes static resources. We'll only know in check_prev_add() if we need the trace, but we can call into it multiple times. So we want to do on-demand and re-use. A further complication is that check_prev_add() can drop graph_lock and mess with our static resources. In any case, the current state; after commit: ce07a9415f26 ("locking/lockdep: Make check_prev_add() able to handle external stack_trace") is that we'll assume the trace contains valid data once check_prev_add() returns '2'. However, as noted by Josh, this is false, check_prev_add() can return '2' before having saved a trace, this then result in the possibility of using uninitialized data. Testing, as reported by Wu, shows a NULL deref. So simplify. Since the graph_lock() thing is a debug path that hasn't really been used in a long while, take it out back and avoid the head-ache. Further initialize the stack_trace to a known 'empty' state; as long as nr_entries == 0, nothing should deref entries. We can then use the 'entries == NULL' test for a valid trace / on-demand saving. Analyzed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: ce07a9415f26 ("locking/lockdep: Make check_prev_add() able to handle external stack_trace") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-09net: defer call to cgroup_sk_alloc()Eric Dumazet
sk_clone_lock() might run while TCP/DCCP listener already vanished. In order to prevent use after free, it is better to defer cgroup_sk_alloc() to the point we know both parent and child exist, and from process context. Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09waitid(): Add missing access_ok() checksKees Cook
Adds missing access_ok() checks. CVE-2017-5123 Reported-by: Chris Salls <chrissalls5@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: 4c48abe91be0 ("waitid(): switch copyout of siginfo to unsafe_put_user()") Cc: stable@kernel.org # 4.13 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix object leak on IPSEC offload failure, from Steffen Klassert. 2) Fix range checks in ipset address range addition operations, from Jozsef Kadlecsik. 3) Fix pernet ops unregistration order in ipset, from Florian Westphal. 4) Add missing netlink attribute policy for nl80211 packet pattern attrs, from Peng Xu. 5) Fix PPP device destruction race, from Guillaume Nault. 6) Write marks get lost when BPF verifier processes R1=R2 register assignments, causing incorrect liveness information and less state pruning. Fix from Alexei Starovoitov. 7) Fix blockhole routes so that they are marked dead and therefore not cached in sockets, otherwise IPSEC stops working. From Steffen Klassert. 8) Fix broadcast handling of UDP socket early demux, from Paolo Abeni. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (37 commits) cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan net: thunderx: mark expected switch fall-throughs in nicvf_main() udp: fix bcast packet reception netlink: do not set cb_running if dump's start() errs ipv4: Fix traffic triggered IPsec connections. ipv6: Fix traffic triggered IPsec connections. ixgbe: incorrect XDP ring accounting in ethtool tx_frame param net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag Revert commit 1a8b6d76dc5b ("net:add one common config...") ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register ixgbe: Return error when getting PHY address if PHY access is not supported netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' netfilter: SYNPROXY: skip non-tcp packet in {ipv4, ipv6}_synproxy_hook tipc: Unclone message at secondary destination lookup tipc: correct initialization of skb list gso: fix payload length when gso_size is zero mlxsw: spectrum_router: Avoid expensive lookup during route removal bpf: fix liveness marking doc: Fix typo "8023.ad" in bonding documentation ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real ...
2017-10-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree, they are: 1) Fix packet drops due to incorrect ECN handling in IPVS, from Vadim Fedorenko. 2) Fix splat with mark restoration in xt_socket with non-full-sock, patch from Subash Abhinov Kasiviswanathan. 3) ipset bogusly bails out when adding IPv4 range containing more than 2^31 addresses, from Jozsef Kadlecsik. 4) Incorrect pernet unregistration order in ipset, from Florian Westphal. 5) Races between dump and swap in ipset results in BUG_ON splats, from Ross Lagerwall. 6) Fix chain renames in nf_tables, from JingPiao Chen. 7) Fix race in pernet codepath with ebtables table registration, from Artem Savkov. 8) Memory leak in error path in set name allocation in nf_tables, patch from Arvind Yadav. 9) Don't dump chain counters if they are not available, this fixes a crash when listing the ruleset. 10) Fix out of bound memory read in strlcpy() in x_tables compat code, from Eric Dumazet. 11) Make sure we only process TCP packets in SYNPROXY hooks, patch from Lin Zhang. 12) Cannot load rules incrementally anymore after xt_bpf with pinned objects, added in revision 1. From Shmulik Ladkani. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'Shmulik Ladkani
Commit 2c16d6033264 ("netfilter: xt_bpf: support ebpf") introduced support for attaching an eBPF object by an fd, with the 'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each IPT_SO_SET_REPLACE call. However this breaks subsequent iptables calls: # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT # iptables -A INPUT -s 5.6.7.8 -j ACCEPT iptables: Invalid argument. Run `dmesg' for more information. That's because iptables works by loading existing rules using IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with the replacement set. However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number (from the initial "iptables -m bpf" invocation) - so when 2nd invocation occurs, userspace passes a bogus fd number, which leads to 'bpf_mt_check_v1' to fail. One suggested solution [1] was to hack iptables userspace, to perform a "entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new, process-local fd per every 'xt_bpf_info_v1' entry seen. However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects. This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given '.fd' and instead perform an in-kernel lookup for the bpf object given the provided '.path'. It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is expected to provide the path of the pinned object. Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved. References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2 [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2 Reported-by: Rafael Buchbinder <rafi@rbk.ms> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>