summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-11-10kthread: zero the kthread data structureShaohua Li
kthread() could bail out early before we initialize blkcg_css (if the kthread is killed very early. Please see xchg() statement in kthread()), which confuses free_kthread_struct. Instead of moving the blkcg_css initialization early, we simply zero the whole 'self' data structure, which doesn't sound much overhead. Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 05e3db95ebfc ("kthread: add a mechanism to store cgroup info") Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dmitry Vyukov <dvyukov@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10blktrace: fix unlocked registration of tracepointsJens Axboe
We need to ensure that tracepoints are registered and unregistered with the users of them. The existing atomic count isn't enough for that. Add a lock around the tracepoints, so we serialize access to them. This fixes cases where we have multiple users setting up and tearing down tracepoints, like this: CPU: 0 PID: 2995 Comm: syzkaller857118 Not tainted 4.14.0-rc5-next-20171018+ #36 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 panic+0x1e4/0x41c kernel/panic.c:183 __warn+0x1c4/0x1e0 kernel/panic.c:546 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177 do_trap_no_signal arch/x86/kernel/traps.c:211 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:260 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310 invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 RIP: 0010:tracepoint_add_func kernel/tracepoint.c:210 [inline] RIP: 0010:tracepoint_probe_register_prio+0x397/0x9a0 kernel/tracepoint.c:283 RSP: 0018:ffff8801d1d1f6c0 EFLAGS: 00010293 RAX: ffff8801d22e8540 RBX: 00000000ffffffef RCX: ffffffff81710f07 RDX: 0000000000000000 RSI: ffffffff85b679c0 RDI: ffff8801d5f19818 RBP: ffff8801d1d1f7c8 R08: ffffffff81710c10 R09: 0000000000000004 R10: ffff8801d1d1f6b0 R11: 0000000000000003 R12: ffffffff817597f0 R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8801d1d1f7a0 tracepoint_probe_register+0x2a/0x40 kernel/tracepoint.c:304 register_trace_block_rq_insert include/trace/events/block.h:191 [inline] blk_register_tracepoints+0x1e/0x2f0 kernel/trace/blktrace.c:1043 do_blk_trace_setup+0xa10/0xcf0 kernel/trace/blktrace.c:542 blk_trace_setup+0xbd/0x180 kernel/trace/blktrace.c:564 sg_ioctl+0xc71/0x2d90 drivers/scsi/sg.c:1089 vfs_ioctl fs/ioctl.c:45 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685 SYSC_ioctl fs/ioctl.c:700 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x444339 RSP: 002b:00007ffe05bb5b18 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00000000006d66c0 RCX: 0000000000444339 RDX: 000000002084cf90 RSI: 00000000c0481273 RDI: 0000000000000009 RBP: 0000000000000082 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: ffffffffffffffff R13: 00000000c0481273 R14: 0000000000000000 R15: 0000000000000000 since we can now run these in parallel. Ensure that the exported helpers for doing this are grabbing the queue trace mutex. Reported-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10blktrace: fix unlocked access to init/start-stop/teardownJens Axboe
sg.c calls into the blktrace functions without holding the proper queue mutex for doing setup, start/stop, or teardown. Add internal unlocked variants, and export the ones that do the proper locking. Fixes: 6da127ad0918 ("blktrace: Add blktrace ioctls to SCSI generic devices") Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04sysctl: remove /proc/sys/vm/nr_pdflush_threadsJens Axboe
This tunable has been obsolete since 2.6.32, and writes to the file have been failing and complaining in dmesg since then: nr_pdflush_threads exported in /proc is scheduled for removal That was 8 years ago. Remove the file ABI obsolete notice, and the sysfs file. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-26block: fix a build errorShaohua Li
The code is only for blkcg not for all cgroups Fixes: d4478e92d618 ("block/loop: make loop cgroup aware") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-26kthread: add a mechanism to store cgroup infoShaohua Li
kthread usually runs jobs on behalf of other threads. The jobs should be charged to cgroup of original threads. But the jobs run in a kthread, where we lose the cgroup context of original threads. The patch adds a machanism to record cgroup info of original threads in kthread context. Later we can retrieve the cgroup info and attach the cgroup info to jobs. Since this mechanism is only required by kthread, we store the cgroup info in kthread data instead of generic task_struct. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Two sets of NVMe pull requests from Christoph: - Fixes for the Fibre Channel host/target to fix spec compliance - Allow a zero keep alive timeout - Make the debug printk for broken SGLs work better - Fix queue zeroing during initialization - Set of RDMA and FC fixes - Target div-by-zero fix - bsg double-free fix. - ndb unknown ioctl fix from Josef. - Buffered vs O_DIRECT page cache inconsistency fix. Has been floating around for a long time, well reviewed. From Lukas. - brd overflow fix from Mikulas. - Fix for a loop regression in this merge window, where using a union for two members of the loop_cmd turned out to be a really bad idea. From Omar. - Fix for an iostat regression fix in this series, using the wrong API to get at the block queue. From Shaohua. - Fix for a potential blktrace delection deadlock. From Waiman. * 'for-linus' of git://git.kernel.dk/linux-block: (30 commits) nvme-fcloop: fix port deletes and callbacks nvmet-fc: sync header templates with comments nvmet-fc: ensure target queue id within range. nvmet-fc: on port remove call put outside lock nvme-rdma: don't fully stop the controller in error recovery nvme-rdma: give up reconnect if state change fails nvme-core: Use nvme_wq to queue async events and fw activation nvme: fix sqhd reference when admin queue connect fails block: fix a crash caused by wrong API fs: Fix page cache inconsistency when mixing buffered and AIO DIO nvmet: implement valid sqhd values in completions nvme-fabrics: Allow 0 as KATO value nvme: allow timed-out ios to retry nvme: stop aer posting if controller state not live nvme-pci: Print invalid SGL only once nvme-pci: initialize queue memory before interrupts nvmet-fc: fix failing max io queue connections nvme-fc: use transport-specific sgl format nvme: add transport SGL definitions nvme.h: remove FC transport-specific error values ...
2017-09-25Merge tag 'trace-v4.14-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Stack tracing and RCU has been having issues with each other and lockdep has been pointing out constant problems. The changes have been going into the stack tracer, but it has been discovered that the problem isn't with the stack tracer itself, but it is with calling save_stack_trace() from within the internals of RCU. The stack tracer is the one that can trigger the issue the easiest, but examining the problem further, it could also happen from a WARN() in the wrong place, or even if an NMI happened in this area and it did an rcu_read_lock(). The critical area is where RCU is not watching. Which can happen while going to and from idle, or bringing up or taking down a CPU. The final fix was to put the protection in kernel_text_address() as it is the one that requires RCU to be watching while doing the stack trace. To make this work properly, Paul had to allow rcu_irq_enter() happen after rcu_nmi_enter(). This should have been done anyway, since an NMI can page fault (reading vmalloc area), and a page fault triggers rcu_irq_enter(). One patch is just a consolidation of code so that the fix only needed to be done in one location" * tag 'trace-v4.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Remove RCU work arounds from stack tracer extable: Enable RCU if it is not watching in kernel_text_address() extable: Consolidate *kernel_text_address() functions rcu: Allow for page faults in NMI handlers
2017-09-25blktrace: Fix potential deadlock between delete & sysfs opsWaiman Long
The lockdep code had reported the following unsafe locking scenario: CPU0 CPU1 ---- ---- lock(s_active#228); lock(&bdev->bd_mutex/1); lock(s_active#228); lock(&bdev->bd_mutex); *** DEADLOCK *** The deadlock may happen when one task (CPU1) is trying to delete a partition in a block device and another task (CPU0) is accessing tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that partition. The s_active isn't an actual lock. It is a reference count (kn->count) on the sysfs (kernfs) file. Removal of a sysfs file, however, require a wait until all the references are gone. The reference count is treated like a rwsem using lockdep instrumentation code. The fact that a thread is in the sysfs callback method or in the ioctl call means there is a reference to the opended sysfs or device file. That should prevent the underlying block structure from being removed. Instead of using bd_mutex in the block_device structure, a new blk_trace_mutex is now added to the request_queue structure to protect access to the blk_trace structure. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Fix typo in patch subject line, and prune a comment detailing how the code used to work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-23tracing: Remove RCU work arounds from stack tracerSteven Rostedt (VMware)
Currently the stack tracer calls rcu_irq_enter() to make sure RCU is watching when it records a stack trace. But if the stack tracer is triggered while tracing inside of a rcu_irq_enter(), calling rcu_irq_enter() unconditionally can be problematic. The reason for having rcu_irq_enter() in the first place has been fixed from within the saving of the stack trace code, and there's no reason for doing it in the stack tracer itself. Just remove it. Cc: stable@vger.kernel.org Fixes: 0be964be0 ("module: Sanitize RCU usage and locking") Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-09-23extable: Enable RCU if it is not watching in kernel_text_address()Steven Rostedt (VMware)
If kernel_text_address() is called when RCU is not watching, it can cause an RCU bug because is_module_text_address(), the is_kprobe_*insn_slot() and is_bpf_text_address() functions require the use of RCU. Only enable RCU if it is not currently watching before it calls is_module_text_address(). The use of rcu_nmi_enter() is used to enable RCU because kernel_text_address() can happen pretty much anywhere (like an NMI), and even from within an NMI. It is called via save_stack_trace() that can be called by any WARN() or tracing function, which can happen while RCU is not watching (for example, going to or coming from idle, or during CPU take down or bring up). Cc: stable@vger.kernel.org Fixes: 0be964be0 ("module: Sanitize RCU usage and locking") Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-09-23extable: Consolidate *kernel_text_address() functionsSteven Rostedt (VMware)
The functionality between kernel_text_address() and _kernel_text_address() is the same except that _kernel_text_address() does a little more (that function needs a rename, but that can be done another time). Instead of having duplicate code in both, simply have _kernel_text_address() calls kernel_text_address() instead. This is marked for stable because there's an RCU bug that can happen if one of these functions gets called while RCU is not watching. That fix depends on this fix to keep from having to write the fix twice. Cc: stable@vger.kernel.org Fixes: 0be964be0 ("module: Sanitize RCU usage and locking") Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-09-23rcu: Allow for page faults in NMI handlersPaul E. McKenney
A number of architecture invoke rcu_irq_enter() on exception entry in order to allow RCU read-side critical sections in the exception handler when the exception is from an idle or nohz_full CPU. This works, at least unless the exception happens in an NMI handler. In that case, rcu_nmi_enter() would already have exited the extended quiescent state, which would mean that rcu_irq_enter() would (incorrectly) cause RCU to think that it is again in an extended quiescent state. This will in turn result in lockdep splats in response to later RCU read-side critical sections. This commit therefore causes rcu_irq_enter() and rcu_irq_exit() to take no action if there is an rcu_nmi_enter() in effect, thus avoiding the unscheduled return to RCU quiescent state. This in turn should make the kernel safe for on-demand RCU voyeurism. Link: http://lkml.kernel.org/r/20170922211022.GA18084@linux.vnet.ibm.com Cc: stable@vger.kernel.org Fixes: 0be964be0 ("module: Sanitize RCU usage and locking") Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-09-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix NAPI poll list corruption in enic driver, from Christian Lamparter. 2) Fix route use after free, from Eric Dumazet. 3) Fix regression in reuseaddr handling, from Josef Bacik. 4) Assert the size of control messages in compat handling since we copy it in from userspace twice. From Meng Xu. 5) SMC layer bug fixes (missing RCU locking, bad refcounting, etc.) from Ursula Braun. 6) Fix races in AF_PACKET fanout handling, from Willem de Bruijn. 7) Don't use ARRAY_SIZE on spinlock array which might have zero entries, from Geert Uytterhoeven. 8) Fix miscomputation of checksum in ipv6 udp code, from Subash Abhinov Kasiviswanathan. 9) Push the ipv6 header properly in ipv6 GRE tunnel driver, from Xin Long. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (75 commits) inet: fix improper empty comparison net: use inet6_rcv_saddr to compare sockets net: set tb->fast_sk_family net: orphan frags on stand-alone ptype in dev_queue_xmit_nit MAINTAINERS: update git tree locations for ieee802154 subsystem net: prevent dst uses after free net: phy: Fix truncation of large IRQ numbers in phy_attached_print() net/smc: no close wait in case of process shut down net/smc: introduce a delay net/smc: terminate link group if out-of-sync is received net/smc: longer delay for client link group removal net/smc: adapt send request completion notification net/smc: adjust net_device refcount net/smc: take RCU read lock for routing cache lookup net/smc: add receive timeout check net/smc: add missing dev_put net: stmmac: Cocci spatch "of_table" lan78xx: Use default values loaded from EEPROM/OTP after reset lan78xx: Allow EEPROM write for less than MAX_EEPROM_SIZE lan78xx: Fix for eeprom read/write when device auto suspend ...
2017-09-22Merge tag 'seccomp-v4.14-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: "Major additions: - sysctl and seccomp operation to discover available actions (tyhicks) - new per-filter configurable logging infrastructure and sysctl (tyhicks) - SECCOMP_RET_LOG to log allowed syscalls (tyhicks) - SECCOMP_RET_KILL_PROCESS as the new strictest possible action - self-tests for new behaviors" [ This is the seccomp part of the security pull request during the merge window that was nixed due to unrelated problems - Linus ] * tag 'seccomp-v4.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: samples: Unrename SECCOMP_RET_KILL selftests/seccomp: Test thread vs process killing seccomp: Implement SECCOMP_RET_KILL_PROCESS action seccomp: Introduce SECCOMP_RET_KILL_PROCESS seccomp: Rename SECCOMP_RET_KILL to SECCOMP_RET_KILL_THREAD seccomp: Action to log before allowing seccomp: Filter flag to log all actions except SECCOMP_RET_ALLOW seccomp: Selftest for detection of filter flag support seccomp: Sysctl to configure actions that are allowed to be logged seccomp: Operation for checking if an action is available seccomp: Sysctl to display available actions seccomp: Provide matching filter for introspection selftests/seccomp: Refactor RET_ERRNO tests selftests/seccomp: Add simple seccomp overhead benchmark selftests/seccomp: Add tests for basic ptrace actions
2017-09-20bpf: one perf event close won't free bpf program attached by another perf eventYonghong Song
This patch fixes a bug exhibited by the following scenario: 1. fd1 = perf_event_open with attr.config = ID1 2. attach bpf program prog1 to fd1 3. fd2 = perf_event_open with attr.config = ID1 <this will be successful> 4. user program closes fd2 and prog1 is detached from the tracepoint. 5. user program with fd1 does not work properly as tracepoint no output any more. The issue happens at step 4. Multiple perf_event_open can be called successfully, but only one bpf prog pointer in the tp_event. In the current logic, any fd release for the same tp_event will free the tp_event->prog. The fix is to free tp_event->prog only when the closing fd corresponds to the one which registered the program. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-20Merge tag 'trace-v4.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "This includes three minor fixes. - Have writing to trace file clear the irqsoff (and friends) tracer - trace_pipe behavior for instance buffers was different than top buffer - Show a message of why mmiotrace doesn't start from commandline" * tag 'trace-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix trace_pipe behavior for instance traces tracing: Ignore mmiotrace from kernel commandline tracing: Erase irqsoff trace with empty write
2017-09-19bpf: fix ri->map_owner pointer on bpf_prog_reallocDaniel Borkmann
Commit 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs") passed the pointer to the prog itself to be loaded into r4 prior on bpf_redirect_map() helper call, so that we can store the owner into ri->map_owner out of the helper. Issue with that is that the actual address of the prog is still subject to change when subsequent rewrites occur that require slow path in bpf_prog_realloc() to alloc more memory, e.g. from patching inlining helper functions or constant blinding. Thus, we really need to take prog->aux as the address we're holding, which also works with prog clones as they share the same aux object. Instead of then fetching aux->prog during runtime, which could potentially incur cache misses due to false sharing, we are going to just use aux for comparison on the map owner. This will also keep the patchlet of the same size, and later check in xdp_map_invalid() only accesses read-only aux pointer from the prog, it's also in the same cacheline already from prior access when calling bpf_func. Fixes: 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-19bpf: do not disable/enable BH in bpf_map_free_id()Eric Dumazet
syzkaller reported following splat [1] Since hard irq are disabled by the caller, bpf_map_free_id() should not try to enable/disable BH. Another solution would be to change htab_map_delete_elem() to defer the free_htab_elem() call after raw_spin_unlock_irqrestore(&b->lock, flags), but this might be not enough to cover other code paths. [1] WARNING: CPU: 1 PID: 8052 at kernel/softirq.c:161 __local_bh_enable_ip +0x1e/0x160 kernel/softirq.c:161 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 8052 Comm: syz-executor1 Not tainted 4.13.0-next-20170915+ #23 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 panic+0x1e4/0x417 kernel/panic.c:181 __warn+0x1c4/0x1d9 kernel/panic.c:542 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178 do_trap_no_signal arch/x86/kernel/traps.c:212 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:261 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311 invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 RIP: 0010:__local_bh_enable_ip+0x1e/0x160 kernel/softirq.c:161 RSP: 0018:ffff8801cdcd7748 EFLAGS: 00010046 RAX: 0000000000000082 RBX: 0000000000000201 RCX: 0000000000000000 RDX: 1ffffffff0b5933c RSI: 0000000000000201 RDI: ffffffff85ac99e0 RBP: ffff8801cdcd7758 R08: ffffffff85b87158 R09: 1ffff10039b9aec6 R10: ffff8801c99f24c0 R11: 0000000000000002 R12: ffffffff817b0b47 R13: dffffc0000000000 R14: ffff8801cdcd77e8 R15: 0000000000000001 __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:176 [inline] _raw_spin_unlock_bh+0x30/0x40 kernel/locking/spinlock.c:207 spin_unlock_bh include/linux/spinlock.h:361 [inline] bpf_map_free_id kernel/bpf/syscall.c:197 [inline] __bpf_map_put+0x267/0x320 kernel/bpf/syscall.c:227 bpf_map_put+0x1a/0x20 kernel/bpf/syscall.c:235 bpf_map_fd_put_ptr+0x15/0x20 kernel/bpf/map_in_map.c:96 free_htab_elem+0xc3/0x1b0 kernel/bpf/hashtab.c:658 htab_map_delete_elem+0x74d/0x970 kernel/bpf/hashtab.c:1063 map_delete_elem kernel/bpf/syscall.c:633 [inline] SYSC_bpf kernel/bpf/syscall.c:1479 [inline] SyS_bpf+0x2188/0x46a0 kernel/bpf/syscall.c:1451 entry_SYSCALL_64_fastpath+0x1f/0xbe Fixes: f3f1c054c288 ("bpf: Introduce bpf_map ID") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-19tracing: Fix trace_pipe behavior for instance tracesTahsin Erdogan
When reading data from trace_pipe, tracing_wait_pipe() performs a check to see if tracing has been turned off after some data was read. Currently, this check always looks at global trace state, but it should be checking the trace instance where trace_pipe is located at. Because of this bug, cat instances/i1/trace_pipe in the following script will immediately exit instead of waiting for data: cd /sys/kernel/debug/tracing echo 0 > tracing_on mkdir -p instances/i1 echo 1 > instances/i1/tracing_on echo 1 > instances/i1/events/sched/sched_process_exec/enable cat instances/i1/trace_pipe Link: http://lkml.kernel.org/r/20170917102348.1615-1-tahsin@google.com Cc: stable@vger.kernel.org Fixes: 10246fa35d4f ("tracing: give easy way to clear trace buffer") Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-09-19tracing: Ignore mmiotrace from kernel commandlineZiqian SUN (Zamir)
The mmiotrace tracer cannot be enabled with ftrace=mmiotrace in kernel commandline. With this patch, noboot is added to the tracer struct, and when system boot with a tracer that has noboot=true, it will print out a warning message and continue booting. Link: http://lkml.kernel.org/r/1505111195-31942-1-git-send-email-zsun@redhat.com Signed-off-by: Ziqian SUN (Zamir) <zsun@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-09-19tracing: Erase irqsoff trace with empty writeBo Yan
One convenient way to erase trace is "echo > trace". However, this is currently broken if the current tracer is irqsoff tracer. This is because irqsoff tracer use max_buffer as the default trace buffer. Set the max_buffer as the one to be cleared when it's the trace buffer currently in use. Link: http://lkml.kernel.org/r/1505754215-29411-1-git-send-email-byan@nvidia.com Cc: <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: 4acd4d00f ("tracing: give easy way to clear trace buffer") Signed-off-by: Bo Yan <byan@nvidia.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-09-18bpf: devmap: pass on return value of bpf_map_precharge_memlockTobias Klauser
If bpf_map_precharge_memlock in dev_map_alloc, -ENOMEM is returned regardless of the actual error produced by bpf_map_precharge_memlock. Fix it by passing on the error returned by bpf_map_precharge_memlock. Also return -EINVAL instead of -ENOMEM if the page count overflow check fails. This makes dev_map_alloc match the behavior of other bpf maps' alloc functions wrt. return values. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-17Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Thomas Gleixner: "Fix for an off by one error in a cpumask result comparison" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Fix cpumask check in __irq_startup_managed()
2017-09-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix hotplug deadlock in hv_netvsc, from Stephen Hemminger. 2) Fix double-free in rmnet driver, from Dan Carpenter. 3) INET connection socket layer can double put request sockets, fix from Eric Dumazet. 4) Don't match collect metadata-mode tunnels if the device is down, from Haishuang Yan. 5) Do not perform TSO6/GSO on ipv6 packets with extensions headers in be2net driver, from Suresh Reddy. 6) Fix scaling error in gen_estimator, from Eric Dumazet. 7) Fix 64-bit statistics deadlock in systemport driver, from Florian Fainelli. 8) Fix use-after-free in sctp_sock_dump, from Xin Long. 9) Reject invalid BPF_END instructions in verifier, from Edward Cree. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits) mlxsw: spectrum_router: Only handle IPv4 and IPv6 events Documentation: link in networking docs tcp: fix data delivery rate bpf/verifier: reject BPF_ALU64|BPF_END sctp: do not mark sk dumped when inet_sctp_diag_fill returns err sctp: fix an use-after-free issue in sctp_sock_dump netvsc: increase default receive buffer size tcp: update skb->skb_mstamp more carefully net: ipv4: fix l3slave check for index returned in IP_PKTINFO net: smsc911x: Quieten netif during suspend net: systemport: Fix 64-bit stats deadlock net: vrf: avoid gcc-4.6 warning qed: remove unnecessary call to memset tg3: clean up redundant initialization of tnapi tls: make tls_sw_free_resources static sctp: potential read out of bounds in sctp_ulpevent_type_enabled() MAINTAINERS: review Renesas DT bindings as well net_sched: gen_estimator: fix scaling error in bytes/packets samples nfp: wait for the NSP resource to appear on boot nfp: wait for board state before talking to the NSP ...
2017-09-16genirq: Fix cpumask check in __irq_startup_managed()Thomas Gleixner
The result of cpumask_any_and() is invalid when result greater or equal nr_cpu_ids. The current check is checking for greater only. Fix it. Fixes: 761ea388e8c4 ("genirq: Handle managed irqs gracefully in irq_startup()") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Yu <yu.c.chen@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: stable@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Len Brown <lenb@kernel.org> Link: http://lkml.kernel.org/r/20170913213152.272283444@linutronix.de
2017-09-15bpf/verifier: reject BPF_ALU64|BPF_ENDEdward Cree
Neither ___bpf_prog_run nor the JITs accept it. Also adds a new test case. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-14Merge branch 'work.set_fs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more set_fs removal from Al Viro: "Christoph's 'use kernel_read and friends rather than open-coding set_fs()' series" * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: unexport vfs_readv and vfs_writev fs: unexport vfs_read and vfs_write fs: unexport __vfs_read/__vfs_write lustre: switch to kernel_write gadget/f_mass_storage: stop messing with the address limit mconsole: switch to kernel_read btrfs: switch write_buf to kernel_write net/9p: switch p9_fd_read to kernel_write mm/nommu: switch do_mmap_private to kernel_read serial2002: switch serial2002_tty_write to kernel_{read/write} fs: make the buf argument to __kernel_write a void pointer fs: fix kernel_write prototype fs: fix kernel_read prototype fs: move kernel_read to fs/read_write.c fs: move kernel_write to fs/read_write.c autofs4: switch autofs4_write to __kernel_write ashmem: switch to ->read_iter
2017-09-14Merge branch 'work.ipc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ipc compat cleanup and 64-bit time_t from Al Viro: "IPC copyin/copyout sanitizing, including 64bit time_t work from Deepa Dinamani" * 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: utimes: Make utimes y2038 safe ipc: shm: Make shmid_kernel timestamps y2038 safe ipc: sem: Make sem_array timestamps y2038 safe ipc: msg: Make msg_queue timestamps y2038 safe ipc: mqueue: Replace timespec with timespec64 ipc: Make sys_semtimedop() y2038 safe get rid of SYSVIPC_COMPAT on ia64 semtimedop(): move compat to native shmat(2): move compat to native msgrcv(2), msgsnd(2): move compat to native ipc(2): move compat to native ipc: make use of compat ipc_perm helpers semctl(): move compat to native semctl(): separate all layout-dependent copyin/copyout msgctl(): move compat to native msgctl(): split the actual work from copyin/copyout ipc: move compat shmctl to native shmctl: split the work from copyin/copyout
2017-09-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "A few leftovers" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, page_owner: skip unnecessary stack_trace entries arm64: stacktrace: avoid listing stacktrace functions in stacktrace mm: treewide: remove GFP_TEMPORARY allocation flag IB/mlx4: fix sprintf format warning fscache: fix fscache_objlist_show format processing lib/test_bitmap.c: use ULL suffix for 64-bit constants procfs: remove unused variable drivers/media/cec/cec-adap.c: fix build with gcc-4.4.4 idr: remove WARN_ON_ONCE() when trying to replace negative ID
2017-09-14sched/wait: Introduce wakeup boomark in wake_up_page_bitTim Chen
Now that we have added breaks in the wait queue scan and allow bookmark on scan position, we put this logic in the wake_up_page_bit function. We can have very long page wait list in large system where multiple pages share the same wait list. We break the wake up walk here to allow other cpus a chance to access the list, and not to disable the interrupts when traversing the list for too long. This reduces the interrupt and rescheduling latency, and excessive page wait queue lock hold time. [ v2: Remove bookmark_wake_function ] Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-14sched/wait: Break up long wake list walkTim Chen
We encountered workloads that have very long wake up list on large systems. A waker takes a long time to traverse the entire wake list and execute all the wake functions. We saw page wait list that are up to 3700+ entries long in tests of large 4 and 8 socket systems. It took 0.8 sec to traverse such list during wake up. Any other CPU that contends for the list spin lock will spin for a long time. It is a result of the numa balancing migration of hot pages that are shared by many threads. Multiple CPUs waking are queued up behind the lock, and the last one queued has to wait until all CPUs did all the wakeups. The page wait list is traversed with interrupt disabled, which caused various problems. This was the original cause that triggered the NMI watch dog timer in: https://patchwork.kernel.org/patch/9800303/ . Only extending the NMI watch dog timer there helped. This patch bookmarks the waker's scan position in wake list and break the wake up walk, to allow access to the list before the waker resume its walk down the rest of the wait list. It lowers the interrupt and rescheduling latency. This patch also provides a performance boost when combined with the next patch to break up page wakeup list walk. We saw 22% improvement in the will-it-scale file pread2 test on a Xeon Phi system running 256 threads. [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and simply access to flags. ] Reported-by: Kan Liang <kan.liang@intel.com> Tested-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-13mm: treewide: remove GFP_TEMPORARY allocation flagMichal Hocko
GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's primary motivation was to allow users to tell that an allocation is short lived and so the allocator can try to place such allocations close together and prevent long term fragmentation. As much as this sounds like a reasonable semantic it becomes much less clear when to use the highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the context holding that memory sleep? Can it take locks? It seems there is no good answer for those questions. The current implementation of GFP_TEMPORARY is basically GFP_KERNEL | __GFP_RECLAIMABLE which in itself is tricky because basically none of the existing caller provide a way to reclaim the allocated memory. So this is rather misleading and hard to evaluate for any benefits. I have checked some random users and none of them has added the flag with a specific justification. I suspect most of them just copied from other existing users and others just thought it might be a good idea to use without any measuring. This suggests that GFP_TEMPORARY just motivates for cargo cult usage without any reasoning. I believe that our gfp flags are quite complex already and especially those with highlevel semantic should be clearly defined to prevent from confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and replace all existing users to simply use GFP_KERNEL. Please note that SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and so they will be placed properly for memory fragmentation prevention. I can see reasons we might want some gfp flag to reflect shorterm allocations but I propose starting from a clear semantic definition and only then add users with proper justification. This was been brought up before LSF this year by Matthew [1] and it turned out that GFP_TEMPORARY really doesn't have a clear semantic. It seems to be a heuristic without any measured advantage for most (if not all) its current users. The follow up discussion has revealed that opinions on what might be temporary allocation differ a lot between developers. So rather than trying to tweak existing users into a semantic which they haven't expected I propose to simply remove the flag and start from scratch if we really need a semantic for short term allocations. [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org [akpm@linux-foundation.org: fix typo] [akpm@linux-foundation.org: coding-style fixes] [sfr@canb.auug.org.au: drm/i915: fix up] Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Neil Brown <neilb@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-13Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Three CPU hotplug related fixes and a debugging improvement" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/debug: Add debugfs knob for "sched_debug" sched/core: WARN() when migrating to an offline CPU sched/fair: Plug hole between hotplug and active_load_balance() sched/fair: Avoid newidle balance for !active CPUs
2017-09-13Merge tag 'modules-for-v4.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: "Summary of modules changes for the 4.14 merge window: - minor code cleanups and fixes - modpost: avoid building modules that have names that exceed the size of the name field in struct module" * tag 'modules-for-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: Remove const attribute from alias for MODULE_DEVICE_TABLE module: fix ddebug_remove_module() modpost: abort if module name is too long
2017-09-12Merge tag 'selinux-pr-20170831' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: "A relatively quiet period for SELinux, 11 patches with only two/three having any substantive changes. These noteworthy changes include another tweak to the NNP/nosuid handling, per-file labeling for cgroups, and an object class fix for AF_UNIX/SOCK_RAW sockets; the rest of the changes are minor tweaks or administrative updates (Stephen's email update explains the file explosion in the diffstat). Everything passes the selinux-testsuite" [ Also a couple of small patches from the security tree from Tetsuo Handa for Tomoyo and LSM cleanup. The separation of security policy updates wasn't all that clean - Linus ] * tag 'selinux-pr-20170831' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: constify nf_hook_ops selinux: allow per-file labeling for cgroupfs lsm_audit: update my email address selinux: update my email address MAINTAINERS: update the NetLabel and Labeled Networking information selinux: use GFP_NOWAIT in the AVC kmem_caches selinux: Generalize support for NNP/nosuid SELinux domain transitions selinux: genheaders should fail if too many permissions are defined selinux: update the selinux info in MAINTAINERS credits: update Paul Moore's info selinux: Assign proper class to PF_UNIX/SOCK_RAW sockets tomoyo: Update URLs in Documentation/admin-guide/LSM/tomoyo.rst LSM: Remove security_task_create() hook.
2017-09-12Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Three fixes: - fix a suspend/resume cpusets bug - fix a !CONFIG_NUMA_BALANCING bug - fix a kerneldoc warning" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix nuisance kernel-doc warning sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs sched/fair: Fix wake_affine_llc() balancing rules
2017-09-12Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "A sparse irq race/locking fix, and a MSI irq domains population fix" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Make sparse_irq_lock protect what it should protect genirq/msi: Fix populating multiple interrupts
2017-09-12sched/debug: Add debugfs knob for "sched_debug"Peter Zijlstra
I'm forever late for editing my kernel cmdline, add a runtime knob to disable the "sched_debug" thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170907150614.142924283@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-12sched/core: WARN() when migrating to an offline CPUPeter Zijlstra
Migrating tasks to offline CPUs is a pretty big fail, warn about it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170907150614.094206976@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-12sched/fair: Plug hole between hotplug and active_load_balance()Peter Zijlstra
The load balancer applies cpu_active_mask to whatever sched_domains it finds, however in the case of active_balance there is a hole between setting rq->{active_balance,push_cpu} and running the stop_machine work doing the actual migration. The @push_cpu can go offline in this window, which would result in us moving a task onto a dead cpu, which is a fairly bad thing. Double check the active mask before the stop work does the migration. CPU0 CPU1 <SoftIRQ> stop_machine(takedown_cpu) load_balance() cpu_stopper_thread() ... work = multi_cpu_stop stop_one_cpu_nowait( /* wait for CPU0 */ .func = active_load_balance_cpu_stop ); </SoftIRQ> cpu_stopper_thread() work = multi_cpu_stop /* sync with CPU1 */ take_cpu_down() <idle> play_dead(); work = active_load_balance_cpu_stop set_task_cpu(p, CPU1); /* oops!! */ Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170907150614.044460912@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-12sched/fair: Avoid newidle balance for !active CPUsPeter Zijlstra
On CPU hot unplug, when parking the last kthread we'll try and schedule into idle to kill the CPU. This last schedule can (and does) trigger newidle balance because at this point the sched domains are still up because of commit: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds") Obviously pulling tasks to an already offline CPU is a bad idea, and all balancing operations _should_ be subject to cpu_active_mask, make it so. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds") Link: http://lkml.kernel.org/r/20170907150613.994135806@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "Life has been busy and I have not gotten half as much done this round as I would have liked. I delayed it so that a minor conflict resolution with the mips tree could spend a little time in linux-next before I sent this pull request. This includes two long delayed user namespace changes from Kirill Tkhai. It also includes a very useful change from Serge Hallyn that allows the security capability attribute to be used inside of user namespaces. The practical effect of this is people can now untar tarballs and install rpms in user namespaces. It had been suggested to generalize this and encode some of the namespace information information in the xattr name. Upon close inspection that makes the things that should be hard easy and the things that should be easy more expensive. Then there is my bugfix/cleanup for signal injection that removes the magic encoding of the siginfo union member from the kernel internal si_code. The mips folks reported the case where I had used FPE_FIXME me is impossible so I have remove FPE_FIXME from mips, while at the same time including a return statement in that case to keep gcc from complaining about unitialized variables. I almost finished the work to get make copy_siginfo_to_user a trivial copy to user. The code is available at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3 But I did not have time/energy to get the code posted and reviewed before the merge window opened. I was able to see that the security excuse for just copying fields that we know are initialized doesn't work in practice there are buggy initializations that don't initialize the proper fields in siginfo. So we still sometimes copy unitialized data to userspace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: Introduce v3 namespaced file capabilities mips/signal: In force_fcr31_sig return in the impossible case signal: Remove kernel interal si_code magic fcntl: Don't use ambiguous SIG_POLL si_codes prctl: Allow local CAP_SYS_ADMIN changing exe_file security: Use user_namespace::level to avoid redundant iterations in cap_capable() userns,pidns: Verify the userns for new pid namespaces signal/testing: Don't look for __SI_FAULT in userspace signal/mips: Document a conflict with SI_USER with SIGFPE signal/sparc: Document a conflict with SI_USER with SIGFPE signal/ia64: Document a conflict with SI_USER with SIGFPE signal/alpha: Document a conflict with SI_USER for SIGTRAP
2017-09-11perf/bpf: fix a clang compilation issueYonghong Song
clang does not support variable length array for structure member. It has the following error during compilation: kernel/trace/trace_syscalls.c:568:17: error: fields must have a constant size: 'variable length array in structure' extension will never be supported unsigned long args[sys_data->nb_args]; ^ The fix is to use a fixed array length instead. Reported-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-11sched/fair: Fix nuisance kernel-doc warningRandy Dunlap
Work around kernel-doc warning ('*' in Sphinx doc means "emphasis"): ../kernel/sched/fair.c:7584: WARNING: Inline emphasis start-string without end-string. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/f18b30f9-6251-6d86-9d44-16501e386891@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "The iwlwifi firmware compat fix is in here as well as some other stuff: 1) Fix request socket leak introduced by BPF deadlock fix, from Eric Dumazet. 2) Fix VLAN handling with TXQs in mac80211, from Johannes Berg. 3) Missing __qdisc_drop conversions in prio and qfq schedulers, from Gao Feng. 4) Use after free in netlink nlk groups handling, from Xin Long. 5) Handle MTU update properly in ipv6 gre tunnels, from Xin Long. 6) Fix leak of ipv6 fib tables on netns teardown, from Sabrina Dubroca with follow-on fix from Eric Dumazet. 7) Need RCU and preemption disabled during generic XDP data patch, from John Fastabend" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (54 commits) bpf: make error reporting in bpf_warn_invalid_xdp_action more clear Revert "mdio_bus: Remove unneeded gpiod NULL check" bpf: devmap, use cond_resched instead of cpu_relax bpf: add support for sockmap detach programs net: rcu lock and preempt disable missing around generic xdp bpf: don't select potentially stale ri->map from buggy xdp progs net: tulip: Constify tulip_tbl net: ethernet: ti: netcp_core: no need in netif_napi_del davicom: Display proper debug level up to 6 net: phy: sfp: rename dt properties to match the binding dt-binding: net: sfp binding documentation dt-bindings: add SFF vendor prefix dt-bindings: net: don't confuse with generic PHY property ip6_tunnel: fix setting hop_limit value for ipv6 tunnel ip_tunnel: fix setting ttl and tos value in collect_md mode ipv6: fix typo in fib6_net_exit() tcp: fix a request socket leak sctp: fix missing wake ups in some situations netfilter: xt_hashlimit: fix build error caused by 64bit division netfilter: xt_hashlimit: alloc hashtable with right size ...
2017-09-09Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: - most of the rest of MM - a small number of misc things - lib/ updates - checkpatch - autofs updates - ipc/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (126 commits) ipc: optimize semget/shmget/msgget for lots of keys ipc/sem: play nicer with large nsops allocations ipc/sem: drop sem_checkid helper ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_t ipc: convert sem_undo_list.refcnt from atomic_t to refcount_t ipc: convert ipc_namespace.count from atomic_t to refcount_t kcov: support compat processes sh: defconfig: cleanup from old Kconfig options mn10300: defconfig: cleanup from old Kconfig options m32r: defconfig: cleanup from old Kconfig options drivers/pps: use surrounding "if PPS" to remove numerous dependency checks drivers/pps: aesthetic tweaks to PPS-related content cpumask: make cpumask_next() out-of-line kmod: move #ifdef CONFIG_MODULES wrapper to Makefile kmod: split off umh headers into its own file MAINTAINERS: clarify kmod is just a kernel module loader kmod: split out umh code into its own file test_kmod: flip INT checks to be consistent test_kmod: remove paranoid UINT_MAX check on uint range processing vfat: deduplicate hex2bin() ...
2017-09-08bpf: devmap, use cond_resched instead of cpu_relaxJohn Fastabend
Be a bit more friendly about waiting for flush bits to complete. Replace the cpu_relax() with a cond_resched(). Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-08bpf: add support for sockmap detach programsJohn Fastabend
The bpf map sockmap supports adding programs via attach commands. This patch adds the detach command to keep the API symmetric and allow users to remove previously added programs. Otherwise the user would have to delete the map and re-add it to get in this state. This also adds a series of additional tests to capture detach operation and also attaching/detaching invalid prog types. API note: socks will run (or not run) programs depending on the state of the map at the time the sock is added. We do not for example walk the map and remove programs from previously attached socks. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-08bpf: don't select potentially stale ri->map from buggy xdp progsDaniel Borkmann
We can potentially run into a couple of issues with the XDP bpf_redirect_map() helper. The ri->map in the per CPU storage can become stale in several ways, mostly due to misuse, where we can then trigger a use after free on the map: i) prog A is calling bpf_redirect_map(), returning XDP_REDIRECT and running on a driver not supporting XDP_REDIRECT yet. The ri->map on that CPU becomes stale when the XDP program is unloaded on the driver, and a prog B loaded on a different driver which supports XDP_REDIRECT return code. prog B would have to omit calling to bpf_redirect_map() and just return XDP_REDIRECT, which would then access the freed map in xdp_do_redirect() since not cleared for that CPU. ii) prog A is calling bpf_redirect_map(), returning a code other than XDP_REDIRECT. prog A is then detached, which triggers release of the map. prog B is attached which, similarly as in i), would just return XDP_REDIRECT without having called bpf_redirect_map() and thus be accessing the freed map in xdp_do_redirect() since not cleared for that CPU. iii) prog A is attached to generic XDP, calling the bpf_redirect_map() helper and returning XDP_REDIRECT. xdp_do_generic_redirect() is currently not handling ri->map (will be fixed by Jesper), so it's not being reset. Later loading a e.g. native prog B which would, say, call bpf_xdp_redirect() and then returns XDP_REDIRECT would find in xdp_do_redirect() that a map was set and uses that causing use after free on map access. Fix thus needs to avoid accessing stale ri->map pointers, naive way would be to call a BPF function from drivers that just resets it to NULL for all XDP return codes but XDP_REDIRECT and including XDP_REDIRECT for drivers not supporting it yet (and let ri->map being handled in xdp_do_generic_redirect()). There is a less intrusive way w/o letting drivers call a reset for each BPF run. The verifier knows we're calling into bpf_xdp_redirect_map() helper, so it can do a small insn rewrite transparent to the prog itself in the sense that it fills R4 with a pointer to the own bpf_prog. We have that pointer at verification time anyway and R4 is allowed to be used as per calling convention we scratch R0 to R5 anyway, so they become inaccessible and program cannot read them prior to a write. Then, the helper would store the prog pointer in the current CPUs struct redirect_info. Later in xdp_do_*_redirect() we check whether the redirect_info's prog pointer is the same as passed xdp_prog pointer, and if that's the case then all good, since the prog holds a ref on the map anyway, so it is always valid at that point in time and must have a reference count of at least 1. If in the unlikely case they are not equal, it means we got a stale pointer, so we clear and bail out right there. Also do reset map and the owning prog in bpf_xdp_redirect(), so that bpf_xdp_redirect_map() and bpf_xdp_redirect() won't get mixed up, only the last call should take precedence. A tc bpf_redirect() doesn't use map anywhere yet, so no need to clear it there since never accessed in that layer. Note that in case the prog is released, and thus the map as well we're still under RCU read critical section at that time and have preemption disabled as well. Once we commit with the __dev_map_insert_ctx() from xdp_do_redirect_map() and set the map to ri->map_to_flush, we still wait for a xdp_do_flush_map() to finish in devmap dismantle time once flush_needed bit is set, so that is fine. Fixes: 97f91a7cf04f ("bpf: add bpf_redirect_map helper routine") Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>