path: root/kernel
2017-12-06  sched/fair: Update and fix the runnable propagation rule  (Vincent Guittot)
Unlike running, the runnable part can't be directly propagated through the hierarchy when we migrate a task. The main reason is that runnable time can be shared with other sched_entities that stay on the rq, and this runnable time will also remain on the prev cfs_rq and must not be removed. Instead, we can estimate what the new runnable of the prev cfs_rq should be and check that this estimation stays in a possible range. prop_runnable_sum is a good estimation when adding runnable_sum, but it fails most often when we remove it. Instead, we could use the formula below:

  gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

which assumes that tasks are equally runnable, which is not true but is easy to compute. Besides these estimates, we have several simple rules that help us to filter out wrong ones:

  - ge->avg.runnable_sum <= LOAD_AVG_MAX
  - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
  - ge->avg.runnable_sum can't increase when we detach a task

The effect of these fixes is better cgroups balancing.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ben Segall <bsegall@google.com> Cc: Chris Mason <clm@fb.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
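A minimal plain-C sketch of the estimate and the three filtering rules above (illustrative names and types only; the mainline logic lives in update_tg_cfs_runnable() in kernel/sched/fair.c):

    #define LOAD_AVG_MAX 47742      /* maximum possible PELT sum */

    /* Estimate the group entity's new runnable_sum, then clamp it
     * with the rules from the changelog. */
    static long estimate_runnable_sum(long load_sum, long weight,
                                      long running_sum,
                                      long old_runnable_sum, int detaching)
    {
            /* assume all tasks on the group cfs_rq are equally runnable */
            long est = weight ? load_sum / weight : 0;

            /* rule 1: never above the PELT maximum */
            if (est > LOAD_AVG_MAX)
                    est = LOAD_AVG_MAX;
            /* rule 2: runnable time includes running time */
            if (est < running_sum)
                    est = running_sum;
            /* rule 3: detaching a task must not increase runnable_sum */
            if (detaching && est > old_runnable_sum)
                    est = old_runnable_sum;

            return est;
    }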
2017-12-06  sched/wait: Fix add_wait_queue() behavioral change  (Omar Sandoval)
The following cleanup commit: 50816c48997a ("sched/wait: Standardize internal naming of wait-queue entries") ... unintentionally changed the behavior of add_wait_queue() from inserting the wait entry at the head of the wait queue to inserting it at the tail. Beyond a negative performance impact, this change in behavior theoretically also breaks wait queues which mix exclusive and non-exclusive waiters, as non-exclusive waiters will not be woken up if they are queued behind enough exclusive waiters. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@fb.com Fixes: 50816c48997a ("sched/wait: Standardize internal naming of wait-queue entries") Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
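The head/tail distinction, roughly as it reads in kernel/sched/wait.c after the fix (abridged sketch, not a verbatim hunk):

    /* non-exclusive waiters go to the head of the queue */
    void add_wait_queue(struct wait_queue_head *wq_head,
                        struct wait_queue_entry *wq_entry)
    {
            unsigned long flags;

            wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
            spin_lock_irqsave(&wq_head->lock, flags);
            __add_wait_queue(wq_head, wq_entry);            /* list_add() */
            spin_unlock_irqrestore(&wq_head->lock, flags);
    }

    /* exclusive waiters queue at the tail, so wake-ups reach
     * non-exclusive waiters first */
    void add_wait_queue_exclusive(struct wait_queue_head *wq_head,
                                  struct wait_queue_entry *wq_entry)
    {
            unsigned long flags;

            wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
            spin_lock_irqsave(&wq_head->lock, flags);
            __add_wait_queue_entry_tail(wq_head, wq_entry); /* list_add_tail() */
            spin_unlock_irqrestore(&wq_head->lock, flags);
    }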
2017-12-06  locking/lockdep: Fix possible NULL deref  (Peter Zijlstra)
We can't invalidate xhlocks when we've not yet allocated any. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Fixes: f52be5708076 ("locking/lockdep: Untangle xhlock history save/restore from task independence") Signed-off-by: Ingo Molnar <mingo@kernel.org>
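The plausible shape of the fix (the crossrelease code, including these identifiers, has since been removed from mainline, so this is a hedged sketch): guard the invalidation with a check that the per-task history buffer was ever allocated.

    /* in lockdep_invariant_state(), before touching the history: */
    if (current->xhlocks)
            invalidate_xhlock(&xhlock(current->xhlock_idx));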
2017-12-06  cpu/hotplug: Fix state name in takedown_cpu() comment  (Brendan Jackman)
CPUHP_AP_SCHED_MIGRATE_DYING doesn't exist; it looks like this was supposed to refer to CPUHP_AP_SCHED_STARTING's teardown callback, i.e. sched_cpu_dying(). Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171206105911.28093-1-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-05  remove task and stack pointer printout from oops dump  (Linus Torvalds)
Geert Uytterhoeven reported an NFS oops, and pointed out that some of the numbers were hashed and useless. We could just turn them from '%p' into '%px', but those numbers are really just legacy, and useless even when not hashed. So just remove them entirely. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-05  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Small overlapping change conflict ('net' changed a line, 'net-next' added a line right afterwards) in flexcan.c Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05  bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type  (Hendrik Brueckner)
Commit 0515e5999a466dfe ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type") introduced the bpf_perf_event_data structure, which exports the pt_regs structure. This is OK for multiple architectures but fails for s390 and arm64, which do not export pt_regs. Programs using it (for example, the bpf selftests) fail to compile on these architectures. For s390, exporting pt_regs is not an option because s390 wants to allow changes to it. For arm64, there is a user_pt_regs structure that covers the parts of the pt_regs structure meant for use by user space. To solve the broken uapi for s390 and arm64, introduce an abstract type for pt_regs and add an asm/bpf_perf_event.h file that makes the type concrete. An asm-generic header file covers the architectures that export pt_regs today. The arch-specific enablement for s390 and arm64 follows in separate commits. Reported-by: Thomas Richter <tmricht@linux.vnet.ibm.com> Fixes: 0515e5999a466dfe ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type") Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Reviewed-and-tested-by: Thomas Richter <tmricht@linux.vnet.ibm.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
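Roughly what the abstraction introduced here looks like (abridged from the uapi headers added by the commit; consult the tree for the exact contents):

    /* include/asm-generic/bpf_perf_event.h: default for architectures
     * that do export pt_regs */
    typedef struct pt_regs bpf_user_pt_regs_t;

    /* include/uapi/linux/bpf_perf_event.h */
    struct bpf_perf_event_data {
            bpf_user_pt_regs_t regs;
            __u64 sample_period;
    };

s390 and arm64 then supply their own asm/bpf_perf_event.h mapping bpf_user_pt_regs_t to a user-visible register structure.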
2017-12-04  Revert "cgroup/cpuset: remove circular dependency deadlock"  (Tejun Heo)
This reverts commit aa24163b2ee5c92120e32e99b5a93143a0f4258e. This and the following commit led to another circular locking scenario and the scenario which is fixed by this commit no longer exists after e8b3f8db7aad ("workqueue/hotplug: simplify workqueue_offline_cpu()") which removes work item flushing from hotplug path. Revert it for now. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-12-04  workqueue/hotplug: remove the workaround in rebind_workers()  (Lai Jiangshan)
Since the cpu/hotplug refactoring, DOWN_FAILED is never called without preceding DOWN_PREPARE making the workaround unnecessary. Remove it. Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-12-04  workqueue/hotplug: simplify workqueue_offline_cpu()  (Lai Jiangshan)
Since the recent cpu/hotplug refactoring, workqueue_offline_cpu() is guaranteed to run on the local cpu which is going offline. This also fixes the following deadlock by removing work item scheduling and flushing from CPU hotplug path. http://lkml.kernel.org/r/1504764252-29091-1-git-send-email-prsood@codeaurora.org tj: Description update. Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-12-04  Revert "cpuset: Make cpuset hotplug synchronous"  (Tejun Heo)
This reverts commit 1599a185f0e6113be185b9fb809c621c73865829. This and the previous commit led to another circular locking scenario and the scenario which is fixed by this commit no longer exists after e8b3f8db7aad ("workqueue/hotplug: simplify workqueue_offline_cpu()") which removes work item flushing from hotplug path. Revert it for now. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-12-04  livepatch: send a fake signal to all blocking tasks  (Miroslav Benes)
The live patching consistency model is LEAVE_PATCHED_SET and SWITCH_THREAD. This means that all tasks in the system have to be marked, one by one, as safe to call a new patched function. "Safe" means that a task is not (sleeping) in the set of patched functions; that is, no patched function is on the task's stack. Another clearly safe place is the boundary between kernel and userspace. The patching waits for all tasks to get outside of the patched set or to cross the boundary. The transition is completed afterwards.

The problem is that a task can block the transition for quite a long time, if not forever. It could sleep in the set of patched functions, for example. Luckily, we can force the task to leave the set by sending it a fake signal, that is, a signal with no data in the signal-pending structures (no handler, no sign of a proper signal having been delivered). Suspend/freezer use this to freeze tasks as well. The task gets TIF_SIGPENDING set and is woken up (if it has been sleeping in the kernel before) or kicked by a rescheduling IPI (if it was running on another CPU). This causes the task to go to the kernel/userspace boundary, where the signal would be handled and the task would be marked as safe in terms of live patching.

There are tasks which are not affected by this technique, though. The fake signal is not sent to kthreads; they have to be handled differently. They can be woken up so they leave the patched set, and their TIF_PATCH_PENDING can be cleared thanks to stack checking.

For the sake of completeness: first, if the task is in TASK_RUNNING state but not currently running on some CPU, it doesn't get the IPI, but it would eventually handle the signal anyway. Second, if the task runs in the kernel (in TASK_RUNNING state), it gets the IPI, but the signal is not handled on return from the interrupt. It would be handled on return to userspace in the future, when the fake signal is sent again. Stack checking deals with these cases in a better way.

If the task was sleeping in a syscall, it would be woken by our fake signal, would check whether TIF_SIGPENDING is set (by calling the signal_pending() predicate) and return ERESTART* or EINTR. Syscalls with ERESTART* return values are restarted in case of the fake signal (see do_signal()). EINTR is propagated back to the userspace program. This could disturb the program, but:

  * each process dealing with signals should react accordingly to EINTR return values;
  * syscalls returning EINTR are quite common in the system even if no fake signal is sent;
  * the freezer sends the fake signal and does not deal with EINTR in any way, so EINTR values are returned when the system is resumed.

The actual safe marking is done in the architectures' "entry" code on syscall and interrupt/exception exit paths, and in the stack checking functions of livepatch. TIF_PATCH_PENDING is cleared and the next recalc_sigpending() drops TIF_SIGPENDING. In connection with this, also call klp_update_patch_state() before do_signal(), so that recalc_sigpending() in dequeue_signal() can clear TIF_PATCH_PENDING immediately and thus prevent a double call of do_signal().

Note that the fake signal is not sent to stopped/traced tasks. Such a task blocks the patching from finishing until it continues again (is no longer traced).

Last, sending the fake signal is not automatic. It is done only when the admin requests it by writing 1 to the signal sysfs attribute in the livepatch sysfs directory.
Signed-off-by: Miroslav Benes <mbenes@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: x86@kernel.org Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Signed-off-by: Jiri Kosina <jkosina@suse.cz>
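A sketch of the signal-sending pass described above (modeled on klp_send_signals() in kernel/livepatch/; abridged, not a verbatim hunk):

    static void klp_send_signals(void)
    {
            struct task_struct *g, *task;

            read_lock(&tasklist_lock);
            for_each_process_thread(g, task) {
                    if (!klp_patch_pending(task))
                            continue;

                    if (task->flags & PF_KTHREAD) {
                            /* kthreads never cross the kernel/userspace
                             * boundary; just wake interruptible sleepers
                             * so stack checking can catch them */
                            wake_up_state(task, TASK_INTERRUPTIBLE);
                    } else {
                            /* the "fake" signal: TIF_SIGPENDING set and
                             * the task woken, with no real signal queued */
                            spin_lock_irq(&task->sighand->siglock);
                            signal_wake_up(task, 0);
                            spin_unlock_irq(&task->sighand->siglock);
                    }
            }
            read_unlock(&tasklist_lock);
    }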
2017-12-04  genirq/matrix: Fix the precedence fix for real  (Thomas Gleixner)
The previous commit, which made the operator precedence in irq_matrix_available() explicit, made the implicit brokenness explicitly wrong. It was wrong in the original commit already. The overworked maintainer did not notice it either when merging the patch. Replace the confusing '?' construct by a simple and obvious if (). Fixes: 75f1133873d6 ("genirq/matrix: Make - vs ?: Precedence explicit") Reported-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Kees Cook <keescook@chromium.org>
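The before/after, reconstructed from the changelog (a sketch of irq_matrix_available() in kernel/irq/matrix.c; the exact hunk is in the commit):

    /* before: precedence made explicit, logic still wrong */
    return (m->global_available - cpudown) ? cm->available : 0;

    /* after: a plain if () */
    unsigned int irq_matrix_available(struct irq_matrix *m, bool cpudown)
    {
            struct cpumap *cm = this_cpu_ptr(m->maps);
            unsigned int avl = m->global_available;

            if (cpudown)
                    avl -= cm->available;

            return avl;
    }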
2017-12-04  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds)
Pull networking fixes from David Miller:

 1) Various TCP control block fixes, including one that crashes with SELinux, from David Ahern and Eric Dumazet.
 2) Fix ACK generation in rxrpc, from David Howells.
 3) ipvlan doesn't set the mark properly in the ipv4 route lookup key, from Gao Feng.
 4) SIT configuration doesn't take on the frag_off ipv4 field configuration properly, fix from Hangbin Liu.
 5) TSO can fail after device down/up on stmmac, fix from Lars Persson.
 6) Various bpftool fixes (mostly in JSON handling) from Quentin Monnet.
 7) Various SKB leak fixes in vhost/tun/tap (mostly observed as performance problems), from Wei Xu.
 8) mvpp2's TX descriptors were not zero initialized, from Yan Markman.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
  tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()
  tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()
  rxrpc: Fix the MAINTAINERS record
  rxrpc: Use correct netns source in rxrpc_release_sock()
  liquidio: fix incorrect indentation of assignment statement
  stmmac: reset last TSO segment size after device open
  ipvlan: Add the skb->mark as flow4's member to lookup route
  s390/qeth: build max size GSO skbs on L2 devices
  s390/qeth: fix GSO throughput regression
  s390/qeth: fix thinko in IPv4 multicast address tracking
  tap: free skb if flags error
  tun: free skb in early errors
  vhost: fix skb leak in handle_rx()
  bnxt_en: Fix a variable scoping in bnxt_hwrm_do_send_msg()
  bnxt_en: fix dst/src fid for vxlan encap/decap actions
  bnxt_en: wildcard smac while creating tunnel decap filter
  bnxt_en: Need to unconditionally shut down RoCE in bnxt_shutdown
  phylink: ensure we take the link down when phylink_stop() is called
  sfp: warn about modules requiring address change sequence
  sfp: improve RX_LOS handling
  ...
2017-12-04  tracepoint: Remove smp_read_barrier_depends() from comment  (Paul E. McKenney)
The comment in tracepoint_add_func() mentions smp_read_barrier_depends(), whose use should be quite restricted. This commit updates the comment to instead mention the smp_store_release() and rcu_dereference_sched() that the current code actually uses. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
2017-12-04  locking: Remove smp_read_barrier_depends() from queued_spin_lock_slowpath()  (Paul E. McKenney)
Queued spinlocks are not used by DEC Alpha, and furthermore operations such as READ_ONCE() and release/relaxed RMW atomics are being changed to imply smp_read_barrier_depends(). This commit therefore removes the now-redundant smp_read_barrier_depends() from queued_spin_lock_slowpath(), and adjusts the comments accordingly. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com>
2017-12-04  uprobes: Remove now-redundant smp_read_barrier_depends()  (Paul E. McKenney)
Now that READ_ONCE() implies smp_read_barrier_depends(), get_xol_area() and get_trampoline_vaddr() no longer need their smp_read_barrier_depends() calls, which this commit removes. While we are here, convert the corresponding smp_wmb() to an smp_store_release(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
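The resulting publish/consume pattern, in outline (illustrative sketch; cf. kernel/events/uprobes.c):

    /* publisher: fully initialize the area, then publish it with
     * release semantics so the initializing stores cannot be reordered
     * past the pointer assignment (this replaces the old smp_wmb()) */
    smp_store_release(&mm->uprobes_state.xol_area, area);

    /* consumer: READ_ONCE() now suffices; it subsumes the old
     * smp_read_barrier_depends() on Alpha */
    struct xol_area *area = READ_ONCE(mm->uprobes_state.xol_area);
    if (area)
            return area;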
2017-12-04  softirq: Eliminate cond_resched_rcu_qs() in favor of cond_resched()  (Paul E. McKenney)
Now that cond_resched() also provides RCU quiescent states when needed, it can be used in place of cond_resched_rcu_qs(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: NeilBrown <neilb@suse.com> Cc: Ingo Molnar <mingo@kernel.org>
2017-12-04  trace: Eliminate cond_resched_rcu_qs() in favor of cond_resched()  (Paul E. McKenney)
Now that cond_resched() also provides RCU quiescent states when needed, it can be used in place of cond_resched_rcu_qs(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com>
2017-12-04  workqueue: Eliminate cond_resched_rcu_qs() in favor of cond_resched()  (Paul E. McKenney)
Now that cond_resched() also provides RCU quiescent states when needed, it can be used in place of cond_resched_rcu_qs(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2017-12-04  PM: Provide a config snippet for disabling PM  (Mark Brown)
A frequent source of build problems is poor handling of optional PM support: almost all development is done with the PM options enabled, but they can be turned off. Currently few if any of the build test services do this as standard, as there is no standard config for it, and the use of selects and def_bool means that simply setting CONFIG_PM=n doesn't do what is expected. To make this easier, provide a fragment that can be used with KCONFIG_ALLCONFIG to force PM off. CONFIG_XEN is disabled as Xen uses hibernation callbacks which end up turning on power management on architectures with Xen. Some cpuidle implementations on ARM select PM, so CONFIG_CPU_IDLE is disabled, and some ARM architectures unconditionally enable PM so they are also disabled. Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
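A hypothetical reconstruction of such a fragment from the changelog (only the symbols named above; the real file shipped with the commit also lists the specific ARM platforms that select PM unconditionally):

    CONFIG_PM=n
    # Xen uses hibernation callbacks that drag PM back in
    CONFIG_XEN=n
    # some ARM cpuidle implementations select PM
    CONFIG_CPU_IDLE=n

It would be used along these lines (the nopm.config path is an assumption for illustration):

    make KCONFIG_ALLCONFIG=kernel/configs/nopm.config allmodconfig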
2017-12-04  tracing: Pass export pointer as argument to ->write()  (Felipe Balbi)
By passing an export descriptor to the write function, users don't need to keep a global static pointer and can rely on container_of() to fetch their own structure. Link: http://lkml.kernel.org/r/20170602102025.5140-1-felipe.balbi@linux.intel.com Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Chunyan Zhang <zhang.chunyan@linaro.org> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
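With the export passed to ->write(), a user can recover its containing object via container_of() instead of a file-scope static. A hypothetical user (my_export_dev and my_write are illustrative names, not kernel code):

    struct my_export_dev {
            void __iomem *base;
            struct trace_export export;
    };

    static void my_write(struct trace_export *export,
                         const void *buf, unsigned int len)
    {
            struct my_export_dev *dev =
                    container_of(export, struct my_export_dev, export);

            /* push the trace record out through the device */
            memcpy_toio(dev->base, buf, len);
    }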
2017-12-04  ring-buffer: Remove unused function __rb_data_page_index()  (Matthias Kaehlcke)
This fixes the following warning when building with clang: kernel/trace/ring_buffer.c:1842:1: error: unused function '__rb_data_page_index' [-Werror,-Wunused-function] Link: http://lkml.kernel.org/r/20170518001415.5223-1-mka@chromium.org Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-12-04  tracing: make PREEMPTIRQ_EVENTS depend on TRACING  (Arnd Bergmann)
When CONFIG_TRACING is disabled, the new preemptirq events tracer produces a build failure: In file included from kernel/trace/trace_irqsoff.c:17:0: kernel/trace/trace.h: In function 'trace_test_and_set_recursion': kernel/trace/trace.h:542:28: error: 'struct task_struct' has no member named 'trace_recursion' Adding an explicit dependency avoids the broken configuration. Link: http://lkml.kernel.org/r/20171103104031.270375-1-arnd@arndb.de Fixes: d59158162e03 ("tracing: Add support for preempt and irq enable/disable events") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-12-04  tracing: Allocate mask_str buffer dynamically  (Changbin Du)
The default NR_CPUS can be very large, but the actual possible nr_cpu_ids is usually very small. For my x86 distribution, NR_CPUS is 8192 and nr_cpu_ids is 4; about 2 pages are wasted. Most machines don't have so many CPUs, so defining an array with NR_CPUS just wastes memory. So let's allocate the buffer dynamically when needed. With this change, the mutex tracing_cpumask_update_lock can also be removed now; it was used to protect mask_str. Link: http://lkml.kernel.org/r/1512013183-19107-1-git-send-email-changbin.du@intel.com Fixes: 36dfe9252bd4c ("ftrace: make use of tracing_cpumask") Cc: stable@vger.kernel.org Signed-off-by: Changbin Du <changbin.du@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
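The shape of the fix (sketch of tracing_cpumask_read() in kernel/trace/trace.c, with error handling abridged):

    static ssize_t
    tracing_cpumask_read(struct file *filp, char __user *ubuf,
                         size_t count, loff_t *ppos)
    {
            struct trace_array *tr = file_inode(filp)->i_private;
            char *mask_str;
            int len;

            /* size the buffer from the real cpumask, not NR_CPUS */
            len = snprintf(NULL, 0, "%*pb\n",
                           cpumask_pr_args(tr->tracing_cpumask)) + 1;
            mask_str = kmalloc(len, GFP_KERNEL);
            if (!mask_str)
                    return -ENOMEM;

            len = snprintf(mask_str, len, "%*pb\n",
                           cpumask_pr_args(tr->tracing_cpumask));
            count = simple_read_from_buffer(ubuf, count, ppos, mask_str, len);
            kfree(mask_str);

            return count;
    }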
2017-12-04  tracing: Fix code comments in trace.c  (Chunyu Hu)
Naming in the code comments for tracing_snapshot, tracing_snapshot_alloc and trace_pid_filter_add_remove_task doesn't match the real function names, and latency_trace has been removed from the tracing directory. Fix them. Link: http://lkml.kernel.org/r/1508394753-20887-1-git-send-email-chuhu@redhat.com Fixes: cab5037 ("tracing/ftrace: Enable snapshot function trigger") Fixes: 886b5b7 ("tracing: remove /debug/tracing/latency_trace") Signed-off-by: Chunyu Hu <chuhu@redhat.com> [ Replaced /sys/kernel/debug/tracing with /sys/kernel/tracing ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-12-03  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (David S. Miller)
Daniel Borkmann says:

====================
pull-request: bpf 2017-12-02

The following pull-request contains BPF updates for your *net* tree. The main changes are:

1) Fix a compilation warning in xdp redirect tracepoint due to missing bpf.h include that pulls in struct bpf_map, from Xie.

2) Limit the maximum number of attachable BPF progs for a given perf event as long as uabi is not frozen yet. The hard upper limit is now 64 and therefore the same as with BPF multi-prog for cgroups. Also add related error checking for the sample BPF loader when enabling and attaching to the perf event, from Yonghong.

3) Specifically set the RLIMIT_MEMLOCK for the test_verifier_log case, so that the test case can always pass and not fail in some environments due to too low default limit, also from Yonghong.

4) Fix up a missing license header comment for kernel/bpf/offload.c, from Jakub.

5) Several fixes for bpftool, among others a crash on incorrect arguments when json output is used, error message handling fixes on unknown options and proper destruction of json writer for some exit cases, all from Quentin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-01  Merge branch 'for-linus' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull block fixes from Jens Axboe:
 "A selection of fixes/changes that should make it into this series. This contains:

  - NVMe, two merges, containing:
      - pci-e, rdma, and fc fixes
      - Device quirks

  - Fix for a badblocks leak in null_blk

  - bcache fix from Rui Hua for a race condition regression where -EINTR was returned to upper layers that didn't expect it.

  - Regression fix for blktrace for a bug introduced in this series.

  - blktrace cleanup for cgroup id.

  - bdi registration error handling.

  - Small series with cleanups for blk-wbt.

  - Various little fixes for typos and the like.

  Nothing earth shattering, most important are the NVMe and bcache fixes"

* 'for-linus' of git://git.kernel.dk/linux-block: (34 commits)
  nvme-pci: fix NULL pointer dereference in nvme_free_host_mem()
  nvme-rdma: fix memory leak during queue allocation
  blktrace: fix trace mutex deadlock
  nvme-rdma: Use mr pool
  nvme-rdma: Check remotely invalidated rkey matches our expected rkey
  nvme-rdma: wait for local invalidation before completing a request
  nvme-rdma: don't complete requests before a send work request has completed
  nvme-rdma: don't suppress send completions
  bcache: check return value of register_shrinker
  bcache: recover data from backing when data is clean
  bcache: Fix building error on MIPS
  bcache: add a comment in journal bucket reading
  nvme-fc: don't use bit masks for set/test_bit() numbers
  blk-wbt: fix comments typo
  blk-wbt: move wbt_clear_stat to common place in wbt_done
  blk-sysfs: remove NULL pointer checking in queue_wb_lat_store
  blk-wbt: remove duplicated setting in wbt_init
  nvme-pci: add quirk for delay before CHK RDY for WDC SN200
  block: remove useless assignment in bio_split
  null_blk: fix dev->badblocks leak
  ...
2017-12-01  bpf: cleanup register_is_null()  (Alexei Starovoitov)
don't pass large struct bpf_reg_state by value. Instead pass it by pointer. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
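The change in outline (roughly the before/after of register_is_null() in kernel/bpf/verifier.c; the body is abridged):

    /* before: the whole register state is copied at every call */
    static bool register_is_null(struct bpf_reg_state reg)
    {
            return reg.type == SCALAR_VALUE &&
                   tnum_equals_const(reg.var_off, 0);
    }

    /* after: pass a pointer instead */
    static bool register_is_null(struct bpf_reg_state *reg)
    {
            return reg->type == SCALAR_VALUE &&
                   tnum_equals_const(reg->var_off, 0);
    }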
2017-12-01  bpf: improve JEQ/JNE path walking  (Alexei Starovoitov)
verifier knows how to trim paths that are known not to be taken at run-time when a register containing a run-time constant is compared with another constant. It was done only for JEQ comparison. Extend it to include JNE as well. More cases can be added in the future.

                        before   after
  bpf_lb-DLB_L3.o         2270    2051
  bpf_lb-DLB_L4.o         3682    3287
  bpf_lb-DUNKNOWN.o       1110    1080
  bpf_lxc-DDROP_ALL.o    27876   24980
  bpf_lxc-DUNKNOWN.o     38780   34308
  bpf_netdev.o           16937   15404
  bpf_overlay.o           7929    7191

Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
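The rough shape of the extended check in the verifier's conditional-jump handling (a sketch; the exact code is in check_cond_jmp_op() in kernel/bpf/verifier.c):

    if (BPF_SRC(insn->code) == BPF_K &&
        (opcode == BPF_JEQ || opcode == BPF_JNE) &&
        dst_reg->type == SCALAR_VALUE &&
        tnum_is_const(dst_reg->var_off)) {
            if ((opcode == BPF_JEQ && dst_reg->var_off.value == insn->imm) ||
                (opcode == BPF_JNE && dst_reg->var_off.value != insn->imm)) {
                    /* branch always taken: walk the goto side only */
                    *insn_idx += insn->off;
                    return 0;
            } else {
                    /* branch never taken: walk the fall-through only */
                    return 0;
            }
    }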
2017-12-01  bpf: improve verifier liveness marks  (Alexei Starovoitov)
registers with pointers filled from stack were missing live_written marks, which caused liveness propagation to unnecessarily mark more registers as live_read and miss state pruning opportunities later on.

                        before   after
  bpf_lb-DLB_L3.o         2285    2270
  bpf_lb-DLB_L4.o         3723    3682
  bpf_lb-DUNKNOWN.o       1110    1110
  bpf_lxc-DDROP_ALL.o    27954   27876
  bpf_lxc-DUNKNOWN.o     38954   38780
  bpf_netdev.o           16943   16937
  bpf_overlay.o           7929    7929

Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-01  bpf: don't mark FP reg as uninit  (Alexei Starovoitov)
when verifier hits an internal bug, don't mark register R10==FP as uninit, since it's a read-only register and it's not technically correct to let verifier run further, since it may assume that R10 has valid auxiliary state. While developing subsequent patches this issue was discovered; though the code eventually changed so that the aux reg state no longer has pointers, it is still safer to avoid clearing a read-only register. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
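The shape of the resulting guard (sketch of the error path in mark_reg_unknown() in kernel/bpf/verifier.c):

    if (WARN_ON(regno >= MAX_BPF_REG)) {
            verbose(env, "mark_reg_unknown(regs, %u)\n", regno);
            /* something is wrong: poison only the caller-saved
             * registers R0..R9 and leave the read-only frame
             * pointer R10 alone */
            for (regno = 0; regno < BPF_REG_FP; regno++)
                    __mark_reg_not_init(regs + regno);
            return;
    }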
2017-12-01  bpf: print liveness info to verifier log  (Alexei Starovoitov)
let verifier print register and stack liveness information into verifier log Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-01  bpf: fix stack state printing in verifier log  (Alexei Starovoitov)
fix incorrect stack state prints in print_verifier_state() Fixes: 638f5b90d460 ("bpf: reduce verifier memory consumption") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-01  bpf: set maximum number of attached progs to 64 for a single perf tp  (Yonghong Song)
cgroup+bpf prog array has a maximum number of 64 programs. Let us apply the same limit here. Fixes: e87c6bc3852b ("bpf: permit multiple bpf attachments for a single perf event") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
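The limit check, in outline (a sketch modeled on perf_event_attach_bpf_prog() in kernel/trace/bpf_trace.c; surrounding locking omitted):

    #define BPF_TRACE_MAX_PROGS 64

    old_array = event->tp_event->prog_array;
    if (old_array &&
        bpf_prog_array_length(old_array) >= BPF_TRACE_MAX_PROGS) {
            ret = -E2BIG;
            goto unlock;
    }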
2017-11-29  kallsyms: take advantage of the new '%px' format  (Linus Torvalds)
The conditional kallsym hex printing used a special fixed-width '%lx' output (KALLSYM_FMT) in preparation for the hashing of %p, but that series ended up adding a %px specifier to help with the conversions. Use it, and avoid the "print pointer as an unsigned long" code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29  Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent  (Ingo Molnar)
Pull perf tooling fixes from Arnaldo Carvalho de Melo:

 "- Fix window dimensions change handling in 'perf top' (Jiri Olsa)

 - Fix 'perf record -c/-F' options for CPU event aliases (Andi Kleen)

 - Generate PERF_RECORD_{MMAP,COMM,EXEC} with 'perf record --delay', fixing symbol resolution for processes created, maps put in place while --delay happens (Arnaldo Carvalho de Melo)

 - Fix up leftover perf_evsel_stat usage via evsel->priv, plugging a SEGV when using event groups as in: $ perf stat -e '{cpu-clock,instructions}' workload

 - Fix 'perf script --per-event-dump' for auxtrace synth evsels (Arnaldo Carvalho de Melo)

 - Ignore kptr_restrict when not sampling the kernel (Arnaldo Carvalho de Melo)

 - Synchronize kernel ABI headers wrt SPDX tags and ABI changes, taking minimal action to handle new syscall args and silencing perf build warnings (Arnaldo Carvalho de Melo, Ingo Molnar)

 - Fix header.size for namespace events (Jiri Olsa)

 - Fix a bug during strstart() conversion in 'perf help' (Namhyung Kim)

 - Do not truncate instruction names at 6 chars in 'perf annotate', there are really long instruction names in PPC (Ravi Bangoria)

 - Fixup discontiguous/sparse numa nodes in 'perf bench numa' (Satheesh Rajendran)

 - Fix an exit code of trace__symbols_init in 'perf trace' (Andrei Vagin)

 - Fix 'perf test' entries on s/390 (Thomas Richter)

 - Bring instruction decoder files used by Intel PT into line with the kernel, silencing build warning (Adrian Hunter)"

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-29  Merge branch 'linus' into perf/urgent, to pick up dependent commits  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-28  sched: Stop switched_to_rt() from sending IPIs to offline CPUs  (Paul E. McKenney)
The rcutorture test suite occasionally provokes a splat due to invoking rt_mutex_lock() which needs to boost the priority of a task currently sitting on a runqueue that belongs to an offline CPU:

  WARNING: CPU: 0 PID: 12 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
  Modules linked in:
  CPU: 0 PID: 12 Comm: rcub/7 Not tainted 4.14.0-rc4+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
  task: ffff9ed3de5f8cc0 task.stack: ffffbbf80012c000
  RIP: 0010:native_smp_send_reschedule+0x37/0x40
  RSP: 0018:ffffbbf80012fd10 EFLAGS: 00010082
  RAX: 000000000000002f RBX: ffff9ed3dd9cb300 RCX: 0000000000000004
  RDX: 0000000080000004 RSI: 0000000000000086 RDI: 00000000ffffffff
  RBP: ffffbbf80012fd10 R08: 000000000009da7a R09: 0000000000007b9d
  R10: 0000000000000001 R11: ffffffffbb57c2cd R12: 000000000000000d
  R13: ffff9ed3de5f8cc0 R14: 0000000000000061 R15: ffff9ed3ded59200
  FS:  0000000000000000(0000) GS:ffff9ed3dea00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000080686f0 CR3: 000000001b9e0000 CR4: 00000000000006f0
  Call Trace:
   resched_curr+0x61/0xd0
   switched_to_rt+0x8f/0xa0
   rt_mutex_setprio+0x25c/0x410
   task_blocks_on_rt_mutex+0x1b3/0x1f0
   rt_mutex_slowlock+0xa9/0x1e0
   rt_mutex_lock+0x29/0x30
   rcu_boost_kthread+0x127/0x3c0
   kthread+0x104/0x140
   ? rcu_report_unblock_qs_rnp+0x90/0x90
   ? kthread_create_on_node+0x40/0x40
   ret_from_fork+0x22/0x30
  Code: f0 00 0f 92 c0 84 c0 74 14 48 8b 05 34 74 c5 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 a0 c6 fc b9 e8 d5 b5 06 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 a2 d1 13 02 85 c0 75 38 55 48

But the target task's priority has already been adjusted, so the only purpose of switched_to_rt() invoking resched_curr() is to wake up the CPU running some task that needs to be preempted by the boosted task. But the CPU is offline, which presumably means that the task must be migrated to some other CPU, and that this other CPU will undertake any needed preemption at the time of migration. Because the runqueue lock is held when resched_curr() is invoked, we know that the boosted task cannot go anywhere, so it is not necessary to invoke resched_curr() in this particular case. This commit therefore makes switched_to_rt() refrain from invoking resched_curr() when the target CPU is offline. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
2017-11-28  sched: Stop resched_cpu() from sending IPIs to offline CPUs  (Paul E. McKenney)
The rcutorture test suite occasionally provokes a splat due to invoking resched_cpu() on an offline CPU:

  WARNING: CPU: 2 PID: 8 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
  Modules linked in:
  CPU: 2 PID: 8 Comm: rcu_preempt Not tainted 4.14.0-rc4+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
  task: ffff902ede9daf00 task.stack: ffff96c50010c000
  RIP: 0010:native_smp_send_reschedule+0x37/0x40
  RSP: 0018:ffff96c50010fdb8 EFLAGS: 00010096
  RAX: 000000000000002e RBX: ffff902edaab4680 RCX: 0000000000000003
  RDX: 0000000080000003 RSI: 0000000000000000 RDI: 00000000ffffffff
  RBP: ffff96c50010fdb8 R08: 0000000000000000 R09: 0000000000000001
  R10: 0000000000000000 R11: 00000000299f36ae R12: 0000000000000001
  R13: ffffffff9de64240 R14: 0000000000000001 R15: ffffffff9de64240
  FS:  0000000000000000(0000) GS:ffff902edfc80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000f7d4c642 CR3: 000000001e0e2000 CR4: 00000000000006e0
  Call Trace:
   resched_curr+0x8f/0x1c0
   resched_cpu+0x2c/0x40
   rcu_implicit_dynticks_qs+0x152/0x220
   force_qs_rnp+0x147/0x1d0
   ? sync_rcu_exp_select_cpus+0x450/0x450
   rcu_gp_kthread+0x5a9/0x950
   kthread+0x142/0x180
   ? force_qs_rnp+0x1d0/0x1d0
   ? kthread_create_on_node+0x40/0x40
   ret_from_fork+0x27/0x40
  Code: 14 01 0f 92 c0 84 c0 74 14 48 8b 05 14 4f f4 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 38 89 ca 9d e8 e5 56 08 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 52 9e 37 02 85 c0 75 38 55 48
  ---[ end trace 26df9e5df4bba4ac ]---

This splat cannot be generated by expedited grace periods because they always invoke resched_cpu() on the current CPU, which is good because expedited grace periods require that resched_cpu() unconditionally succeed. However, other parts of RCU can tolerate resched_cpu() acting as a no-op, at least as long as it doesn't happen too often. This commit therefore makes resched_cpu() invoke resched_curr() only if the CPU is either online or is the current CPU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
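The fix as described (a sketch of resched_cpu() in kernel/sched/core.c; only IPI the CPU when it can actually receive one):

    void resched_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            unsigned long flags;

            raw_spin_lock_irqsave(&rq->lock, flags);
            if (cpu_online(cpu) || cpu == smp_processor_id())
                    resched_curr(rq);
            raw_spin_unlock_irqrestore(&rq->lock, flags);
    }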
2017-11-28  torture: Suppress CPU stall warnings during shutdown ftrace dump  (Paul E. McKenney)
The torture_shutdown() function directly invokes ftrace_dump(), which can result in RCU CPU stall warnings when the ftrace buffer is large, which it usually is. This commit therefore invokes rcu_ftrace_dump() in place of ftrace_dump(), suppressing RCU CPU stall warnings during this time. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-11-28  srcu: Prohibit call_srcu() use under raw spinlocks  (Paul E. McKenney)
Invoking queue_delayed_work() while holding a raw spinlock is forbidden in -rt kernels, which is exactly what __call_srcu() does, indirectly via srcu_funnel_gp_start(). This commit therefore downgrades Tree SRCU's locking from raw to non-raw spinlocks, which works because call_srcu() is never called while holding a raw spinlock. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-11-28  rcu: Simplify rcu_eqs_{enter,exit}() non-idle task debug code  (Paul E. McKenney)
The code that checks for non-idle non-nohz_idle-usermode tasks invoking rcu_eqs_enter() and rcu_eqs_exit() prints a considerable quantity of helpful information. However, these checks fire rarely, so the extra complexity is no longer worth it. This commit therefore replaces this debug code with simple WARN_ON_ONCE() statements. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-11-28  rcu: Fold rcu_eqs_exit_common() into rcu_eqs_exit()  (Paul E. McKenney)
There is now only one call to rcu_eqs_exit_common() and there is no other reason to keep it separate. This commit therefore inlines it into its sole call site, saving a few lines of code in the process. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-11-28  rcu: Fold rcu_eqs_enter_common() into rcu_eqs_enter()  (Paul E. McKenney)
There is now only one call to rcu_eqs_enter_common() and there is no other reason to keep it separate. This commit therefore inlines it into its sole call site, saving a few lines of code in the process. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-11-28  rcu: Avoid ->dynticks_nesting store tearing  (Paul E. McKenney)
Although ->dynticks_nesting is updated only by process level, it is accessed from hardirq to check for interrupt-from-idle quiescent states. Store tearing is thus possible, so this commit applies WRITE_ONCE() to ->dynticks_nesting stores. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
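The pattern being applied, in outline (illustrative, not an exact mainline hunk):

    /* process level, e.g. in rcu_eqs_enter(): a marked store that
     * the compiler may not tear or fuse */
    WRITE_ONCE(rdtp->dynticks_nesting, 0);

    /* hardirq level: the matching marked read used to detect an
     * interrupt from idle */
    if (READ_ONCE(rdtp->dynticks_nesting) == 0) {
            /* the interrupt arrived from idle: report a
             * quiescent state */
    }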
2017-11-28  rcu: Stop duplicating lockdep checks in RCU's idle-entry code  (Paul E. McKenney)
The three RCU_LOCKDEP_WARN() calls in rcu_eqs_enter_common() are redundant with other lockdep checks, so this commit removes them. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-11-28  rcu: Add ->dynticks field to rcu_dyntick trace event  (Paul E. McKenney)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-11-28  rcu: Shrink ->dynticks_{nmi_,}nesting from long long to long  (Paul E. McKenney)
Because the ->dynticks_nesting field now only contains the process-based nesting level instead of a value encoding both the process nesting level and the irq "nesting" level, we no longer need a long long, even on 32-bit systems. This commit therefore changes both the ->dynticks_nesting and ->dynticks_nmi_nesting fields to long. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-11-28  rcu: Add tracing to irq/NMI dyntick-idle transitions  (Paul E. McKenney)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>