summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2015-12-04sched/core, locking: Document Program-Order guaranteesPeter Zijlstra
These are some notes on the scheduler locking and how it provides program order guarantees on SMP systems. ( This commit is in the locking tree, because the new documentation refers to a newly introduced locking primitive. ) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04locking, sched: Introduce smp_cond_acquire() and use itPeter Zijlstra
Introduce smp_cond_acquire() which combines a control dependency and a read barrier to form acquire semantics. This primitive has two benefits: - it documents control dependencies, - its typically cheaper than using smp_load_acquire() in a loop. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04Merge branch 'sched/urgent' into locking/core, to pick up scheduler fix we ↵Ingo Molnar
rely on So we want to change a locking API, but the scheduler uses it, and a conflict is generated by a recent scheduler fix. Pick up the pending scheduler fixes to make life easier. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04Merge branch 'sched/urgent' into sched/core, to pick up fixes before ↵Ingo Molnar
applying new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()Peter Zijlstra
Oleg noticed that its possible to falsely observe p->on_cpu == 0 such that we'll prematurely continue with the wakeup and effectively run p on two CPUs at the same time. Even though the overlap is very limited; the task is in the middle of being scheduled out; it could still result in corruption of the scheduler data structures. CPU0 CPU1 set_current_state(...) <preempt_schedule> context_switch(X, Y) prepare_lock_switch(Y) Y->on_cpu = 1; finish_lock_switch(X) store_release(X->on_cpu, 0); try_to_wake_up(X) LOCK(p->pi_lock); t = X->on_cpu; // 0 context_switch(Y, X) prepare_lock_switch(X) X->on_cpu = 1; finish_lock_switch(Y) store_release(Y->on_cpu, 0); </preempt_schedule> schedule(); deactivate_task(X); X->on_rq = 0; if (X->on_rq) // false if (t) while (X->on_cpu) cpu_relax(); context_switch(X, ..) finish_lock_switch(X) store_release(X->on_cpu, 0); Avoid the load of X->on_cpu being hoisted over the X->on_rq load. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/core: Better document the try_to_wake_up() barriersPeter Zijlstra
Explain how the control dependency and smp_rmb() end up providing ACQUIRE semantics and pair with smp_store_release() in finish_lock_switch(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/cputime: Fix invalid gtime in procHiroshi Shimamoto
/proc/stats shows invalid gtime when the thread is running in guest. When vtime accounting is not enabled, we cannot get a valid delta. The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap is only updated when vtime accounting is runtime enabled. This patch makes task_gtime() just return gtime without computing the buggy non-existing tickless delta when vtime accounting is not enabled. Use context_tracking_is_enabled() to check if vtime is accounting on some cpu, in which case only we need to check the tickless delta. This way we fix the gtime value regression on machines not running nohz full. The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full. I ran and stop a busy loop in VM and see the gtime in host. Dump the 43rd field which shows the gtime in every second: # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done S 4348 R 7064566 R 7064766 R 7064967 R 7065168 S 4759 S 4759 During running busy loop, it returns large value. After applying this patch, we can see right gtime. # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done S 5338 R 5365 R 5465 R 5566 R 5666 S 5726 S 5726 Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/core: Clear the root_domain cpumasks in init_rootdomain()Xunlei Pang
root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, this may cause problems. For instance, When doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because the tasks(with all the cpus allowed) belongs to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. Do the same thing for root_domain's other cpumask memembers: dlo_mask, span, and online. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/core: Remove false-positive warning from wake_up_process()Sasha Levin
Because wakeups can (fundamentally) be late, a task might not be in the expected state. Therefore testing against a task's state is racy, and can yield false positives. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: oleg@redhat.com Fixes: 9067ac85d533 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task") Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04sched/wait: Fix signal handling in bit wait helpersPeter Zijlstra
Vladimir reported getting RCU stall warnings and bisected it back to commit: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions") That commit inadvertently reversed the calls to schedule() and signal_pending(), thereby not handling the case where the signal receives while we sleep. Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: mark.rutland@arm.com Cc: neilb@suse.de Cc: oleg@redhat.com Fixes: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions") Fixes: cbbce8220949 ("SCHED: add some "wait..on_bit...timeout()" interfaces.") Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04perf: Fix PERF_EVENT_IOC_PERIOD deadlockPeter Zijlstra
Dmitry reported a fairly silly recursive lock deadlock for PERF_EVENT_IOC_PERIOD, fix this by explicitly doing the inactive part of __perf_event_period() instead of calling that function. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: c7999c6f3fed ("perf: Fix PERF_EVENT_IOC_PERIOD migration race") Link: http://lkml.kernel.org/r/20151130115615.GJ17308@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-03alarmtimer: Avoid unexpected rtc interrupt when system resume from S3zhuo-hao
Before the system go to suspend (S3), if user create a timer with clockid CLOCK_REALTIME_ALARM/CLOCK_BOOTTIME_ALARM and set a "large" timeout value to this timer. The function alarmtimer_suspend will be called to setup a timeout value to RTC timer to avoid the system sleep over time. However, if the system wakeup early than RTC timeout, the RTC timer will not be cleared. And this will cause the hpet_rtc_interrupt come unexpectedly until the RTC timeout. To fix this problem, just adding alarmtimer_resume to cancel the RTC timer. This was noticed because the HPET RTC emulation fires an interrupt every 16ms(=1/2^DEFAULT_RTC_SHIFT) up to the point where the alarm time is reached. This program always hits this situation (https://lkml.org/lkml/2015/11/8/326), if system wake up earlier than alarm time. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Zhuo-hao Lee <zhuo-hao.lee@intel.com> [jstultz: Tweak commit subject & formatting slightly] Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/renesas/ravb_main.c kernel/bpf/syscall.c net/ipv4/ipmr.c All three conflicts were cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "A lot of Thanksgiving turkey leftovers accumulated, here goes: 1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg. 2) IDs for some new iwlwifi chips, from Oren Givon. 3) Fix rtlwifi lockups on boot, from Larry Finger. 4) Fix memory leak in fm10k, from Stephen Hemminger. 5) We have a route leak in the ipv6 tunnel infrastructure, fix from Paolo Abeni. 6) Fix buffer pointer handling in arm64 bpf JIT,f rom Zi Shen Lim. 7) Wrong lockdep annotations in tcp md5 support, fix from Eric Dumazet. 8) Work around some middle boxes which prevent proper handling of TCP Fast Open, from Yuchung Cheng. 9) TCP repair can do huge kmalloc() requests, build paged SKBs instead. From Eric Dumazet. 10) Fix msg_controllen overflow in scm_detach_fds, from Daniel Borkmann. 11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from Nikolay Aleksandrov. 12) Fix use after free in epoll with AF_UNIX sockets, from Rainer Weikusat. 13) Fix double free in VRF code, from Nikolay Aleksandrov. 14) Fix skb leaks on socket receive queue in tipc, from Ying Xue. 15) Fix ifup/ifdown crach in xgene driver, from Iyappan Subramanian. 16) Fix clearing of persistent array maps in bpf, from Daniel Borkmann. 17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq early enough. From Eric Dumazet. 18) Fix out of bounds accesses in bpf array implementation when updating elements, from Daniel Borkmann. 19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric Dumazet. 20) When dumping proxy neigh entries, we have to accomodate NULL device pointers properly, from Konstantin Khlebnikov. 21) SCTP doesn't release all ipv6 socket resources properly, fix from Eric Dumazet. 22) Prevent underflows of sch->q.qlen for multiqueue packet schedulers, also from Eric Dumazet. 23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey Huang and Michael Chan. 24) Don't actively scan radar channels, from Antonio Quartulli" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits) net: phy: reset only targeted phy bnxt_en: Setup uc_list mac filters after resetting the chip. bnxt_en: enforce proper storing of MAC address bnxt_en: Fixed incorrect implementation of ndo_set_mac_address net: lpc_eth: remove irq > NR_IRQS check from probe() net_sched: fix qdisc_tree_decrease_qlen() races openvswitch: fix hangup on vxlan/gre/geneve device deletion ipv4: igmp: Allow removing groups from a removed interface ipv6: sctp: implement sctp_v6_destroy_sock() arm64: bpf: add 'store immediate' instruction ipv6: kill sk_dst_lock ipv6: sctp: add rcu protection around np->opt net/neighbour: fix crash at dumping device-agnostic proxy entries sctp: use GFP_USER for user-controlled kmalloc sctp: convert sack_needed and sack_generation to bits ipv6: add complete rcu protection around np->opt bpf: fix allocation warnings in bpf maps and integer overflow mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0 net: mvneta: enable setting custom TX IP checksum limit net: mvneta: fix error path for building skb ...
2015-12-03Merge tag 'trace-v4.4-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "During the merge window I added a new file that is used to filter trace events on pids. It filters all events where only tasks with their pid in that file exists. It also handles the sched_switch and sched_wakeup trace events where the current task does not have its pid in the file, but the task either being switched to or awaken does. Unfortunately, I forgot about sched_wakeup_new and sched_waking. Both of these tracepoints use the same class as the sched_wakeup tracepoint, and they too should be included in what gets filtered by the set_event_pid file" * tag 'trace-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filter
2015-12-03livepatch: function,sympos scheme in livepatch sysfs directoryChris J Arges
The following directory structure will allow for cases when the same function name exists in a single object. /sys/kernel/livepatch/<patch>/<object>/<function,sympos> The sympos number corresponds to the nth occurrence of the symbol name in kallsyms for the patched object. An example of patching multiple symbols can be found here: https://github.com/dynup/kpatch/issues/493 Signed-off-by: Chris J Arges <chris.j.arges@canonical.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-12-03livepatch: add sympos as disambiguator field to klp_relocChris J Arges
In cases of duplicate symbols, sympos will be used to disambiguate instead of val. By default sympos will be 0, and patching will only succeed if the symbol is unique. Specifying a positive value will ensure that occurrence of the symbol in kallsyms for the patched object will be used for patching if it is valid. For external relocations sympos is not supported. Remove klp_verify_callback, klp_verify_args and klp_verify_vmlinux_symbol as they are no longer used. From the klp_reloc structure remove val, as it can be refactored as a local variable in klp_write_object_relocations. Signed-off-by: Chris J Arges <chris.j.arges@canonical.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-12-03livepatch: add old_sympos as disambiguator field to klp_funcChris J Arges
Currently, patching objects with duplicate symbol names fail because the creation of the sysfs function directory collides with the previous attempt. Appending old_addr to the function name is problematic as it reveals the address of the function being patch to a normal user. Using the symbol's occurrence in kallsyms to postfix the function name in the sysfs directory solves the issue of having consistent unique names and ensuring that the address is not exposed to a normal user. In addition, using the symbol position as the user's method to disambiguate symbols instead of addr allows for disambiguating symbols in modules as well for both function addresses and for relocs. This also simplifies much of the code. Special handling for kASLR is no longer needed and can be removed. The klp_find_verify_func_addr function can be replaced by klp_find_object_symbol, and klp_verify_vmlinux_symbol and its callback can be removed completely. In cases of duplicate symbols, old_sympos will be used to disambiguate instead of old_addr. By default old_sympos will be 0, and patching will only succeed if the symbol is unique. Specifying a positive value will ensure that occurrence of the symbol in kallsyms for the patched object will be used for patching if it is valid. In addition, make old_addr an internal structure field not to be specified by the user. Finally, remove klp_find_verify_func_addr as it can be replaced by klp_find_object_symbol directly. Support for symbol position disambiguation for relocations is added in the next patch in this series. Signed-off-by: Chris J Arges <chris.j.arges@canonical.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-12-03cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friendsOleg Nesterov
Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-12-03Merge branch 'for-4.4-fixes' into for-4.5Tejun Heo
2015-12-03cgroup_pids: don't account for the root cgroupTejun Heo
Because accounting resources for the root cgroup sometimes incurs measureable overhead for workloads which don't care about cgroup and often ends up calculating a number which is available elsewhere in a slightly different form, cgroup is not in the business of providing system-wide statistics. The pids controller which was introduced recently was exposing "pids.current" at the root. This patch disable accounting for root cgroup and removes the file from the root directory. While this is a userland visible behavior change, pids has been available only in one version and was badly broken there, so I don't think this will be noticeable. If it turns out to be a problem, we can reinstate it for v1 hierarchies. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03cgroup: fix handling of multi-destination migration from subtree_control ↵Tejun Heo
enabling Consider the following v2 hierarchy. P0 (+memory) --- P1 (-memory) --- A \- B P0 has memory enabled in its subtree_control while P1 doesn't. If both A and B contain processes, they would belong to the memory css of P1. Now if memory is enabled on P1's subtree_control, memory csses should be created on both A and B and A's processes should be moved to the former and B's processes the latter. IOW, enabling controllers can cause atomic migrations into different csses. The core cgroup migration logic has been updated accordingly but the controller migration methods haven't and still assume that all tasks migrate to a single target css; furthermore, the methods were fed the css in which subtree_control was updated which is the parent of the target csses. pids controller depends on the migration methods to move charges and this made the controller attribute charges to the wrong csses often triggering the following warning by driving a counter negative. WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40() Modules linked in: CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29 ... ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000 ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00 ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8 Call Trace: [<ffffffff81551ffc>] dump_stack+0x4e/0x82 [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0 [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40 [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0 [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330 [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190 [<ffffffff81189016>] cgroup_attach_task+0x176/0x200 [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460 [<ffffffff81189684>] cgroup_procs_write+0x14/0x20 [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0 [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190 [<ffffffff81265f88>] __vfs_write+0x28/0xe0 [<ffffffff812666fc>] vfs_write+0xac/0x1a0 [<ffffffff81267019>] SyS_write+0x49/0xb0 [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76 This patch fixes the bug by removing @css parameter from the three migration methods, ->can_attach, ->cancel_attach() and ->attach() and updating cgroup_taskset iteration helpers also return the destination css in addition to the task being migrated. All controllers are updated accordingly. * Controllers which don't care whether there are one or multiple target csses can be converted trivially. cpu, io, freezer, perf, netclassid and netprio fall in this category. * cpuset's current implementation assumes that there's single source and destination and thus doesn't support v2 hierarchy already. The only change made by this patchset is how that single destination css is obtained. * memory migration path already doesn't do anything on v2. How the single destination css is obtained is updated and the prep stage of mem_cgroup_can_attach() is reordered to accomodate the change. * pids is the only controller which was affected by this bug. It now correctly handles multi-destination migrations and no longer causes counter underflow from incorrect accounting. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in ↵Tejun Heo
freezer_attach() If one or more tasks get moved into a frozen css, the frozen state is cleared up from the destination css so that it can be reasserted once the migrated tasks are frozen. freezer_attach() implements this in two separate steps - clearing CGROUP_FROZEN on the target css while processing each task and propagating the clearing upwards after the task loop is done if necessary. This patch merges the two steps. Propagation now takes place inside the task loop. This simplifies the code and prepares it for the fix of multi-destination migration. Signed-off-by: Tejun Heo <tj@kernel.org>
2015-12-02bpf: fix allocation warnings in bpf maps and integer overflowAlexei Starovoitov
For large map->value_size the user space can trigger memory allocation warnings like: WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989 __alloc_pages_nodemask+0x695/0x14e0() Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50 [<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460 [<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493 [< inline >] __alloc_pages_slowpath mm/page_alloc.c:2989 [<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235 [<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055 [< inline >] alloc_pages include/linux/gfp.h:451 [<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414 [<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007 [<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018 [< inline >] kmalloc_large include/linux/slab.h:390 [<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525 [< inline >] kmalloc include/linux/slab.h:463 [< inline >] map_update_elem kernel/bpf/syscall.c:288 [< inline >] SYSC_bpf kernel/bpf/syscall.c:744 To avoid never succeeding kmalloc with order >= MAX_ORDER check that elem->value_size and computed elem_size are within limits for both hash and array type maps. Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings. Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size <= 512, so keep those kmalloc-s as-is. Large value_size can cause integer overflows in elem_size and map.pages formulas, so check for that as well. Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01bpf, array: fix heap out-of-bounds access when updating elementsDaniel Borkmann
During own review but also reported by Dmitry's syzkaller [1] it has been noticed that we trigger a heap out-of-bounds access on eBPF array maps when updating elements. This happens with each map whose map->value_size (specified during map creation time) is not multiple of 8 bytes. In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and used to align array map slots for faster access. However, in function array_map_update_elem(), we update the element as ... memcpy(array->value + array->elem_size * index, value, array->elem_size); ... where we access 'value' out-of-bounds, since it was allocated from map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER) and later on copied through copy_from_user(value, uvalue, map->value_size). Thus, up to 7 bytes, we can access out-of-bounds. Same could happen from within an eBPF program, where in worst case we access beyond an eBPF program's designated stack. Since 1be7f75d1668 ("bpf: enable non-root eBPF programs") didn't hit an official release yet, it only affects priviledged users. In case of array_map_lookup_elem(), the verifier prevents eBPF programs from accessing beyond map->value_size through check_map_access(). Also from syscall side map_lookup_elem() only copies map->value_size back to user, so nothing could leak. [1] http://github.com/google/syzkaller Fixes: 28fbcfa08d8e ("bpf: add array type of eBPF maps") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filterSteven Rostedt (Red Hat)
The set_event_pid filter relies on attaching to the sched_switch and sched_wakeup tracepoints to see if it should filter the tracing on schedule tracepoints. By adding the callbacks to sched_wakeup, pids in the set_event_pid file will trace the wakeups of those tasks with those pids. But sched_wakeup_new and sched_waking were missed. These two should also be traced. Luckily, these tracepoints share the same class as sched_wakeup which means they can use the same pre and post callbacks as sched_wakeup does. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-12-01vfs: add copy_file_range syscall and vfs helperZach Brown
Add a copy_file_range() system call for offloading copies between regular files. This gives an interface to underlying layers of the storage stack which can copy without reading and writing all the data. There are a few candidates that should support copy offloading in the nearer term: - btrfs shares extent references with its clone ioctl - NFS has patches to add a COPY command which copies on the server - SCSI has a family of XCOPY commands which copy in the device This system call avoids the complexity of also accelerating the creation of the destination file by operating on an existing destination file descriptor, not a path. Currently the high level vfs entry point limits copy offloading to files on the same mount and super (and not in the same file). This can be relaxed if we get implementations which can copy between file systems safely. Signed-off-by: Zach Brown <zab@redhat.com> [Anna Schumaker: Change -EINVAL to -EBADF during file verification, Change flags parameter from int to unsigned int, Add function to include/linux/syscalls.h, Check copy len after file open mode, Don't forbid ranges inside the same file, Use rw_verify_area() to veriy ranges, Use file_out rather than file_in, Add COPY_FR_REFLINK flag] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-30Merge tag 'trace-v4.4-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "I found two minor bugs while doing development on the ring buffer code. The first is something that's been there since its creation. If a reader reads a page out of the ring buffer before there's any events on it, it can get an out of date timestamp for that event. It may be off by a few microseconds, more if the first event gets discarded. The fix was to only update the reader time stamp when it actually sees an event on the page, instead of just reading the timestamp from the page even if it has no events on it. That timestamp is still volatile until an event is present. The second bug is more recent. Instead of passing around parameters a descriptor was made and the parameters are passed via a single descriptor. This simplified the code a bit. But there was one place that expected the parameter to be passed by value not reference (which a descriptor now does). And it added to the length of the event, which may be ignored later, but the length should not have been increased. The only real problem with this bug is that it may allocate more than was needed for the event" * tag 'trace-v4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Put back the length if crossed page with add_timestamp ring-buffer: Update read stamp with first real commit on page
2015-12-01Merge tag 'drm-intel-next-2015-11-20-merged' of ↵Dave Airlie
git://anongit.freedesktop.org/drm-intel into drm-next drm-intel-next-2015-11-20-rebased: 4 weeks because of my vacation, so a bit more: - final bits of the typesafe register mmio functions (Ville) - power domain fix for hdmi detection (Imre) - tons of fixes and improvements to the psr code (Rodrigo) - refactoring of the dp detection code (Ander) - complete rework of the dmc loader and dc5/dc6 handling (Imre, Patrik and others) - dp compliance improvements from Shubhangi Shrivastava - stop_machine hack from Chris to fix corruptions when updating GTT ptes on bsw - lots of fifo underrun fixes from Ville - big pile of fbc fixes and improvements from Paulo - fix fbdev failures paths (Tvrtko and Lukas Wunner) - dp link training refactoring (Ander) - interruptible prepare_plane for atomic (Maarten) - basic kabylake support (Deepak&Rodrigo) - don't leak ringspace on resets (Chris) drm-intel-next-2015-10-23: - 2nd attempt at atomic watermarks from Matt, but just prep for now - fixes all over * tag 'drm-intel-next-2015-11-20-merged' of git://anongit.freedesktop.org/drm-intel: (209 commits) drm/i915: Update DRIVER_DATE to 20151120 drm/i915: take a power domain reference while checking the HDMI live status drm/i915: take a power domain ref only when needed during HDMI detect drm/i915: Tear down fbdev if initialization fails async: export current_is_async() Revert "drm/i915: Initialize HWS page address after GPU reset" drm/i915: Fix oops caused by fbdev initialization failure drm/i915: Fix i915_ggtt_view_equal to handle rotation correctly drm/i915: Stuff rotation params into view union drm/i915: Drop return value from intel_fill_fb_ggtt_view drm/i915 : Fix to remove unnecsessary checks in postclose function. drm/i915: add MISSING_CASE to a few port/aux power domain helpers drm/i915/ddi: fix intel_display_port_aux_power_domain() after HDMI detect drm/i915: Remove platform specific *_dp_detect() functions drm/i915: Don't do edp panel detection in g4x_dp_detect() drm/i915: Send TP1 TP2/3 even when panel claims no NO_TRAIN_ON_EXIT. drm/i915: PSR: Don't Skip aux handshake on DP_PSR_NO_TRAIN_ON_EXIT. drm/i915: Reduce PSR re-activation time for VLV/CHV. drm/i915: Delay first PSR activation. drm/i915: Type safe register read/write ...
2015-11-30cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork()Oleg Nesterov
Now that we know that the forking task can't migrate amd the child is always moved to the same cgroup by cgroup_post_fork()->css_set_move_task() we can change pids_can_fork() and pids_cancel_fork() to just use task_css(current). And since we no longer need to pin this css, we can remove pid_fork(). Note: the patch uses task_css_check(true), perhaps it makes sense to add a helper or change task_css_set_check() to take cgroup_threadgroup_rwsem into account. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-30cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()Oleg Nesterov
If the new child migrates to another cgroup before cgroup_post_fork() calls subsys->fork(), then both pids_can_attach() and pids_fork() will do the same pids_uncharge(old_pids) + pids_charge(pids) sequence twice. Change copy_process() to call threadgroup_change_begin/threadgroup_change_end unconditionally. percpu_down_read() is cheap and this allows other cleanups, see the next changes. Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-30cgroup: make css_set pin its css's to avoid use-afer-freeTejun Heo
A css_set represents the relationship between a set of tasks and css's. css_set never pinned the associated css's. This was okay because tasks used to always disassociate immediately (in RCU sense) - either a task is moved to a different css_set or exits and never accesses css_set again. Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller") and patches leading up to it made a zombie hold onto its css_set and deref the associated css's on its release. Nothing pins the css's after exit and it might have already been freed leading to use-after-free. general protection fault: 0000 [#1] PREEMPT SMP task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000 RIP: 0010:[<ffffffff810fa205>] [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40 ... Call Trace: <IRQ> [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0 [<ffffffff810f8893>] cgroup_free+0x53/0xe0 [<ffffffff8104ed62>] __put_task_struct+0x42/0x130 [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130 [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820 [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820 [<ffffffff81056e54>] __do_softirq+0xd4/0x460 [<ffffffff81057369>] irq_exit+0x89/0xa0 [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50 [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90 <EOI> ... Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02 RIP [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40 RSP <ffff88001fc03e20> ---[ end trace 89a4a4b916b90c49 ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt Fix it by making css_set pin the associate css's until its release. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dave Jones <davej@codemonkey.org.uk> Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de Fixes: afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
2015-11-25nohz: Clarify magic in tick_nohz_stop_sched_tick()Peter Zijlstra
While going through the nohz code I got stumped by some of it. This patch adds a few comments clarifying the code; based on discussion with Thomas. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20151119162106.GO3816@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-11-25ftrace: Show all tramps registered to a record on ftrace_bug()Steven Rostedt (Red Hat)
When an anomaly is detected in the function call modification code, ftrace_bug() is called to disable function tracing as well as give any information that may help debug the problem. Currently, only the first found trampoline that is attached to the failed record is reported. Instead, show all trampolines that are hooked to it. Also, not only show the ops pointer but also report the function it calls. While at it, add this info to the enabled_functions debug file too. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25ftrace: Add variable ftrace_expected for archs to show expected codeSteven Rostedt (Red Hat)
When an anomaly is found while modifying function code, ftrace_bug() is called which disables the function tracing infrastructure and reports information about what failed. If the code that is to be replaced does not match what is expected, then actual code is shown. Currently there is no arch generic way to show what was expected. Add a new variable pointer calld ftrace_expected that the arch code can set to point to what it expected so that ftrace_bug() can report the actual text as well as the text that was expected to be there. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25ftrace: Add new type to distinguish what kind of ftrace_bug()Steven Rostedt (Red Hat)
The ftrace function hook utility has several internal checks to make sure that whatever it modifies is exactly what it expects to be modifying. This is essential as modifying running code can be extremely dangerous to the system. When an anomaly is detected, ftrace_bug() is called which sends a splat to the console and disables function tracing. There's some extra information that is printed to help diagnose the issue. One thing that is missing though is output of what ftrace was doing at the time of the crash. Was it updating a call site or perhaps converting a call site to a nop? A new global enum variable is created to state what ftrace was doing at the time of the anomaly, and this is reported in ftrace_bug(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25tracing: Update cond flag when enabling or disabling a triggerTom Zanussi
When a trigger is enabled, the cond flag should be set beforehand, otherwise a trigger that's expecting to process a trace record (e.g. one with post_trigger set) could be invoked without one. Likewise a trigger's cond flag should be reset after it's disabled, not before. Link: http://lkml.kernel.org/r/a420b52a67b1c2d3cab017914362d153255acb99.1448303214.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25ring-buffer: Process commits whenever moving to a new page.Steven Rostedt (Red Hat)
When crossing over to a new page, commit the current work. This will allow readers to get data with less latency, and also simplifies the work to get timestamps working for interrupted events. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25cpuset: Replace all instances of time_t with time64_tArnd Bergmann
The following patch replaces all instances of time_t with time64_t i.e. change the type used for representing time from 32-bit to 64-bit. All 32-bit kernels to date use a signed 32-bit time_t type, which can only represent time until January 2038. Since embedded systems running 32-bit Linux are going to survive beyond that date, we have to change all current uses, in a backwards compatible way. The patch also changes the function get_seconds() that returns a 32-bit integer to ktime_get_seconds() that returns seconds as 64-bit integer. The patch changes the type of ticks from time_t to u32. We keep ticks as 32-bits as the function uses 32-bit arithmetic which would prove less expensive than 64-bit arithmetic and the function is expected to be called atleast once every 32 seconds. Signed-off-by: Heena Sirwani <heenasirwani@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-25bpf: fix clearing on persistent program array mapsDaniel Borkmann
Currently, when having map file descriptors pointing to program arrays, there's still the issue that we unconditionally flush program array contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens when such a file descriptor is released and is independent of the map's refcount. Having this flush independent of the refcount is for a reason: there can be arbitrary complex dependency chains among tail calls, also circular ones (direct or indirect, nesting limit determined during runtime), and we need to make sure that the map drops all references to eBPF programs it holds, so that the map's refcount can eventually drop to zero and initiate its freeing. Btw, a walk of the whole dependency graph would not be possible for various reasons, one being complexity and another one inconsistency, i.e. new programs can be added to parts of the graph at any time, so there's no guaranteed consistent state for the time of such a walk. Now, the program array pinning itself works, but the issue is that each derived file descriptor on close would nevertheless call unconditionally into bpf_fd_array_map_clear(). Instead, keep track of users and postpone this flush until the last reference to a user is dropped. As this only concerns a subset of references (f.e. a prog array could hold a program that itself has reference on the prog array holding it, etc), we need to track them separately. Short analysis on the refcounting: on map creation time usercnt will be one, so there's no change in behaviour for bpf_map_release(), if unpinned. If we already fail in map_create(), we are immediately freed, and no file descriptor has been made public yet. In bpf_obj_pin_user(), we need to probe for a possible map in bpf_fd_probe_obj() already with a usercnt reference, so before we drop the reference on the fd with fdput(). Therefore, if actual pinning fails, we need to drop that reference again in bpf_any_put(), otherwise we keep holding it. When last reference drops on the inode, the bpf_any_put() in bpf_evict_inode() will take care of dropping the usercnt again. In the bpf_obj_get_user() case, the bpf_any_get() will grab a reference on the usercnt, still at a time when we have the reference on the path. Should we later on fail to grab a new file descriptor, bpf_any_put() will drop it, otherwise we hold it until bpf_map_release() time. Joint work with Alexei. Fixes: b2197755b263 ("bpf: add support for persistent maps/progs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24pidns: fix NULL dereference in __task_pid_nr_ns()Eric Dumazet
I got a crash during a "perf top" session that was caused by a race in __task_pid_nr_ns() : pid_nr_ns() was inlined, but apparently compiler chose to read task->pids[type].pid twice, and the pid->level dereference crashed because we got a NULL pointer at the second read : if (pid && ns->level <= pid->level) { // CRASH Just use RCU API properly to solve this race, and not worry about "perf top" crashing hosts :( get_task_pid() can benefit from same fix. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-24ring-buffer: Remove redundant update of page timestampSteven Rostedt (Red Hat)
The first commit of a buffer page updates the timestamp of that page. No need to have the update to the next page add the timestamp too. It will only be replaced by the first commit on that page anyway. Only update to a page if it contains an event. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24ring-buffer: Use READ_ONCE() for most tail_page accessSteven Rostedt (Red Hat)
As cpu_buffer->tail_page may be modified by interrupts at almost any time, the flow of logic is very important. Do not let gcc get smart with re-reading cpu_buffer->tail_page by adding READ_ONCE() around most of its accesses. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24ring-buffer: Put back the length if crossed page with add_timestampSteven Rostedt (Red Hat)
Commit fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing data" added a descriptor that holds various data instead of passing around several variables through parameters. The problem was that one of the parameters was modified in a function and the code was designed not to have an effect on that modified parameter. Now that the parameter is a descriptor and any modifications to it are non-volatile, the size of the data could be unnecessarily expanded. Remove the extra space added if a timestamp was added and the event went across the page. Cc: stable@vger.kernel.org # 4.3+ Fixes: fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing data" Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24ring-buffer: Update read stamp with first real commit on pageSteven Rostedt (Red Hat)
Do not update the read stamp after swapping out the reader page from the write buffer. If the reader page is swapped out of the buffer before an event is written to it, then the read_stamp may get an out of date timestamp, as the page timestamp is updated on the first commit to that page. rb_get_reader_page() only returns a page if it has an event on it, otherwise it will return NULL. At that point, check if the page being returned has events and has not been read yet. Then at that point update the read_stamp to match the time stamp of the reader page. Cc: stable@vger.kernel.org # 2.6.30+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24context_tracking: Switch to new static_branch APIAndy Lutomirski
This is much less error-prone than the old code. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/812df7e64f120c5c7c08481f36a8caa9f53b2199.1447361906.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23rcu: Add transitivity to remaining rcu_node ->lock acquisitionsPaul E. McKenney
The rule is that all acquisitions of the rcu_node structure's ->lock must provide transitivity: The lock is not acquired that frequently, and sorting out exactly which required it and which did not would be a maintenance nightmare. This commit therefore supplies the needed transitivity to the remaining ->lock acquisitions. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-11-23rcu: Create transitive rnp->lock acquisition functionsPeter Zijlstra
Providing RCU's memory-ordering guarantees requires that the rcu_node tree's locking provide transitive memory ordering, which the Linux kernel's spinlocks currently do not provide unless smp_mb__after_unlock_lock() is used. Having a separate smp_mb__after_unlock_lock() after each and every lock acquisition is error-prone, hard to read, and a bit annoying, so this commit provides wrapper functions that pull in the smp_mb__after_unlock_lock() invocations. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-11-23locking/pvqspinlock, x86: Optimize the PV unlock code pathWaiman Long
The unlock function in queued spinlocks was optimized for better performance on bare metal systems at the expense of virtualized guests. For x86-64 systems, the unlock call needs to go through a PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit registers before calling the real __pv_queued_spin_unlock() function. The thunk code may also be in a separate cacheline from __pv_queued_spin_unlock(). This patch optimizes the PV unlock code path by: 1) Moving the unlock slowpath code from the fastpath into a separate __pv_queued_spin_unlock_slowpath() function to make the fastpath as simple as possible.. 2) For x86-64, hand-coded an assembly function to combine the register saving thunk code with the fastpath code. Only registers that are used in the fastpath will be saved and restored. If the fastpath fails, the slowpath function will be called via another PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C __pv_queued_spin_unlock() code as the thunk saves and restores only one 32-bit register. With a microbenchmark of 5M lock-unlock loop, the table below shows the execution times before and after the patch with different number of threads in a VM running on a 32-core Westmere-EX box with x86-64 4.2-rc1 based kernels: Threads Before patch After patch % Change ------- ------------ ----------- -------- 1 134.1 ms 119.3 ms -11% 2 1286 ms 953 ms -26% 3 3715 ms 3480 ms -6.3% 4 4092 ms 3764 ms -8.0% Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447114167-47185-5-git-send-email-Waiman.Long@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23locking/qspinlock: Avoid redundant read of next pointerWaiman Long
With optimistic prefetch of the next node cacheline, the next pointer may have been properly inititalized. As a result, the reading of node->next in the contended path may be redundant. This patch eliminates the redundant read if the next pointer value is not NULL. Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org>