path: root/kernel
Age  Commit message  Author
2014-07-05  locking/mutexes: Delete the MUTEX_SHOW_NO_WAITER macro  (Jason Low)
MUTEX_SHOW_NO_WAITER() is a macro which checks whether there are "no waiters" on a mutex by checking if the lock count is non-negative. Based on feedback from the discussion in the earlier version of this patchset, the macro is not very readable. Furthermore, checking lock->count isn't always the correct way to determine if there are "no waiters" on a mutex. For example, a negative count on a mutex really only means that there "potentially" are waiters. Likewise, there can be waiters on the mutex even if the count is non-negative. Thus, "MUTEX_SHOW_NO_WAITER" doesn't always do what the name of the macro suggests. So this patch deletes the MUTEX_SHOW_NO_WAITER() macro, uses atomic_read() directly in its place, and adds comments which elaborate on how the extra atomic_read() checks can help reduce unnecessary xchg() operations. Signed-off-by: Jason Low <jason.low2@hp.com> Acked-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: akpm@linux-foundation.org Cc: tim.c.chen@linux.intel.com Cc: paulmck@linux.vnet.ibm.com Cc: rostedt@goodmis.org Cc: davidlohr@hp.com Cc: scott.norton@hp.com Cc: aswin@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1402511843-4721-3-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
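A minimal sketch of the pattern the patch keeps (the helper name mutex_try_fast_acquire() is illustrative, not the kernel's own): peek at the counter with a cheap atomic_read() and only pay for the xchg() when the lock looks free.

    /*
     * Illustrative check, not the exact mutex.c code: a negative count only
     * means there may be waiters, so atomic_read() is used as a cheap hint
     * before attempting the more expensive atomic_xchg().
     */
    static inline bool mutex_try_fast_acquire(struct mutex *lock)
    {
        return atomic_read(&lock->count) >= 0 &&
               atomic_xchg(&lock->count, -1) == 1;
    }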
2014-07-05  locking/mutexes: Correct documentation on mutex optimistic spinning  (Jason Low)
The mutex optimistic spinning documentation states that we spin for acquisition when we find that there are no pending waiters. However, in actuality, whether or not there are waiters for the mutex doesn't determine if we will spin for it. This patch removes that statement and also adds a comment which mentions that we spin for the mutex while we don't need to reschedule. Signed-off-by: Jason Low <jason.low2@hp.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: akpm@linux-foundation.org Cc: tim.c.chen@linux.intel.com Cc: paulmck@linux.vnet.ibm.com Cc: rostedt@goodmis.org Cc: Waiman.Long@hp.com Cc: scott.norton@hp.com Cc: aswin@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1402511843-4721-2-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  perf: Make perf_event_init_context() function static  (Jiri Olsa)
Leftover from '8dc85d5 perf: Multiple task contexts'. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/1403598026-2310-1-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  sched: Rework check_for_tasks()  (Kirill Tkhai)
1) Iterate through all threads in the system; check all threads, not only group leaders. 2) Check p->on_rq instead of p->state and cputime: a preempted task in a !TASK_RUNNING state or a just-created task may be queued, and we want those to be reported too. 3) Use read_lock() instead of write_lock(); this function does not change any structures, so read_lock() is enough. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ben Segall <bsegall@google.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Cc: Konstantin Khorenko <khorenko@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael wang <wangyun@linux.vnet.ibm.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paul Turner <pjt@google.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Todd E Brandt <todd.e.brandt@linux.intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403684395.3462.44.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
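A simplified sketch of the reworked check along the lines described above (the exact reporting format is illustrative):

    /*
     * Walk every thread under read_lock() and report any task still
     * queued on the dead CPU, based on p->on_rq rather than p->state
     * and cputime.
     */
    static void check_for_tasks(int dead_cpu)
    {
        struct task_struct *g, *p;

        read_lock(&tasklist_lock);
        for_each_process_thread(g, p) {
            if (!p->on_rq || task_cpu(p) != dead_cpu)
                continue;

            pr_warn("Task %s (pid=%d) is on cpu %d (state=%ld, flags=%x)\n",
                    p->comm, task_pid_nr(p), dead_cpu, p->state, p->flags);
        }
        read_unlock(&tasklist_lock);
    }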
2014-07-05  sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()  (Kirill Tkhai)
Make rt_rq available for pick_next_task(). Otherwise, their tasks stay imprisoned for a long time until the dead cpu becomes alive again. Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> CC: Konstantin Khorenko <khorenko@parallels.com> CC: Ben Segall <bsegall@google.com> CC: Paul Turner <pjt@google.com> CC: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403684388.3462.43.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  sched/fair: Disable runtime_enabled on dying rq  (Kirill Tkhai)
We kill rq->rd on the CPU_DOWN_PREPARE stage: cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains -> -> cpu_attach_domain -> rq_attach_root -> set_rq_offline This unthrottles all throttled cfs_rqs. But the cpu is still able to call schedule() till take_cpu_down->__cpu_disable() is called from stop_machine. In this case the tasks from the just unthrottled cfs_rqs are pickable in a standard scheduler way, and they are picked by the dying cpu. The cfs_rqs become throttled again, and migrate_tasks() in migration_call skips their tasks (one more unthrottle in migrate_tasks()->CPU_DYING does not happen, because rq->rd is already NULL). This patch sets runtime_enabled to zero. This guarantees that the runtime is not accounted, that the cfs_rqs won't exceed the given cfs_rq->runtime_remaining = 1, and that the tasks will be pickable in migrate_tasks(). runtime_enabled is recalculated again when the rq becomes online again. Ben Segall also noticed that we always enable runtime in tg_set_cfs_bandwidth(). Actually, we should do that for online cpus only. To prevent races with unthrottle_offline_cfs_rqs(), we take the get_online_cpus() lock. Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> CC: Konstantin Khorenko <khorenko@parallels.com> CC: Paul Turner <pjt@google.com> CC: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403684382.3462.42.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
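A sketch of the offline path described above, using the fair-scheduler helper names of that era (simplified, not the verbatim patch):

    /*
     * When the rq goes offline, stop bandwidth accounting so the
     * unthrottled cfs_rqs cannot be throttled again before
     * migrate_tasks() has run.
     */
    static void unthrottle_offline_cfs_rqs(struct rq *rq)
    {
        struct cfs_rq *cfs_rq;

        for_each_leaf_cfs_rq(rq, cfs_rq) {
            /* no runtime accounting on a dying rq */
            cfs_rq->runtime_enabled = 0;

            if (!cfs_rq_throttled(cfs_rq))
                continue;

            /* leave 1 ns so the tasks stay pickable until migration */
            cfs_rq->runtime_remaining = 1;
            unthrottle_cfs_rq(cfs_rq);
        }
    }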
2014-07-05  sched/numa: Change scan period code to match intent  (Rik van Riel)
Reading through the scan period code and comment, it appears the intent was to slow down NUMA scanning when a majority of accesses are on the local node, specifically a local:remote ratio of 3:1. However, the code actually tests local / (local + remote), and the actual cut-off point was around 30% local accesses, well before a task has actually converged on a node. Changing the threshold to 7 means scanning slows down when a task has around 70% of its accesses local, which appears to match the intent of the code more closely. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538095-31256-8-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
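A small illustration of the cut-off, assuming the ratio is expressed in tenths of the total accesses (the helper name is illustrative):

    /*
     * The code compares local / (local + remote) against threshold/10:
     * a threshold of 3 slows scanning at ~30% local accesses, while 7
     * only slows it once ~70% of the accesses are local.
     */
    static int local_access_slots(unsigned long local, unsigned long remote)
    {
        if (!local && !remote)
            return 0;
        return (int)(10 * local / (local + remote));    /* 0..10 */
    }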
2014-07-05  sched/numa: Rework best node setting in task_numa_migrate()  (Rik van Riel)
Fix up the best node setting in task_numa_migrate() to deal with a task in a pseudo-interleaved NUMA group, which is already running in the best location. Set the task's preferred nid to the current nid, so task migration is not retried at a high rate. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538095-31256-7-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  sched/numa: Examine a task move when examining a task swap  (Rik van Riel)
Running "perf bench numa mem -0 -m -P 1000 -p 8 -t 20" on a 4 node system results in 160 runnable threads on a system with 80 CPU threads. Once a process has nearly converged, with 39 threads on one node and 1 thread on another node, the remaining thread will be unable to migrate to its preferred node through a task swap. However, a simple task move would make the workload converge, witout causing an imbalance. Test for this unlikely occurrence, and attempt a task move to the preferred nid when it happens. # Running main, "perf bench numa mem -p 8 -t 20 -0 -m -P 1000" ### # 160 tasks will execute (on 4 nodes, 80 CPUs): # -1x 0MB global shared mem operations # -1x 1000MB process shared mem operations # -1x 0MB thread local mem operations ### ### # # 0.0% [0.2 mins] 0/0 1/1 36/2 0/0 [36/3 ] l: 0-0 ( 0) {0-2} # 0.0% [0.3 mins] 43/3 37/2 39/2 41/3 [ 6/10] l: 0-1 ( 1) {1-2} # 0.0% [0.4 mins] 42/3 38/2 40/2 40/2 [ 4/9 ] l: 1-2 ( 1) [50.0%] {1-2} # 0.0% [0.6 mins] 41/3 39/2 40/2 40/2 [ 2/9 ] l: 2-4 ( 2) [50.0%] {1-2} # 0.0% [0.7 mins] 40/2 40/2 40/2 40/2 [ 0/8 ] l: 3-5 ( 2) [40.0%] ( 41.8s converged) Without this patch, this same perf bench numa mem run had to rely on the scheduler load balancer to first balance out the load (moving a random task), before a task swap could complete the NUMA convergence. The load balancer does not normally take action unless the load difference exceeds 25%. Convergence times of over half an hour have been observed without this patch. With this patch, the NUMA balancing code will simply migrate the task, if that does not cause an imbalance. Also skip examining a CPU in detail if the improvement on that CPU is no more than the best we already have. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: chegu_vinod@hp.com Cc: mgorman@suse.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ggthh0rnh0yua6o5o3p6cr1o@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  sched/numa: Simplify task_numa_compare()  (Rik van Riel)
When a task is part of a numa_group, the comparison should always use the group weight, in order to make workloads converge. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: chegu_vinod@hp.com Cc: mgorman@suse.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538378-31571-4-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  sched/numa: Use effective_load() to balance NUMA loads  (Rik van Riel)
When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places on a CPU is determined by the group the task is in. The active groups on the source and destination CPU can be different, resulting in a different load contribution by the same task at its source and at its destination. As a result, the load needs to be calculated separately for each CPU, instead of estimated once with task_h_load(). Getting this calculation right allows some workloads to converge, where previously the last thread could get stuck on another node, without being able to migrate to its final destination. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538378-31571-3-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  sched/numa: Move power adjustment into load_too_imbalanced()  (Rik van Riel)
Currently the NUMA code scales the load on each node with the amount of CPU power available on that node, but it does not apply any adjustment to the load of the task that is being moved over. On systems with SMT/HT, this results in a task being weighed much more heavily than a CPU core, and a task move that would even out the load between nodes being disallowed. The correct thing is to apply the power correction to the numbers after we have first applied the move of the tasks' loads to them. This also allows us to do the power correction with a multiplication, rather than a division. Also drop two function arguments from load_too_imbalanced(), since it takes various factors from env already. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: chegu_vinod@hp.com Cc: mgorman@suse.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538378-31571-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
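A sketch of the "multiply instead of divide" comparison; the signature and names are illustrative, not the kernel's exact load_too_imbalanced():

    /*
     * Instead of comparing dst_load/dst_power with src_load/src_power,
     * compare the cross products after the task's load has already been
     * applied to the per-node numbers. Avoids a division entirely.
     */
    static bool numa_move_too_imbalanced(long src_load, long dst_load,
                                         long src_power, long dst_power)
    {
        return dst_load * src_power > src_load * dst_power;
    }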
2014-07-05  sched/numa: Use group's max nid as task's preferred nid  (Rik van Riel)
From task_numa_placement, always try to consolidate the tasks in a group on the group's top nid. In case this task is part of a group that is interleaved over multiple nodes, task_numa_migrate will set the task's preferred nid to the best node it could find for the task, so this patch will cause at most one run through task_numa_migrate. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538095-31256-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  sched/fair: Implement fast idling of CPUs when the system is partially loaded  (Tim Chen)
When a system is lightly loaded (i.e. no more than 1 job per cpu), attempting to pull a job to a cpu before putting it to idle is unnecessary and can be skipped. This patch adds an indicator so the scheduler can know when there's no more than 1 active job on any CPU in the system, and skip needless job pulls. On a 4 socket machine with a request/response kind of workload from clients, we saw about 0.13 msec delay when we go through a full load balance to try to pull a job from all the other cpus. While 0.1 msec was spent on processing the request and generating a response, the 0.13 msec load balance overhead was actually more than the actual work being done. This overhead can be skipped much of the time for lightly loaded systems. With this patch, we tested with a netperf request/response workload that has the server busy with half the cpus in a 4 socket system. We found the patch eliminated 75% of the load balance attempts before idling a cpu. The overhead of setting/clearing the indicator is low as we already gather the necessary info while we call add_nr_running() and update_sd_lb_stats(). We switch to full load balancing immediately if any cpu gets more than one job on its run queue in add_nr_running(). We'll clear the indicator to avoid load balancing when we detect that no cpus have more than one job when we scan the run queues in update_sg_lb_stats(). We are aggressive in turning on the load balance and opportunistic in skipping the load balance. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Jason Low <jason.low2@hp.com> Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK Signed-off-by: Ingo Molnar <mingo@kernel.org>
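A sketch of the eager "set" side of such an indicator; the rd->overload field name is an assumption about where a root-domain-wide flag would live:

    /*
     * Set the indicator as soon as any rq gains a second runnable task;
     * it is cleared later from update_sg_lb_stats() when no rq has more
     * than one task.
     */
    static inline void add_nr_running(struct rq *rq, unsigned int count)
    {
        unsigned int prev_nr = rq->nr_running;

        rq->nr_running = prev_nr + count;

        if (prev_nr < 2 && rq->nr_running >= 2) {
            if (!rq->rd->overload)
                rq->rd->overload = true;
        }
    }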
2014-07-05  sched/idle: Drop !! while calculating 'broadcast'  (Viresh Kumar)
We don't need 'broadcast' to be set to 'zero or one', but to 'zero or non-zero' and so the extra operation to convert it to 'zero or one' can be skipped. Also change type of 'broadcast' to unsigned int, i.e. type of drv->states[*].flags. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: linaro-kernel@lists.linaro.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/0dfbe2976aa108c53e08d3477ea90f6360c1f54c.1403584026.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
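Illustration of the change (next_state is an illustrative index variable):

    /*
     * Before (forces the value to 0 or 1):
     *     int broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP);
     *
     * After (zero vs. non-zero is enough, and the type matches the flags field):
     */
    unsigned int broadcast = drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP;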
2014-07-05  sched: Fix clock_gettime(CLOCK_[PROCESS/THREAD]_CPUTIME_ID) monotonicity  (Mike Galbraith)
If a task has been dequeued, it has been accounted. Do not project cycles that may or may not ever be accounted to a dequeued task, as that may make clock_gettime() both inaccurate and non-monotonic. Protect update_rq_clock() from slight TSC skew while at it. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: kosaki.motohiro@jp.fujitsu.com Cc: pjt@google.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403588980.29711.11.camel@marge.simpson.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  sched: Fix potential near-infinite distribute_cfs_runtime() loop  (Ben Segall)
distribute_cfs_runtime() intentionally only hands out enough runtime to bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take the runtime they need only once they actually get to run. However, if they get to run sufficiently quickly, the period timer is still in distribute_cfs_runtime() and no runtime is available, causing them to throttle. Then distribute has to handle them again, and this can go on until distribute has handed out all of the runtime 1ns at a time, which takes far too long. Instead allow access to the same runtime that distribute is handing out, accepting that corner cases with very low quota may be able to spend the entire cfs_b->runtime during distribute_cfs_runtime, meaning that the runtime directly handed out by distribute_cfs_runtime was over quota. In addition, if a cfs_rq does manage to throttle like this, make sure the existing distribute_cfs_runtime no longer loops over it again. Signed-off-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140620222120.13814.21652.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  sched/core: Fix formatting issues in sched_can_stop_tick()  (Viresh Kumar)
sched_can_stop_tick() is using 7 spaces instead of 8 spaces or a 'tab' at the beginning of a few lines, which doesn't align well with the coding guidelines. Also remove the local variable 'rq', as it is used in only one place and we can use this_rq() directly instead. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: fweisbec@gmail.com Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/afb781733e4a9ffbced5eb9fd25cc0aa5c6ffd7a.1403596966.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
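A sketch of the simplified shape of the function after the cleanup (details of the real sched_can_stop_tick() are elided):

    bool sched_can_stop_tick(void)
    {
        /* more than one runnable task means the tick must keep running */
        if (this_rq()->nr_running > 1)
            return false;

        return true;
    }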
2014-07-05  irq_work: Remove BUG_ON in irq_work_run()  (Peter Zijlstra)
Remove the BUG_ON() in irq_work_run() because of a collision with 8d056c48e486 ("CPU hotplug, smp: flush any pending IPI callbacks before CPU offline"), which ends up calling hotplug_cfd()->flush_smp_call_function_queue()->irq_work_run(), which is not from IRQ context. And since that already calls irq_work_run() from the hotplug path, remove our entire hotplug handling. Reported-by: Stephen Warren <swarren@wwwdotorg.org> Tested-by: Stephen Warren <swarren@wwwdotorg.org> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-busatzs2gvz4v62258agipuf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05  Merge branch 'timers/nohz' into sched/core  (Ingo Molnar)
Merge these two, because upcoming patches will touch both areas. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-03  Merge tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)
Pull tracing fixes from Steven Rostedt: "Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code where running: # perf probe -x /lib/libc.so.6 syscall # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable # perf record -e probe_libc:syscall whatever kills the uprobe. Along the way he found some other minor bugs and clean ups that he fixed up making it a total of 4 patches. Doing unrelated work, I found that the reading of the ftrace trace file disables all function tracer callbacks. This was fine when ftrace was the only user, but now that it's used by perf and kprobes, this is a bug where reading trace can disable kprobes and perf. A very unexpected side effect and should be fixed" * tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Remove ftrace_stop/start() from reading the trace file tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable() tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher() uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone tracing/uprobes: Revert "Support mix of ftrace and perf"
2014-07-03  kernel/printk/printk.c: revert "printk: enable interrupts before calling console_trylock_for_printk()"  (Andrew Morton)
console_trylock_for_printk()" Revert commit 939f04bec1a4 ("printk: enable interrupts before calling console_trylock_for_printk()"). Andreas reported: : None of the post 3.15 kernel boot for me. They all hang at the GRUB : screen telling me it loaded and started the kernel, but the kernel : itself stops before it prints anything (or even replaces the GRUB : background graphics). 939f04bec1a4 is modest latency reduction. Revert it until we understand the reason for these failures. Reported-by: Andreas Bombe <aeb@debian.org> Cc: Jan Kara <jack@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-03  crypto: fips - only panic on bad/missing crypto mod signatures  (Jarod Wilson)
Per further discussion with NIST, the requirements for FIPS state that we only need to panic the system on failed kernel module signature checks for crypto subsystem modules. This moves the fips-mode-only module signature check out of the generic module loading code, into the crypto subsystem, at points where we can catch both algorithm module loads and mode module loads. At the same time, make CONFIG_CRYPTO_FIPS dependent on CONFIG_MODULE_SIG, as this is entirely necessary for FIPS mode. v2: remove extraneous blank line, perform checks in static inline function, drop no longer necessary fips.h include. CC: "David S. Miller" <davem@davemloft.net> CC: Rusty Russell <rusty@rustcorp.com.au> CC: Stephan Mueller <stephan.mueller@atsec.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
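A sketch of the kind of check described above; the helper name and the use of module_sig_ok() are assumptions, not necessarily the exact upstream code:

    #include <linux/fips.h>
    #include <linux/module.h>

    /* Panic only for crypto modules, and only when FIPS mode is enabled. */
    static inline void crypto_check_module_sig(struct module *mod)
    {
        if (fips_enabled && mod && !module_sig_ok(mod))
            panic("Module %s signature verification failed in FIPS mode\n",
                  module_name(mod));
    }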
2014-07-02  perf: Do not allow optimized switch for non-cloned events  (Jiri Olsa)
The context check in perf_event_context_sched_out allows non-cloned context to be part of the optimized schedule out switch. This could move non-cloned context into another workload child. Once this child exits, the context is closed and leaves all original (parent) events in closed state. Any other new cloned event will have closed state and not measure anything. And probably causing other odd bugs. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1403598026-2310-2-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-01  workqueue: stronger test in process_one_work()  (Lai Jiangshan)
When POOL_DISASSOCIATED is cleared, the running worker's local CPU should be the same as pool->cpu without any exception even during cpu-hotplug. This patch changes "(proposition_A && proposition_B && proposition_C)" to "(proposition_B && proposition_C)", so if the old compound proposition is true, the new one must be true too, and this won't hide any possible bug which can be hit by the old test. tj: Minor description update and dropped the obvious comment. CC: Jason J. Herne <jjherne@linux.vnet.ibm.com> CC: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-01  workqueue: clear POOL_DISASSOCIATED in rebind_workers()  (Lai Jiangshan)
a9ab775bcadf ("workqueue: directly restore CPU affinity of workers from CPU_ONLINE") moved pool locking into rebind_workers() but left "pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback(). There is nothing necessarily wrong with it, but there is no benefit either. Let's move it into rebind_workers() and achieve the following benefits: 1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers() as expected. 2) we can guarantee that, when POOL_DISASSOCIATED is clear, the running workers of the pool are on the local CPU (pool->cpu). tj: Minor description update. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-01  cpuset: break kernfs active protection in cpuset_write_resmask()  (Tejun Heo)
Writing to either "cpuset.cpus" or "cpuset.mems" file flushes cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up migrating tasks off a cpuset after new resources are added to it. As cpuset_hotplug_work calls into cgroup core via cgroup_transfer_tasks(), this flushing adds the dependency to cgroup core locking from cpuset_write_resmak(). This used to be okay because cgroup interface files were protected by a different mutex; however, 8353da1f91f1 ("cgroup: remove cgroup_tree_mutex") simplified the cgroup core locking and this dependency became a deadlock hazard - cgroup file removal performed under cgroup core lock tries to drain on-going file operation which is trying to flush cpuset_hotplug_work blocked on the same cgroup core lock. The locking simplification was done because kernfs added an a lot easier way to deal with circular dependencies involving kernfs active protection. Let's use the same strategy in cpuset and break active protection in cpuset_write_resmask(). While it isn't the prettiest, this is a very rare, likely unique, situation which also goes away on the unified hierarchy. The commands to trigger the deadlock warning without the patch and the lockdep output follow. localhost:/ # mount -t cgroup -o cpuset xxx /cpuset localhost:/ # mkdir /cpuset/tmp localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus localhost:/ # echo 0 > cpuset/tmp/cpuset.mems localhost:/ # echo $$ > /cpuset/tmp/tasks localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online ====================================================== [ INFO: possible circular locking dependency detected ] 3.16.0-rc1-0.1-default+ #7 Not tainted ------------------------------------------------------- kworker/1:0/32649 is trying to acquire lock: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150 but task is already holding lock: (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (cpuset_hotplug_work){+.+...}: ... -> #1 (s_active#175){++++.+}: ... -> #0 (cgroup_mutex){+.+.+.}: ... other info that might help us debug this: Chain exists of: cgroup_mutex --> s_active#175 --> cpuset_hotplug_work Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpuset_hotplug_work); lock(s_active#175); lock(cpuset_hotplug_work); lock(cgroup_mutex); *** DEADLOCK *** 2 locks held by kworker/1:0/32649: #0: ("events"){.+.+.+}, at: [<ffffffff81085412>] process_one_work+0x192/0x520 #1: (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520 stack backtrace: CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7 ... 
Call Trace: [<ffffffff815a5f78>] dump_stack+0x72/0x8a [<ffffffff810c263f>] print_circular_bug+0x10f/0x120 [<ffffffff810c481e>] check_prev_add+0x43e/0x4b0 [<ffffffff810c4ee6>] validate_chain+0x656/0x7c0 [<ffffffff810c53d2>] __lock_acquire+0x382/0x660 [<ffffffff810c57a9>] lock_acquire+0xf9/0x170 [<ffffffff815aa13f>] mutex_lock_nested+0x6f/0x380 [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150 [<ffffffff811129c0>] hotplug_update_tasks_insane+0x110/0x1d0 [<ffffffff81112bbd>] cpuset_hotplug_update_tasks+0x13d/0x180 [<ffffffff811148ec>] cpuset_hotplug_workfn+0x18c/0x630 [<ffffffff810854d4>] process_one_work+0x254/0x520 [<ffffffff810875dd>] worker_thread+0x13d/0x3d0 [<ffffffff8108e0c8>] kthread+0xf8/0x100 [<ffffffff815acaec>] ret_from_fork+0x7c/0xb0 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Li Zefan <lizefan@huawei.com> Tested-by: Li Zefan <lizefan@huawei.com>
2014-07-01  ipv6: Allow accepting RA from local IP addresses.  (Ben Greear)
This can be used in virtual networking applications, and may have other uses as well. The option is disabled by default. A specific use case is setting up virtual routers, bridges, and hosts on a single OS without the use of network namespaces or virtual machines. With proper use of ip rules, routing tables, veth interface pairs and/or other virtual interfaces, and applications that can bind to interfaces and/or IP addresses, it is possible to create one or more virtual routers with multiple hosts attached. The host interfaces can act as IPv6 systems, with radvd running on the ports in the virtual routers. With the option provided in this patch enabled, those hosts can now properly obtain IPv6 addresses from the radvd. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-01  tracing: Remove ftrace_stop/start() from reading the trace file  (Steven Rostedt (Red Hat))
Disabling reading and writing to the trace file should not be able to disable all function tracing callbacks. There's other users today (like kprobes and perf). Reading a trace file should not stop those from happening. Cc: stable@vger.kernel.org # 3.0+ Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  tracing: Add description of set_graph_notrace to tracing/README  (Namhyung Kim)
It was missing the description of the set_graph_notrace file. Add it. Link: http://lkml.kernel.org/p/1402590233-22321-5-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  tracing: Improve message of empty set_ftrace_notrace file  (Namhyung Kim)
When there's no entry in set_ftrace_notrace, it'll print nothing, but it's better to print something like the message below, as set_graph_notrace does: #### no functions disabled #### Link: http://lkml.kernel.org/p/1402644246-4649-1-git-send-email-namhyung@kernel.org Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  tracing: Improve message of empty set_graph_notrace file  (Namhyung Kim)
When there's no entry in set_graph_notrace, it'll print the message below: #### all functions enabled #### While this is technically correct, it's better to print the following instead: #### no functions disabled #### Link: http://lkml.kernel.org/p/1402590233-22321-3-git-send-email-namhyung@kernel.org Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  tracing: Add ftrace_graph_notrace boot parameter  (Namhyung Kim)
The ftrace_graph_notrace option is for specifying notrace filter for function graph tracer at boot time. It can be altered after boot using set_graph_notrace file on the debugfs. Link: http://lkml.kernel.org/p/1402590233-22321-2-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  tracing: Convert pr_warning() to pr_warn() in trace_events.c  (Fabian Frederick)
Convert pr_warning() to the standard pr_warn(). Define pr_fmt(fmt) fmt to avoid any future default fmt definition. Link: http://lkml.kernel.org/p/1402141388-21144-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
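Illustration of the conversion (the message text and function are just examples, not lines from trace_events.c):

    /* Keep format strings exactly as written; avoids any future default prefix. */
    #define pr_fmt(fmt) fmt

    static int __init example_init(void)
    {
        /* was: pr_warning("Could not initialize trace events\n"); */
        pr_warn("Could not initialize trace events\n");
        return 0;
    }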
2014-07-01  ftrace: Do not copy hash if O_TRUNC is set  (Namhyung Kim)
When a filter file is open for writing and O_TRUNC is set, there's no need to copy and free the filter entries. Link: http://lkml.kernel.org/p/1402474014-28655-2-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  ftrace: Fix memory leak on failure path in ftrace_allocate_pages()  (Namhyung Kim)
As struct ftrace_page is managed in a singly linked list, it should be freed starting from the first page. Link: http://lkml.kernel.org/p/1402474014-28655-1-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  ftrace: Get rid of obsolete global_start_up variable  (Namhyung Kim)
It seems like it's a leftover from commit 4104d326b670 ("ftrace: Remove global function list and call function directly"). As it isn't updated at all, checking its value is meaningless. Let's get rid of it. Link: http://lkml.kernel.org/p/1402584972-17824-1-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  tracing: Add trace_seq_buffer_ptr() helper function  (Steven Rostedt (Red Hat))
There are several locations in the kernel that open code the calculation of the next location in the trace_seq buffer. This is usually done with "p->buffer + p->len". Instead of having this open coded, supply a helper function in the header to do it for them. This function is called trace_seq_buffer_ptr(). Link: http://lkml.kernel.org/p/20140626220129.452783019@goodmis.org Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
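A sketch of the helper as described (struct trace_seq is assumed to expose buffer and len as in that era's kernel):

    /* Return the current write position in the trace_seq buffer. */
    static inline unsigned char *trace_seq_buffer_ptr(struct trace_seq *s)
    {
        return s->buffer + s->len;
    }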
2014-07-01  tracing: Remove unnecessary null test before debugfs_remove()  (Fabian Frederick)
This fixes checkpatch warning: "WARNING: debugfs_remove(NULL) is safe this check is probably not required" Link: http://lkml.kernel.org/p/1403802871-8599-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
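Illustration of the cleanup (dentry is an illustrative variable):

    /*
     * Before:
     *     if (dentry)
     *         debugfs_remove(dentry);
     *
     * After: debugfs_remove() is a no-op when passed NULL.
     */
    debugfs_remove(dentry);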
2014-07-01  tracing: Remove trace_seq_reserve()  (Steven Rostedt (Red Hat))
trace_seq_reserve() has no users in the kernel, it just wastes space. Remove it. Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  tracing: Make trace_seq_putmem_hex() more robust  (Steven Rostedt (Red Hat))
Currently trace_seq_putmem_hex() can only take as a parameter a pointer to something that is 8 bytes or less, otherwise it will overflow the buffer. This is protected by a macro that encompasses the call to trace_seq_putmem_hex() that has a BUILD_BUG_ON() for the variable before it is passed in. This is not very robust and if trace_seq_putmem_hex() ever gets used outside that macro it will cause issues. Instead of only being able to produce a hex output of memory that is for a single word, change it to be more robust and allow any size input. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  tracing: Clean up trace_seq.c  (Steven Rostedt (Red Hat))
For using trace_seq_*() functions in NMI context, I posted a patch to move it to the lib/ directory. This caused Andrew Morton to take a look at the code. He went through and gave a lot of comments about missing kernel doc, inconsistent types for the save variable, mixed use of EXPORT_SYMBOL_GPL() and EXPORT_SYMBOL(), as well as missing EXPORT_SYMBOL*()s. There were a few comments about the way variables were being compared (int vs uint). All these were good review comments and should be implemented regardless of whether trace_seq.c is moved to lib/ or not. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  tracing: Move the trace_seq_* functions into its own trace_seq.c file  (Steven Rostedt (Red Hat))
The trace_seq_*() functions are a nice utility that allows users to manipulate buffers with printf() like formats. It has its own trace_seq.h header in include/linux and should be in its own file. Being tied with trace_output.c is rather awkward. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  ftrace: Simplify ftrace_hash_disable/enable path in ftrace_hash_move  (Masami Hiramatsu)
Simplify the ftrace_hash_disable/enable path in ftrace_hash_move to harden the process when the memory allocation fails. Link: http://lkml.kernel.org/p/20140617110442.15167.81076.stgit@kbuild-fedora.novalocal Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  ftrace: Add trampolines to enabled_functions debug file  (Steven Rostedt (Red Hat))
The enabled_functions file is used to help debug the dynamic function tracing. Showing which trampolines are attached in this file is useful for debugging. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01  ftrace: Optimize function graph to be called directly  (Steven Rostedt (Red Hat))
Function graph tracing is a bit different from the function tracers: it is processed after either ftrace_caller or ftrace_regs_caller, and we only have one place to modify the jump to ftrace_graph_caller; the jump needs to happen after the restore of registers. The function graph tracer is dependent on the function tracer, where even if the function graph tracing is going on by itself, the save and restore of registers is still done for function tracing regardless of whether function tracing is happening, before it calls the function graph code. If there's no function tracing happening, it is possible to just call the function graph tracer directly, and avoid the wasted effort to save and restore regs for function tracing. This requires adding new flags to the dyn_ftrace records: FTRACE_FL_TRAMP FTRACE_FL_TRAMP_EN The first is set if the count for the record is one, and the ftrace_ops associated with that record has its own trampoline. That way the mcount code can call that trampoline directly. In the future, trampolines can be added to arbitrary ftrace_ops, where you can have two or more ftrace_ops registered to ftrace (like kprobes and perf) and if they are not tracing the same functions, then instead of doing a loop to check all registered ftrace_ops against their hashes, just call the ftrace_ops trampoline directly, which would call the registered ftrace_ops function directly. Without this patch perf showed: 0.05% hackbench [kernel.kallsyms] [k] ftrace_caller 0.05% hackbench [kernel.kallsyms] [k] arch_local_irq_save 0.05% hackbench [kernel.kallsyms] [k] native_sched_clock 0.04% hackbench [kernel.kallsyms] [k] __buffer_unlock_commit 0.04% hackbench [kernel.kallsyms] [k] preempt_trace 0.04% hackbench [kernel.kallsyms] [k] prepare_ftrace_return 0.04% hackbench [kernel.kallsyms] [k] __this_cpu_preempt_check 0.04% hackbench [kernel.kallsyms] [k] ftrace_graph_caller See that the ftrace_caller took up more time than the ftrace_graph_caller did. With this patch: 0.05% hackbench [kernel.kallsyms] [k] __buffer_unlock_commit 0.04% hackbench [kernel.kallsyms] [k] call_filter_check_discard 0.04% hackbench [kernel.kallsyms] [k] ftrace_graph_caller 0.04% hackbench [kernel.kallsyms] [k] sched_clock The ftrace_caller is nowhere to be found and ftrace_graph_caller still takes up the same percentage. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30  tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()  (Oleg Nesterov)
The usage of uprobe_buffer_enable() added by dcad1a20 is very wrong, 1. uprobe_buffer_enable() and uprobe_buffer_disable() are not balanced, _enable() should be called only if !enabled. 2. If uprobe_buffer_enable() fails probe_event_enable() should clear tp.flags and free event_file_link. 3. If uprobe_register() fails it should do uprobe_buffer_disable(). Link: http://lkml.kernel.org/p/20140627170146.GA18332@redhat.com Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Fixes: dcad1a204f72 "tracing/uprobes: Fetch args before reserving a ring buffer" Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30  tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher()  (Oleg Nesterov)
I do not know why dd9fa555d7bb "tracing/uprobes: Move argument fetching to uprobe_dispatcher()" added the UPROBE_HANDLER_REMOVE, but it looks wrong. OK, perhaps it makes sense to avoid store_trace_args() if the tracee is nacked by uprobe_perf_filter(). But then we should kill the same code in uprobe_perf_func() and unify the TRACE/PROFILE filtering (we need to do this anyway to mix perf/ftrace). Until then this code actually adds the pessimization because uprobe_perf_filter() will be called twice and return true in the likely case. Link: http://lkml.kernel.org/p/20140627170143.GA18329@redhat.com Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30  uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone  (Oleg Nesterov)
Add WARN_ON's into uprobe_unregister() and uprobe_apply() to ensure that nobody tries to play with the dead uprobe/consumer. This helps to catch the bugs like the one fixed by the previous patch. In the longer term we should fix this poorly designed interface. uprobe_register() should return "struct uprobe *" which should be passed to apply/unregister. Plus other semantic changes, see the changelog in commit 41ccba029e94. Link: http://lkml.kernel.org/p/20140627170140.GA18322@redhat.com Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30  tracing/uprobes: Revert "Support mix of ftrace and perf"  (Oleg Nesterov)
This reverts commit 43fe98913c9f67e3b523615ee3316f9520a623e0. This patch is very wrong. Firstly, this change leads to unbalanced uprobe_unregister(). Just for example, # perf probe -x /lib/libc.so.6 syscall # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable # perf record -e probe_libc:syscall whatever after that uprobe is dead (unregistered) but the user of ftrace/perf can't know this, and it looks as if nobody hits this probe. This would be easy to fix, but there are other reasons why it is not simple to mix ftrace and perf. If nothing else, they can't share the same ->consumer.filter. This is fixable too, but probably we need to fix the poorly designed uprobe_register() interface first. At least "register" and "apply" should be clearly separated. Link: http://lkml.kernel.org/p/20140627170136.GA18319@redhat.com Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com> Cc: stable@vger.kernel.org # v3.14 Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>