summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2016-05-06sched: Make hrtick_notifier an explicit callThomas Gleixner
No need for an extra notifier. We don't need to handle all these states. It's sufficient to kill the timer when the cpu dies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.770528462@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/fair: Make ilb_notifier an explicit callThomas Gleixner
No need for an extra notifier. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.693720241@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/hotplug: Move migration CPU_DYING to sched_cpu_dying()Thomas Gleixner
Remove the hotplug notifier and make it an explicit state. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.502222097@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/migration: Move CPU_ONLINE into scheduler stateThomas Gleixner
The alleged requirement that the migration notifier has a lower priority than perf is completely undocumented and there is no indication at all that this is true. perf does not even handle the CPU_ONLINE notification and perf really has nothing to do with migration. Move the CPU_ONLINE code into the sched_activate_cpu() state callback. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.421743581@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/migration: Move calc_load_migrate() into CPU_DYINGThomas Gleixner
It really does not matter when we fold the load for the outgoing cpu. It's almost dead anyway, so there is no harm if we fail to fold the few microseconds which are required for going fully away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.328739226@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/migration: Move prepare transition to SCHED_STARTING stateThomas Gleixner
We can piggy pack that on the SCHED_STARTING state. It's not required before the cpu actually comes online. Name the function proper as it has nothing to do with migration. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.248226511@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/hotplug: Move sync_rcu to be with set_cpu_active(false)Peter Zijlstra
The sync_rcu stuff is specificically for clearing bits in the active mask, such that everybody will observe the bit cleared and will not consider the cleared CPU for load-balancing etc. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.169219710@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/hotplug: Convert cpu_[in]active notifiers to state machineThomas Gleixner
Now that we reduced everything into single notifiers, it's simple to move them into the hotplug state machine space. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched: Move sched_domains_numa_masks_clear() to DOWN_PREPAREThomas Gleixner
This is the last operation on the cpu before vanishing. No point in calling that on CPU_DEAD. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched: Consolidate the notifier mazeThomas Gleixner
We can maintain the ordering of the scheduler cpu hotplug functionality nicely in one notifer. Get rid of the maze. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched: Allow hotplug notifiers to be setup earlyThomas Gleixner
Prevent the SMP scheduler related notifiers to be executed before the smp scheduler is initialized and install them early. This is a preparatory change for further consolidation of the hotplug notifier maze. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched: Make set_cpu_rq_start_time() a built in hotplug stateThomas Gleixner
Start distangling the maze of hotplug notifiers in the scheduler. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched: Allow per-cpu kernel threads to run on online && !activePeter Zijlstra (Intel)
In order to enable symmetric hotplug, we must mirror the online && !active state of cpu-down on the cpu-up side. However, to retain sanity, limit this state to per-cpu kthreads. Aside from the change to set_cpus_allowed_ptr(), which allow moving the per-cpu kthreads on, the other critical piece is the cpu selection for pinned tasks in select_task_rq(). This avoids dropping into select_fallback_rq(). select_fallback_rq() cannot be allowed to select !active cpus because its used to migrate user tasks away. And we do not want to move user tasks onto cpus that are in transition. Requested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160301152303.GV6356@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-05sched/fair: Fix comment in calculate_imbalance()Dietmar Eggemann
The comment in calculate_imbalance() was introduced in commit: 2dd73a4f09be ("[PATCH] sched: implement smpnice") which described the logic as it was then, but a later commit: b18855500fc4 ("sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()") .. complicated this logic some more so that the comment does not match anymore. Update the comment to match the code. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1461958364-675-3-git-send-email-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/fair: Remove stale power aware scheduling commentsDietmar Eggemann
Commit 8e7fbcbc22c1 ("sched: Remove stale power aware scheduling remnants and dysfunctional knobs") deleted the power aware scheduling support. This patch gets rid of the remaining power aware scheduling related comments in the code as well. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1461958364-675-2-git-send-email-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/fair: Update rq clock before updating nohz CPU loadMatt Fleming
If we're accessing rq_clock() (e.g. in sched_avg_update()) we should update the rq clock before calling cpu_load_update(), otherwise any time calculations will be stale. All other paths currently call update_rq_clock(). Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1462304814-11715-1-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/debug: Print out idle balance values even on !CONFIG_SCHEDSTATS kernelsWanpeng Li
The max_idle_balance_cost and avg_idle values which are tracked and ar used to capture short idle incidents, are not associated with schedstats, however the information of these two values isn't printed out on !CONFIG_SCHEDSTATS kernels. Fix this by moving the value printout out of the CONFIG_SCHEDSTATS section. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1462250305-4523-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/fair: Optimize sum computation with a lookup tableYuyang Du
__compute_runnable_contrib() uses a loop to compute sum, whereas a table lookup can do it faster in a constant amount of time. The program to generate the constants is located at: Documentation/scheduler/sched-avg.txt Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@arm.com Cc: pjt@google.com Link: http://lkml.kernel.org/r/1462226078-31904-2-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/fair: Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT and remove ↵Yuyang Du
SCHED_LOAD_SCALE After cleaning up the sched metrics, there are two definitions that are ambiguous and confusing: SCHED_LOAD_SHIFT and SCHED_LOAD_SHIFT. Resolve this: - Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT, which better reflects what it is. - Replace SCHED_LOAD_SCALE use with SCHED_CAPACITY_SCALE and remove SCHED_LOAD_SCALE. Suggested-by: Ben Segall <bsegall@google.com> Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: lizefan@huawei.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1459829551-21625-3-git-send-email-yuyang.du@intel.com [ Rewrote the changelog and fixed the build on 32-bit kernels. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/fair: Generalize the load/util averages resolution definitionYuyang Du
Integer metric needs fixed point arithmetic. In sched/fair, a few metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity, may have different fixed point ranges, which makes their update and usage error-prone. In order to avoid the errors relating to the fixed point range, we definie a basic fixed point range, and then formalize all metrics to base on the basic range. The basic range is 1024 or (1 << 10). Further, one can recursively apply the basic range to have larger range. Pointed out by Ben Segall, weight (visible to user, e.g., NICE-0 has 1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they must be well calibrated. Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: lizefan@huawei.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1459829551-21625-2-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/core: Enable increased load resolution on 64-bit kernelsPeter Zijlstra
Mike ran into the low load resolution limitation on his big machine. So reenable these bits; nobody could ever reproduce/analyze the reported power usage claim and Google has been running with this for years as well. Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05locking/lockdep, sched/core: Implement a better lock pinning schemePeter Zijlstra
The problem with the existing lock pinning is that each pin is of value 1; this mean you can simply unpin if you know its pinned, without having any extra information. This scheme generates a random (16 bit) cookie for each pin and requires this same cookie to unpin. This means you have to keep the cookie in context. No objsize difference for !LOCKDEP kernels. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/core: Introduce 'struct rq_flags'Peter Zijlstra
In order to be able to pass around more than just the IRQ flags in the future, add a rq_flags structure. No difference in code generation for the x86_64-defconfig build I tested. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/core: Move task_rq_lock() out of linePeter Zijlstra
Its a rather large function, inline doesn't seems to make much sense: $ size defconfig-build/kernel/sched/core.o{.orig,} text data bss dec hex filename 56533 21037 2320 79890 13812 defconfig-build/kernel/sched/core.o.orig 55733 21037 2320 79090 134f2 defconfig-build/kernel/sched/core.o The 'perf bench sched messaging' micro-benchmark shows a visible improvement of 4-5%: $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i ; done $ perf stat --null --repeat 25 -- perf bench sched messaging -g 40 -l 5000 pre: 4.582798193 seconds time elapsed ( +- 1.41% ) 4.733374877 seconds time elapsed ( +- 2.10% ) 4.560955136 seconds time elapsed ( +- 1.43% ) 4.631062303 seconds time elapsed ( +- 1.40% ) post: 4.364765213 seconds time elapsed ( +- 0.91% ) 4.454442734 seconds time elapsed ( +- 1.18% ) 4.448893817 seconds time elapsed ( +- 1.41% ) 4.424346872 seconds time elapsed ( +- 0.97% ) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05Merge branch 'sched/urgent' into sched/core, to pick up fixes before ↵Ingo Molnar
applying new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28Merge back earlier cpufreq material for v4.7.Rafael J. Wysocki
2016-04-28sched/core: Add switch_mm_irqs_off() and use it in the schedulerAndy Lutomirski
By default, this is the same thing as switch_mm(). x86 will override it as an optimization. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/df401df47bdd6be3e389c6f1e3f5310d70e81b2c.1461688545.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple timesWanpeng Li
Sometimes delta_exec is 0 due to update_curr() is called multiple times, this is captured by: u64 delta_exec = rq_clock_task(rq) - curr->se.exec_start; This patch optimizes the cpufreq update kicker by bailing out when nothing changed, it will benefit the upcoming schedutil, since otherwise it will (over)react to the special util/max combination. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1461316044-9520-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()Peter Zijlstra
Chris Metcalf reported a that sched_can_stop_tick() sometimes fails to re-enable the tick. His observed problem is that rq->cfs.nr_running can be 1 even though there are multiple runnable CFS tasks. This happens in the cgroup case, in which case cfs.nr_running is the number of runnable entities for that level. If there is a single runnable cgroup (which can have an arbitrary number of runnable child entries itself) rq->cfs.nr_running will be 1. However, looking at that function I think there's more problems with it. It seems to assume that if there's FIFO tasks, those will run. This is incorrect. The FIFO task can have a lower prio than an RR task, in which case the RR task will run. So the whole fifo_nr_running test seems misplaced, it should go after the rr_nr_running tests. That is, only if !rr_nr_running, can we use fifo_nr_running like this. Reported-by: Chris Metcalf <cmetcalf@mellanox.com> Tested-by: Chris Metcalf <cmetcalf@mellanox.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Wanpeng Li <kernellwp@gmail.com> Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model") Link: http://lkml.kernel.org/r/20160421160315.GK24771@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/deadline: Fix a bug in dl_overflow()Xunlei Pang
I got a minus(very big) dl_b->total_bw during my deadline tests. # grep dl /proc/sched_debug dl_rq[0]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : -222297900 Something unusual must have happened. After some digging, I finally noticed that when changing a deadline task to normal(cfs), and changing it back to deadline immediately, after it died, we will got the wrong dl_bw->total_bw. The root cause is in dl_overflow(), it has: if (new_bw == p->dl.dl_bw) return 0; 1) When a deadline task is changed to !deadline task, it will start dl timer in switched_from_dl(), and retain previous deadline parameter till the timer expires. 2) If we change it back to deadline with the same bandwidth parameter before the timer expires, as it keeps the old bandwidth although it is not a deadline task. dl_overflow() simply returns success without updating the right data, and got the wrong dl_bw->total_bw. The solution is simple, if @p is not deadline, don't return. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@arm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460636368-1993-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/fair: Optimize !CONFIG_NO_HZ_COMMON CPU load updatesFrederic Weisbecker
Some code in CPU load update only concern NO_HZ configs but it is built on all configurations. When NO_HZ isn't built, that code is harmless but just happens to take some useless ressources in CPU and memory: 1) one useless field in struct rq 2) jiffies record on every tick that is never used (cpu_load_update_periodic) 3) decay_load_missed is called two times on every tick to eventually return immediately with no action taken. And that function is dead code. For pure optimization purposes, lets conditionally build the NO_HZ related code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1461080211-16271-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/fair: Correctly handle nohz ticks CPU load accountingFrederic Weisbecker
Ticks can happen while the CPU is in dynticks-idle or dynticks-singletask mode. In fact "nohz" or "dynticks" only mean that we exit the periodic mode and we try to minimize the ticks as much as possible. The nohz subsystem uses a confusing terminology with the internal state "ts->tick_stopped" which is also available through its public interface with tick_nohz_tick_stopped(). This is a misnomer as the tick is instead reduced with the best effort rather than stopped. In the best case the tick can indeed be actually stopped but there is no guarantee about that. If a timer needs to fire one second later, a tick will fire while the CPU is in nohz mode and this is a very common scenario. Now this confusion happens to be a problem with CPU load updates: cpu_load_update_active() doesn't handle nohz ticks correctly because it assumes that ticks are completely stopped in nohz mode and that cpu_load_update_active() can't be called in dynticks mode. When that happens, the whole previous tickless load is ignored and the function just records the load for the current tick, ignoring potentially long idle periods behind. In order to solve this, we could account the current load for the previous nohz time but there is a risk that we account the load of a task that got freshly enqueued for the whole nohz period. So instead, lets record the dynticks load on nohz frame entry so we know what to record in case of nohz ticks, then use this record to account the tickless load on nohz ticks and nohz frame end. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460555812-25375-3-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/fair: Gather CPU load functions under a more conventional namespaceFrederic Weisbecker
The CPU load update related functions have a weak naming convention currently, starting with update_cpu_load_*() which isn't ideal as "update" is a very generic concept. Since two of these functions are public already (and a third is to come) that's enough to introduce a more conventional naming scheme. So let's do the following rename instead: update_cpu_load_*() -> cpu_load_update_*() Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460555812-25375-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/fair: Call cpufreq hook in additional pathsSteve Muckle
The cpufreq hook should be called any time the root CFS rq utilization changes. This can occur when a task is switched to or from the fair class, or a task moves between groups or CPUs, but these paths currently do not call the cpufreq hook. Fix this by adding the hook to attach_entity_load_avg() and detach_entity_load_avg(). Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Steve Muckle <smuckle@linaro.org> [ Added the .update_freq argument to update_cfs_rq_load_avg() to avoid a double cpufreq call. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <Juri.Lelli@arm.com> Cc: Michael Turquette <mturquette@baylibre.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1458858367-2831-1-git-send-email-smuckle@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/fair: Do not call cpufreq hook unless util changedSteve Muckle
There's no reason to call the cpufreq hook if the root cfs_rq utilization has not been modified. Signed-off-by: Steve Muckle <smuckle@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <Juri.Lelli@arm.com> Cc: Michael Turquette <mturquette@baylibre.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/1458606068-7476-2-git-send-email-smuckle@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/fair: Move cpufreq hook to update_cfs_rq_load_avg()Steve Muckle
The cpufreq hook should be called whenever the root cfs_rq utilization changes so update_cfs_rq_load_avg() is a better place for it. The current location is not invoked in the enqueue_entity() or update_blocked_averages() paths. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Steve Muckle <smuckle@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <Juri.Lelli@arm.com> Cc: Michael Turquette <mturquette@baylibre.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1458606068-7476-1-git-send-email-smuckle@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/fair: Fix asym packing to select correct CPUSrikar Dronamraju
When asymmetric packing is set in the sched_domain and target CPU is busy, update_sd_pick_busiest() may not select the busiest runqueue. When target CPU is busy, find_busiest_group() will ignore checks for asym packing and may continue to load balance using the currently selected not-the-busiest runqueue as source runqueue. Selecting the busiest runqueue as source when the target CPU is busy, should result in achieving much better load balance. Also when target CPU is not busy and asymmetric packing is set in sd, select higher CPU as source CPU for load balancing. While doing this change, move the check to see if target CPU is busy into check_asym_packing(). The extent of performance benefit from this change decreases with the increasing load. However there is benefit in undercommit as well as overcommit conditions. 1. Record per second ebizzy (32 threads) on a 64 CPU power 7 box. (5 iterations) 4.6.0-rc2 Testcase: Min Max Avg StdDev ebizzy: 5223767.00 10368236.00 7946971.00 1753094.76 4.6.0-rc2+asym-changes Testcase: Min Max Avg StdDev %Change ebizzy: 8617191.00 13872356.00 11383980.00 1783400.89 +24.78% 2. Record per second ebizzy (64 threads) on a 64 CPU power 7 box. (5 iterations) 4.6.0-rc2 Testcase: Min Max Avg StdDev ebizzy: 6497666.00 18399783.00 10818093.20 4051452.08 4.6.0-rc2+asym-changes Testcase: Min Max Avg StdDev %Change ebizzy: 7567365.00 19456937.00 11674063.60 4295407.48 +4.40% 3. Record per second ebizzy (128 threads) on a 64 CPU power 7 box. (5 iterations) 4.6.0-rc2 Testcase: Min Max Avg StdDev ebizzy: 37073983.00 40341911.00 38776241.80 1259766.82 4.6.0-rc2+asym-changes Testcase: Min Max Avg StdDev %Change ebizzy: 38030399.00 41333378.00 39827404.40 1255001.86 +2.54% Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1459948660-16073-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13sched/cpuacct: Check for NULL when using task_pt_regs()Anton Blanchard
task_pt_regs() can return NULL for kernel threads, so add a check. This fixes an oops at boot on ppc64. Reported-and-Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: htejun@gmail.com Cc: linuxppc-dev@lists.ozlabs.org Cc: tj@kernel.org Cc: yangds.fnst@cn.fujitsu.com Link: http://lkml.kernel.org/r/20160406215950.04bc3f0b@kryten Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13sched/clock: Make local_clock()/cpu_clock() inlineDaniel Lezcano
The local_clock/cpu_clock functions were changed to prevent a double identical test with sched_clock_cpu() when HAVE_UNSTABLE_SCHED_CLOCK is set. That resulted in one line functions. As these functions are in all the cases one line functions and in the hot path, it is useful to specify them as static inline in order to give a strong hint to the compiler. After verification, it appears the compiler does not inline them without this hint. Change those functions to static inline. sched_clock_cpu() is called via the inlined local_clock()/cpu_clock() functions from sched.h. So any module code including sched.h will reference sched_clock_cpu(). Thus it must be exported with the EXPORT_SYMBOL_GPL macro. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460385514-14700-2-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13sched/clock: Remove pointless test in cpu_clock/local_clockDaniel Lezcano
In case the HAVE_UNSTABLE_SCHED_CLOCK config is set, the cpu_clock() version checks if sched_clock_stable() is not set and calls sched_clock_cpu(), otherwise it calls sched_clock(). sched_clock_cpu() checks also if sched_clock_stable() is set and, if true, calls sched_clock(). sched_clock() will be called in sched_clock_cpu() if sched_clock_stable() is true. Remove the duplicate test by directly calling sched_clock_cpu() and let the static key act in this function instead. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460385514-14700-1-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13sched/debug: Don't dump sched debug info in SysRq-WRabin Vincent
sysrq_sched_debug_show() can dump a lot of information. Don't print out all that if we're just trying to get a list of blocked tasks (SysRq-W). The information is still accessible with SysRq-T. Signed-off-by: Rabin Vincent <rabinv@axis.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1459777322-30902-1-git-send-email-rabin.vincent@axis.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-08cpufreq: Call cpufreq_disable_fast_switch() in sugov_exit()Rafael J. Wysocki
Due to differences in the cpufreq core's handling of runtime CPU offline and nonboot CPUs disabling during system suspend-to-RAM, fast frequency switching gets disabled after a suspend-to-RAM and resume cycle on all of the nonboot CPUs. To prevent that from happening, move the invocation of cpufreq_disable_fast_switch() from cpufreq_exit_governor() to sugov_exit(), as the schedutil governor is the only user of fast frequency switching today anyway. That simply prevents cpufreq_disable_fast_switch() from being called without invoking the ->governor callback for the CPUFREQ_GOV_POLICY_EXIT event (which happens during system suspend now). Fixes: b7898fda5bc7 (cpufreq: Support for fast frequency switching) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-04-02cpufreq: schedutil: New governor based on scheduler utilization dataRafael J. Wysocki
Add a new cpufreq scaling governor, called "schedutil", that uses scheduler-provided CPU utilization information as input for making its decisions. Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add mechanism for registering utilization update callbacks) that introduced cpufreq_update_util() called by the scheduler on utilization changes (from CFS) and RT/DL task status updates. In particular, CPU frequency scaling decisions may be based on the the utilization data passed to cpufreq_update_util() by CFS. The new governor is relatively simple. The frequency selection formula used by it depends on whether or not the utilization is frequency-invariant. In the frequency-invariant case the new CPU frequency is given by next_freq = 1.25 * max_freq * util / max where util and max are the last two arguments of cpufreq_update_util(). In turn, if util is not frequency-invariant, the maximum frequency in the above formula is replaced with the current frequency of the CPU: next_freq = 1.25 * curr_freq * util / max The coefficient 1.25 corresponds to the frequency tipping point at (util / max) = 0.8. All of the computations are carried out in the utilization update handlers provided by the new governor. One of those handlers is used for cpufreq policies shared between multiple CPUs and the other one is for policies with one CPU only (and therefore it doesn't need to use any extra synchronization means). The governor supports fast frequency switching if that is supported by the cpufreq driver in use and possible for the given policy. In the fast switching case, all operations of the governor take place in its utilization update handlers. If fast switching cannot be used, the frequency switch operations are carried out with the help of a work item which only calls __cpufreq_driver_target() (under a mutex) to trigger a frequency update (to a value already computed beforehand in one of the utilization update handlers). Currently, the governor treats all of the RT and DL tasks as "unknown utilization" and sets the frequency to the allowed maximum when updated from the RT or DL sched classes. That heavy-handed approach should be replaced with something more subtle and specifically targeted at RT and DL tasks. The governor shares some tunables management code with the "ondemand" and "conservative" governors and uses some common definitions from cpufreq_governor.h, but apart from that it is stand-alone. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2016-04-02cpufreq: sched: Helpers to add and remove update_util hooksRafael J. Wysocki
Replace the single helper for adding and removing cpufreq utilization update hooks, cpufreq_set_update_util_data(), with a pair of helpers, cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(), and modify the users of cpufreq_set_update_util_data() accordingly. With the new helpers, the code using them doesn't need to worry about the internals of struct update_util_data and in particular it doesn't need to worry about populating the func field in it properly upfront. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2016-03-31sched/fair: Initiate a new task's util avg to a bounded valueYuyang Du
A new task's util_avg is set to full utilization of a CPU (100% time running). This accelerates a new task's utilization ramp-up, useful to boost its execution in early time. However, it may result in (insanely) high utilization for a transient time period when a flood of tasks are spawned. Importantly, it violates the "fundamentally bounded" CPU utilization, and its side effect is negative if we don't take any measure to bound it. This patch proposes an algorithm to address this issue. It has two methods to approach a sensible initial util_avg: (1) An expected (or average) util_avg based on its cfs_rq's util_avg: util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight (2) A trajectory of how successive new tasks' util develops, which gives 1/2 of the left utilization budget to a new task such that the additional util is noticeably large (when overall util is low) or unnoticeably small (when overall util is high enough). In the meantime, the aggregate utilization is well bounded: util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n where n denotes the nth task. If util_avg is larger than util_avg_cap, then the effective util is clamped to the util_avg_cap. Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: steve.muckle@linaro.org Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/fair: Update comments after a variable renameYuyang Du
The following commit: ed82b8a1ff76 ("sched/core: Move the sched_to_prio[] arrays out of line") renamed prio_to_weight to sched_prio_to_weight, but the old name was not updated in comments. Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1459292871-22531-1-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/core: Add preempt checks in preempt_schedule() codeSteven Rostedt
While testing the tracer preemptoff, I hit this strange trace: <...>-259 0...1 0us : schedule <-worker_thread <...>-259 0d..1 0us : rcu_note_context_switch <-__schedule <...>-259 0d..1 0us : rcu_sched_qs <-rcu_note_context_switch <...>-259 0d..1 0us : rcu_preempt_qs <-rcu_note_context_switch <...>-259 0d..1 0us : _raw_spin_lock <-__schedule <...>-259 0d..1 0us : preempt_count_add <-_raw_spin_lock <...>-259 0d..2 0us : do_raw_spin_lock <-_raw_spin_lock <...>-259 0d..2 1us : deactivate_task <-__schedule <...>-259 0d..2 1us : update_rq_clock.part.84 <-deactivate_task <...>-259 0d..2 1us : dequeue_task_fair <-deactivate_task <...>-259 0d..2 1us : dequeue_entity <-dequeue_task_fair <...>-259 0d..2 1us : update_curr <-dequeue_entity <...>-259 0d..2 1us : update_min_vruntime <-update_curr <...>-259 0d..2 1us : cpuacct_charge <-update_curr <...>-259 0d..2 1us : __rcu_read_lock <-cpuacct_charge <...>-259 0d..2 1us : __rcu_read_unlock <-cpuacct_charge <...>-259 0d..2 1us : clear_buddies <-dequeue_entity <...>-259 0d..2 1us : account_entity_dequeue <-dequeue_entity <...>-259 0d..2 2us : update_min_vruntime <-dequeue_entity <...>-259 0d..2 2us : update_cfs_shares <-dequeue_entity <...>-259 0d..2 2us : hrtick_update <-dequeue_task_fair <...>-259 0d..2 2us : wq_worker_sleeping <-__schedule <...>-259 0d..2 2us : kthread_data <-wq_worker_sleeping <...>-259 0d..2 2us : pick_next_task_fair <-__schedule <...>-259 0d..2 2us : check_cfs_rq_runtime <-pick_next_task_fair <...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair <...>-259 0d..2 2us : clear_buddies <-pick_next_entity <...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair <...>-259 0d..2 2us : clear_buddies <-pick_next_entity <...>-259 0d..2 2us : set_next_entity <-pick_next_task_fair <...>-259 0d..2 3us : put_prev_entity <-pick_next_task_fair <...>-259 0d..2 3us : check_cfs_rq_runtime <-put_prev_entity <...>-259 0d..2 3us : set_next_entity <-pick_next_task_fair gnome-sh-1031 0d..2 3us : finish_task_switch <-__schedule gnome-sh-1031 0d..2 3us : _raw_spin_unlock_irq <-finish_task_switch gnome-sh-1031 0d..2 3us : do_raw_spin_unlock <-_raw_spin_unlock_irq gnome-sh-1031 0...2 3us!: preempt_count_sub <-_raw_spin_unlock_irq gnome-sh-1031 0...1 582us : do_raw_spin_lock <-_raw_spin_lock gnome-sh-1031 0...1 583us : _raw_spin_unlock <-drm_gem_object_lookup gnome-sh-1031 0...1 583us : do_raw_spin_unlock <-_raw_spin_unlock gnome-sh-1031 0...1 583us : preempt_count_sub <-_raw_spin_unlock gnome-sh-1031 0...1 584us : _raw_spin_unlock <-drm_gem_object_lookup gnome-sh-1031 0...1 584us+: trace_preempt_on <-drm_gem_object_lookup gnome-sh-1031 0...1 603us : <stack trace> => preempt_count_sub => _raw_spin_unlock => drm_gem_object_lookup => i915_gem_madvise_ioctl => drm_ioctl => do_vfs_ioctl => SyS_ioctl => entry_SYSCALL_64_fastpath As I'm tracing preemption disabled, it seemed incorrect that the trace would go across a schedule and report not being in the scheduler. Looking into this I discovered the problem. schedule() calls preempt_disable() but the preempt_schedule() calls preempt_enable_notrace(). What happened above was that the gnome-shell task was preempted on another CPU, migrated over to the idle cpu. The tracer stared with idle calling schedule(), which called preempt_disable(), but then gnome-shell finished, and it enabled preemption with preempt_enable_notrace() that does stop the trace, even though preemption was enabled. The purpose of the preempt_disable_notrace() in the preempt_schedule() is to prevent function tracing from going into an infinite loop. Because function tracing can trace the preempt_enable/disable() calls that are traced. The problem with function tracing is: NEED_RESCHED set preempt_schedule() preempt_disable() preempt_count_inc() function trace (before incrementing preempt count) preempt_disable_notrace() preempt_enable_notrace() sees NEED_RESCHED set preempt_schedule() (repeat) Now by breaking out the preempt off/on tracing into their own code: preempt_disable_check() and preempt_enable_check(), we can add these to the preempt_schedule() code. As preemption would then be disabled, even if they were to be traced by the function tracer, the disabled preemption would prevent the recursion. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160321112339.6dc78ad6@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/numa: Remove unnecessary NUMA dequeue update from non-SMP kernelsTim Chen
In account_entity_enqueue(), we do not do account_numa_enqueue() as NUMA balancing is not needed for UP kernels. Hence, we should remove the account_numa_dequeue() call from account_entity_dequeue() for UP kernels. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1454366879.21738.29.camel@schen9-desk2.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/fair: Reset nr_balance_failed after active balancingSrikar Dronamraju
To force a task migration during active balancing, nr_balance_failed is set to cache_nice_tries + 1. However nr_balance_failed is not reset. As a side effect, the next regular load balance under the same sd, a cache hot task might be migrated, just because nr_balance_failed count is high. Resetting nr_balance_failed after a successful active balance ensures that a hot task is not unreasonably migrated. This can be verified by looking at othe number of hot task migrations reported by /proc/schedstat. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1458735884-30105-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31sched/cpuacct: Split usage accounting into user_usage and sys_usageDongsheng Yang
Sometimes, cpuacct.usage is not detailed enough to see how much CPU usage a group had. We want to know how much time it used in user mode and how much in kernel mode. This patch introduces more files to give this information: # ls /sys/fs/cgroup/cpuacct/cpuacct.usage* /sys/fs/cgroup/cpuacct/cpuacct.usage /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu /sys/fs/cgroup/cpuacct/cpuacct.usage_user /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user /sys/fs/cgroup/cpuacct/cpuacct.usage_sys /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys ... while keeping the ABI with the existing counter. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> [ Ported to newer kernels. ] Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/aa171da036b520b51c79549e9b3215d29473f19d.1458635566.git.zhaolei@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>