summaryrefslogtreecommitdiff
path: root/kernel/sched/deadline.c
AgeCommit message (Collapse)Author
2019-11-08sched: Fix pick_next_task() vs 'change' pattern racePeter Zijlstra
Commit 67692435c411 ("sched: Rework pick_next_task() slow-path") inadvertly introduced a race because it changed a previously unexplored dependency between dropping the rq->lock and sched_class::put_prev_task(). The comments about dropping rq->lock, in for example newidle_balance(), only mentions the task being current and ->on_cpu being set. But when we look at the 'change' pattern (in for example sched_setnuma()): queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */ running = task_current(rq, p); /* rq->curr == p */ if (queued) dequeue_task(...); if (running) put_prev_task(...); /* change task properties */ if (queued) enqueue_task(...); if (running) set_next_task(...); It becomes obvious that if we do this after put_prev_task() has already been called on @p, things go sideways. This is exactly what the commit in question allows to happen when it does: prev->sched_class->put_prev_task(rq, prev, rf); if (!rq->nr_running) newidle_balance(rq, rf); The newidle_balance() call will drop rq->lock after we've called put_prev_task() and that allows the above 'change' pattern to interleave and mess up the state. Furthermore, it turns out we lost the RT-pull when we put the last DL task. Fix both problems by extracting the balancing from put_prev_task() and doing a multi-class balance() pass before put_prev_task(). Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path") Reported-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Quentin Perret <qperret@google.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com>
2019-09-17Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core timer updates from Thomas Gleixner: "Timers and timekeeping updates: - A large overhaul of the posix CPU timer code which is a preparation for moving the CPU timer expiry out into task work so it can be properly accounted on the task/process. An update to the bogus permission checks will come later during the merge window as feedback was not complete before heading of for travel. - Switch the timerqueue code to use cached rbtrees and get rid of the homebrewn caching of the leftmost node. - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a single function - Implement the separation of hrtimers to be forced to expire in hard interrupt context even when PREEMPT_RT is enabled and mark the affected timers accordingly. - Implement a mechanism for hrtimers and the timer wheel to protect RT against priority inversion and live lock issues when a (hr)timer which should be canceled is currently executing the callback. Instead of infinitely spinning, the task which tries to cancel the timer blocks on a per cpu base expiry lock which is held and released by the (hr)timer expiry code. - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests resulting in faster access to timekeeping functions. - Updates to various clocksource/clockevent drivers and their device tree bindings. - The usual small improvements all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits) posix-cpu-timers: Fix permission check regression posix-cpu-timers: Always clear head pointer on dequeue hrtimer: Add a missing bracket and hide `migration_base' on !SMP posix-cpu-timers: Make expiry_active check actually work correctly posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build tick: Mark sched_timer to expire in hard interrupt context hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n posix-cpu-timers: Utilize timerqueue for storage posix-cpu-timers: Move state tracking to struct posix_cputimers posix-cpu-timers: Deduplicate rlimit handling posix-cpu-timers: Remove pointless comparisons posix-cpu-timers: Get rid of 64bit divisions posix-cpu-timers: Consolidate timer expiry further posix-cpu-timers: Get rid of zero checks rlimit: Rewrite non-sensical RLIMIT_CPU comment posix-cpu-timers: Respect INFINITY for hard RTTIME limit posix-cpu-timers: Switch thread group sampling to array posix-cpu-timers: Restructure expiry array posix-cpu-timers: Remove cputime_expires ...
2019-09-16Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann, Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers. As perf and the scheduler is getting bigger and more complex, document the status quo of current responsibilities and interests, and spread the review pain^H^H^H^H fun via an increase in the Cc: linecount generated by scripts/get_maintainer.pl. :-) - Add another series of patches that brings the -rt (PREEMPT_RT) tree closer to mainline: split the monolithic CONFIG_PREEMPT dependencies into a new CONFIG_PREEMPTION category that will allow the eventual introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches to go though. - Extend the CPU cgroup controller with uclamp.min and uclamp.max to allow the finer shaping of CPU bandwidth usage. - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS). - Improve the behavior of high CPU count, high thread count applications running under cpu.cfs_quota_us constraints. - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present. - Improve CPU isolation housekeeping CPU allocation NUMA locality. - Fix deadline scheduler bandwidth calculations and logic when cpusets rebuilds the topology, or when it gets deadline-throttled while it's being offlined. - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from setscheduler() system calls without creating global serialization. Add new synchronization between cpuset topology-changing events and the deadline acceptance tests in setscheduler(), which were broken before. - Rework the active_mm state machine to be less confusing and more optimal. - Rework (simplify) the pick_next_task() slowpath. - Improve load-balancing on AMD EPYC systems. - ... and misc cleanups, smaller fixes and improvements - please see the Git log for more details. * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) sched/psi: Correct overly pessimistic size calculation sched/fair: Speed-up energy-aware wake-ups sched/uclamp: Always use 'enum uclamp_id' for clamp_id values sched/uclamp: Update CPU's refcount on TG's clamp changes sched/uclamp: Use TG's clamps to restrict TASK's clamps sched/uclamp: Propagate system defaults to the root group sched/uclamp: Propagate parent clamps sched/uclamp: Extend CPU's cgroup controller sched/topology: Improve load balancing on AMD EPYC systems arch, ia64: Make NUMA select SMP sched, perf: MAINTAINERS update, add submaintainers and reviewers sched/fair: Use rq_lock/unlock in online_fair_sched_group cpufreq: schedutil: fix equation in comment sched: Rework pick_next_task() slow-path sched: Allow put_prev_task() to drop rq->lock sched/fair: Expose newidle_balance() sched: Add task_struct pointer to sched_class::set_curr_task sched: Rework CPU hotplug task selection sched/{rt,deadline}: Fix set_next_task vs pick_next_task sched: Fix kerneldoc comment for ia64_set_curr_task ...
2019-08-08sched: Rework pick_next_task() slow-pathPeter Zijlstra
Avoid the RETRY_TASK case in the pick_next_task() slow path. By doing the put_prev_task() early, we get the rt/deadline pull done, and by testing rq->nr_running we know if we need newidle_balance(). This then gives a stable state to pick a task from. Since the fast-path is fair only; it means the other classes will always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/aa34d24b36547139248f32a30138791ac6c02bd6.1559129225.git.vpillai@digitalocean.com
2019-08-08sched: Allow put_prev_task() to drop rq->lockPeter Zijlstra
Currently the pick_next_task() loop is convoluted and ugly because of how it can drop the rq->lock and needs to restart the picking. For the RT/Deadline classes, it is put_prev_task() where we do balancing, and we could do this before the picking loop. Make this possible. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/e4519f6850477ab7f3d257062796e6425ee4ba7c.1559129225.git.vpillai@digitalocean.com
2019-08-08sched: Add task_struct pointer to sched_class::set_curr_taskPeter Zijlstra
In preparation of further separating pick_next_task() and set_curr_task() we have to pass the actual task into it, while there, rename the thing to better pair with put_prev_task(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com
2019-08-08sched/{rt,deadline}: Fix set_next_task vs pick_next_taskPeter Zijlstra
Because pick_next_task() implies set_curr_task() and some of the details haven't mattered too much, some of what _should_ be in set_curr_task() ended up in pick_next_task, correct this. This prepares the way for a pick_next_task() variant that does not affect the current state; allowing remote picking. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/38c61d5240553e043c27c5e00b9dd0d184dd6081.1559129225.git.vpillai@digitalocean.com
2019-08-06sched/deadline: Fix double accounting of rq/running bw in push & pullDietmar Eggemann
{push,pull}_dl_task() always calls {de,}activate_task() with .flags=0 which sets p->on_rq=TASK_ON_RQ_MIGRATING. {push,pull}_dl_task()->{de,}activate_task()->{de,en}queue_task()-> {de,en}queue_task_dl() calls {sub,add}_{running,rq}_bw() since p->on_rq==TASK_ON_RQ_MIGRATING. So {sub,add}_{running,rq}_bw() in {push,pull}_dl_task() is double-accounting for that task. Fix it by removing rq/running bw accounting in [push/pull]_dl_task(). Fixes: 7dd778841164 ("sched/core: Unify p->on_rq updates") Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Qais Yousef <qais.yousef@arm.com> Link: https://lkml.kernel.org/r/20190802145945.18702-2-dietmar.eggemann@arm.com
2019-08-01sched/deadline: Ensure inactive_timer runs in hardirq contextJuri Lelli
SCHED_DEADLINE inactive timer needs to run in hardirq context (as dl_task_timer already does) on PREEMPT_RT Change the mode to HRTIMER_MODE_REL_HARD. [ tglx: Fixed up the start site, so mode debugging works ] Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190731103715.4047-1-juri.lelli@redhat.com
2019-08-01sched: Mark hrtimers to expire in hard interrupt contextSebastian Andrzej Siewior
The scheduler related hrtimers need to expire in hard interrupt context even on PREEMPT_RT enabled kernels. Mark then as such. No functional change. [ tglx: Split out from larger combo patch. Add changelog. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190726185753.077004842@linutronix.de
2019-07-25sched/deadline: Fix bandwidth accounting at all levels after offline migrationJuri Lelli
If a task happens to be throttled while the CPU it was running on gets hotplugged off, the bandwidth associated with the task is not correctly migrated with it when the replenishment timer fires (offline_migration). Fix things up, for this_bw, running_bw and total_bw, when replenishment timer fires and task is migrated (dl_task_offline_migration()). Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: mathieu.poirier@linaro.org Cc: rostedt@goodmis.org Cc: tj@kernel.org Cc: tommaso.cucinotta@santannapisa.it Link: https://lkml.kernel.org/r/20190719140000.31694-5-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25cpusets: Rebuild root domain deadline accounting informationMathieu Poirier
When the topology of root domains is modified by CPUset or CPUhotplug operations information about the current deadline bandwidth held in the root domain is lost. This patch addresses the issue by recalculating the lost deadline bandwidth information by circling through the deadline tasks held in CPUsets and adding their current load to the root domain they are associated with. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> [ Various additional modifications. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: rostedt@goodmis.org Cc: tj@kernel.org Cc: tommaso.cucinotta@santannapisa.it Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-09Merge tag 'docs-5.3' of git://git.lwn.net/linuxLinus Torvalds
Pull Documentation updates from Jonathan Corbet: "It's been a relatively busy cycle for docs: - A fair pile of RST conversions, many from Mauro. These create more than the usual number of simple but annoying merge conflicts with other trees, unfortunately. He has a lot more of these waiting on the wings that, I think, will go to you directly later on. - A new document on how to use merges and rebases in kernel repos, and one on Spectre vulnerabilities. - Various improvements to the build system, including automatic markup of function() references because some people, for reasons I will never understand, were of the opinion that :c:func:``function()`` is unattractive and not fun to type. - We now recommend using sphinx 1.7, but still support back to 1.4. - Lots of smaller improvements, warning fixes, typo fixes, etc" * tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits) docs: automarkup.py: ignore exceptions when seeking for xrefs docs: Move binderfs to admin-guide Disable Sphinx SmartyPants in HTML output doc: RCU callback locks need only _bh, not necessarily _irq docs: format kernel-parameters -- as code Doc : doc-guide : Fix a typo platform: x86: get rid of a non-existent document Add the RCU docs to the core-api manual Documentation: RCU: Add TOC tree hooks Documentation: RCU: Rename txt files to rst Documentation: RCU: Convert RCU UP systems to reST Documentation: RCU: Convert RCU linked list to reST Documentation: RCU: Convert RCU basic concepts to reST docs: filesystems: Remove uneeded .rst extension on toctables scripts/sphinx-pre-install: fix out-of-tree build docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/ Documentation: PGP: update for newer HW devices Documentation: Add section about CPU vulnerabilities for Spectre Documentation: platform: Delete x86-laptop-drivers.txt docs: Note that :c:func: should no longer be used ...
2019-06-24sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()Vincent Guittot
The 'struct sched_domain *sd' parameter to arch_scale_cpu_capacity() is unused since commit: 765d0af19f5f ("sched/topology: Remove the ::smt_gain field from 'struct sched_domain'") Remove it. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: gregkh@linuxfoundation.org Cc: linux@armlinux.org.uk Cc: quentin.perret@arm.com Cc: rafael@kernel.org Link: https://lkml.kernel.org/r/1560783617-5827-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-14docs: scheduler: convert docs to ReST and rename to *.rstMauro Carvalho Chehab
In order to prepare to add them to the Kernel API book, convert the files to ReST format. The conversion is actually: - add blank lines and identation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-03sched/core: Provide a pointer to the valid CPU maskSebastian Andrzej Siewior
In commit: 4b53a3412d66 ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper") the tsk_nr_cpus_allowed() wrapper was removed. There was not much difference in !RT but in RT we used this to implement migrate_disable(). Within a migrate_disable() section the CPU mask is restricted to single CPU while the "normal" CPU mask remains untouched. As an alternative implementation Ingo suggested to use: struct task_struct { const cpumask_t *cpus_ptr; cpumask_t cpus_mask; }; with t->cpus_ptr = &t->cpus_mask; In -RT we then can switch the cpus_ptr to: t->cpus_ptr = &cpumask_of(task_cpu(p)); in a migration disabled region. The rules are simple: - Code that 'uses' ->cpus_allowed would use the pointer. - Code that 'modifies' ->cpus_allowed would use the direct mask. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190423142636.14347-1-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16sched/deadline: Correctly handle active 0-lag timersluca abeni
syzbot reported the following warning: [ ] WARNING: CPU: 4 PID: 17089 at kernel/sched/deadline.c:255 task_non_contending+0xae0/0x1950 line 255 of deadline.c is: WARN_ON(hrtimer_active(&dl_se->inactive_timer)); in task_non_contending(). Unfortunately, in some cases (for example, a deadline task continuosly blocking and waking immediately) it can happen that a task blocks (and task_non_contending() is called) while the 0-lag timer is still active. In this case, the safest thing to do is to immediately decrease the running bandwidth of the task, without trying to re-arm the 0-lag timer. Signed-off-by: luca abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: chengjian (D) <cj.chengjian@huawei.com> Link: https://lkml.kernel.org/r/20190325131530.34706-1-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04sched/fair: Update scale invariance of PELTVincent Guittot
The current implementation of load tracking invariance scales the contribution with current frequency and uarch performance (only for utilization) of the CPU. One main result of this formula is that the figures are capped by current capacity of CPU. Another one is that the load_avg is not invariant because not scaled with uarch. The util_avg of a periodic task that runs r time slots every p time slots varies in the range : U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p) with U is the max util_avg value = SCHED_CAPACITY_SCALE At a lower capacity, the range becomes: U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p) with C reflecting the compute capacity ratio between current capacity and max capacity. so C tries to compensate changes in (1-y^r') but it can't be accurate. Instead of scaling the contribution value of PELT algo, we should scale the running time. The PELT signal aims to track the amount of computation of tasks and/or rq so it seems more correct to scale the running time to reflect the effective amount of computation done since the last update. In order to be fully invariant, we need to apply the same amount of running time and idle time whatever the current capacity. Because running at lower capacity implies that the task will run longer, we have to ensure that the same amount of idle time will be applied when system becomes idle and no idle time has been "stolen". But reaching the maximum utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an always-running task whatever the capacity of the CPU (even at max compute capacity). In this case, we can discard this "stolen" idle times which becomes meaningless. In order to achieve this time scaling, a new clock_pelt is created per rq. The increase of this clock scales with current capacity when something is running on rq and synchronizes with clock_task when rq is idle. With this mechanism, we ensure the same running and idle time whatever the current capacity. This also enables to simplify the pelt algorithm by removing all references of uarch and frequency and applying the same contribution to utilization and loads. Furthermore, the scaling is done only once per update of clock (update_rq_clock_task()) instead of during each update of sched_entities and cfs/rt/dl_rq of the rq like the current implementation. This is interesting when cgroup are involved as shown in the results below: On a hikey (octo Arm64 platform). Performance cpufreq governor and only shallowest c-state to remove variance generated by those power features so we only track the impact of pelt algo. each test runs 16 times: ./perf bench sched pipe (higher is better) kernel tip/sched/core + patch ops/seconds ops/seconds diff cgroup root 59652(+/- 0.18%) 59876(+/- 0.24%) +0.38% level1 55608(+/- 0.27%) 55923(+/- 0.24%) +0.57% level2 52115(+/- 0.29%) 52564(+/- 0.22%) +0.86% hackbench -l 1000 (lower is better) kernel tip/sched/core + patch duration(sec) duration(sec) diff cgroup root 4.453(+/- 2.37%) 4.383(+/- 2.88%) -1.57% level1 4.859(+/- 8.50%) 4.830(+/- 7.07%) -0.60% level2 5.063(+/- 9.83%) 4.928(+/- 9.66%) -2.66% Then, the responsiveness of PELT is improved when CPU is not running at max capacity with this new algorithm. I have put below some examples of duration to reach some typical load values according to the capacity of the CPU with current implementation and with this patch. These values has been computed based on the geometric series and the half period value: Util (%) max capacity half capacity(mainline) half capacity(w/ patch) 972 (95%) 138ms not reachable 276ms 486 (47.5%) 30ms 138ms 60ms 256 (25%) 13ms 32ms 26ms On my hikey (octo Arm64 platform) with schedutil governor, the time to reach max OPP when starting from a null utilization, decreases from 223ms with current scale invariance down to 121ms with the new algorithm. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: patrick.bellasi@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11sched/core: Remove unnecessary unlikely() in push_*_task()Yangtao Li
WARN_ON() already contains an unlikely(), so it's not necessary to use WARN_ON(1). Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181103172602.1917-1-tiny.windzz@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-03sched: Fix various typos in commentsIngo Molnar
Go over the scheduler source code and fix common typos in comments - and a typo in an actual variable name. No change in functionality intended. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-04sched/core: Introduce set_next_task() helper for better code readabilityMuchun Song
When we pick the next task, we will do the following for the task: 1) p->se.exec_start = rq_clock_task(rq); 2) dequeue_pushable(_dl)_task(rq, p); When we call set_curr_task(), we also need to do the same thing above. In rt.c, the code at 1) is in the _pick_next_task_rt() and the code at 2) is in the pick_next_task_rt(). If we put two operations in one function, maybe better. So, we introduce a new function set_next_task(), which is responsible for doing the above. By introducing the function we can get rid of calling the dequeue_pushable(_dl)_task() directly(We can call set_next_task()) in pick_next_task() and have better code readability and reuse. In set_curr_task_rt(), we also can call set_next_task(). Do this things such that we end up with: static struct task_struct *pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { /* do something else ... */ put_prev_task(rq, prev); /* pick next task p */ set_next_task(rq, p); /* do something else ... */ } put_prev_task() can match set_next_task(), which can make the code more readable. Signed-off-by: Muchun Song <smuchun@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181026131743.21786-1-smuchun@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02sched/numa: Pass destination CPU as a parameter to migrate_task_rqSrikar Dronamraju
This additional parameter (new_cpu) is used later for identifying if task migration is across nodes. No functional change. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 203353 200668 -1.32036 1 328205 321791 -1.95427 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 214384 204848 -4.44809 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 188553 188098 -0.241311 1 196273 200351 2.07772 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 57581.2 58145.9 0.980702 1 103468 103798 0.318939 Brings out the variance between different specjbb2005 runs. Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,941,377 13,912,183 migrations 1,157,323 1,155,931 faults 382,175 367,139 cache-misses 54,993,823,500 54,240,196,814 sched:sched_move_numa 2,005 1,571 sched:sched_stick_numa 14 9 sched:sched_swap_numa 529 463 migrate:mm_migrate_pages 1,573 703 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 67099 50155 numa_hint_faults_local 58456 45264 numa_hit 240416 239652 numa_huge_pte_updates 18 36 numa_interleave 65 68 numa_local 240339 239576 numa_other 77 76 numa_pages_migrated 1574 680 numa_pte_updates 77182 71146 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,176,453 3,156,720 migrations 30,238 30,354 faults 87,869 97,261 cache-misses 12,544,479,391 12,400,026,826 sched:sched_move_numa 23 4 sched:sched_stick_numa 0 0 sched:sched_swap_numa 6 1 migrate:mm_migrate_pages 10 20 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 236 272 numa_hint_faults_local 201 186 numa_hit 72293 71362 numa_huge_pte_updates 0 0 numa_interleave 26 23 numa_local 72233 71299 numa_other 60 63 numa_pages_migrated 8 2 numa_pte_updates 0 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,478,820 8,606,824 migrations 171,323 155,352 faults 307,499 301,409 cache-misses 240,353,599 157,759,224 sched:sched_move_numa 214 168 sched:sched_stick_numa 0 0 sched:sched_swap_numa 4 3 migrate:mm_migrate_pages 89 125 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 5301 4650 numa_hint_faults_local 4745 3946 numa_hit 92943 90489 numa_huge_pte_updates 0 0 numa_interleave 899 892 numa_local 92345 90034 numa_other 598 455 numa_pages_migrated 88 124 numa_pte_updates 5505 4818 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,066,172 2,113,167 migrations 11,076 10,533 faults 149,544 142,727 cache-misses 10,398,067 5,594,192 sched:sched_move_numa 43 10 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 6 6 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 3552 744 numa_hint_faults_local 3347 584 numa_hit 25611 25551 numa_huge_pte_updates 0 0 numa_interleave 213 263 numa_local 25583 25302 numa_other 28 249 numa_pages_migrated 6 6 numa_pte_updates 3535 744 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 99,358,136 101,227,352 migrations 4,041,607 4,151,829 faults 749,653 745,233 cache-misses 225,562,543,251 224,669,561,766 sched:sched_move_numa 771 617 sched:sched_stick_numa 14 2 sched:sched_swap_numa 204 187 migrate:mm_migrate_pages 1,180 316 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 27409 24195 numa_hint_faults_local 20677 21639 numa_hit 239988 238331 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 239983 238331 numa_other 5 0 numa_pages_migrated 1016 204 numa_pte_updates 27916 24561 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 60,899,307 62,738,978 migrations 544,668 562,702 faults 270,834 228,465 cache-misses 74,543,455,635 75,778,067,952 sched:sched_move_numa 735 648 sched:sched_stick_numa 25 13 sched:sched_swap_numa 174 137 migrate:mm_migrate_pages 816 733 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 11059 10281 numa_hint_faults_local 4733 3242 numa_hit 41384 36338 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 41383 36338 numa_other 1 0 numa_pages_migrated 815 706 numa_pte_updates 11323 10176 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25sched/deadline: Update rq_clock of later_rq when pushing a taskDaniel Bristot de Oliveira
Daniel Casini got this warn while running a DL task here at RetisLab: [ 461.137582] ------------[ cut here ]------------ [ 461.137583] rq->clock_update_flags < RQCF_ACT_SKIP [ 461.137599] WARNING: CPU: 4 PID: 2354 at kernel/sched/sched.h:967 assert_clock_updated.isra.32.part.33+0x17/0x20 [a ton of modules] [ 461.137646] CPU: 4 PID: 2354 Comm: label_image Not tainted 4.18.0-rc4+ #3 [ 461.137647] Hardware name: ASUS All Series/Z87-K, BIOS 0801 09/02/2013 [ 461.137649] RIP: 0010:assert_clock_updated.isra.32.part.33+0x17/0x20 [ 461.137649] Code: ff 48 89 83 08 09 00 00 eb c6 66 0f 1f 84 00 00 00 00 00 55 48 c7 c7 98 7a 6c a5 c6 05 bc 0d 54 01 01 48 89 e5 e8 a9 84 fb ff <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 83 7e 60 01 74 0a 48 3b [ 461.137673] RSP: 0018:ffffa77e08cafc68 EFLAGS: 00010082 [ 461.137674] RAX: 0000000000000000 RBX: ffff8b3fc1702d80 RCX: 0000000000000006 [ 461.137674] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8b3fded164b0 [ 461.137675] RBP: ffffa77e08cafc68 R08: 0000000000000026 R09: 0000000000000339 [ 461.137676] R10: ffff8b3fd060d410 R11: 0000000000000026 R12: ffffffffa4e14e20 [ 461.137677] R13: ffff8b3fdec22940 R14: ffff8b3fc1702da0 R15: ffff8b3fdec22940 [ 461.137678] FS: 00007efe43ee5700(0000) GS:ffff8b3fded00000(0000) knlGS:0000000000000000 [ 461.137679] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 461.137680] CR2: 00007efe30000010 CR3: 0000000301744003 CR4: 00000000001606e0 [ 461.137680] Call Trace: [ 461.137684] push_dl_task.part.46+0x3bc/0x460 [ 461.137686] task_woken_dl+0x60/0x80 [ 461.137689] ttwu_do_wakeup+0x4f/0x150 [ 461.137690] ttwu_do_activate+0x77/0x80 [ 461.137692] try_to_wake_up+0x1d6/0x4c0 [ 461.137693] wake_up_q+0x32/0x70 [ 461.137696] do_futex+0x7e7/0xb50 [ 461.137698] __x64_sys_futex+0x8b/0x180 [ 461.137701] do_syscall_64+0x5a/0x110 [ 461.137703] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 461.137705] RIP: 0033:0x7efe4918ca26 [ 461.137705] Code: 00 00 00 74 17 49 8b 48 20 44 8b 59 10 41 83 e3 30 41 83 fb 20 74 1e be 85 00 00 00 41 ba 01 00 00 00 41 b9 01 00 00 04 0f 05 <48> 3d 01 f0 ff ff 73 1f 31 c0 c3 be 8c 00 00 00 49 89 c8 4d 31 d2 [ 461.137738] RSP: 002b:00007efe43ee4928 EFLAGS: 00000283 ORIG_RAX: 00000000000000ca [ 461.137739] RAX: ffffffffffffffda RBX: 0000000005094df0 RCX: 00007efe4918ca26 [ 461.137740] RDX: 0000000000000001 RSI: 0000000000000085 RDI: 0000000005094e24 [ 461.137741] RBP: 00007efe43ee49c0 R08: 0000000005094e20 R09: 0000000004000001 [ 461.137741] R10: 0000000000000001 R11: 0000000000000283 R12: 0000000000000000 [ 461.137742] R13: 0000000005094df8 R14: 0000000000000001 R15: 0000000000448a10 [ 461.137743] ---[ end trace 187df4cad2bf7649 ]--- This warning happened in the push_dl_task(), because __add_running_bw()->cpufreq_update_util() is getting the rq_clock of the later_rq before its update, which takes place at activate_task(). The fix then is to update the rq_clock before calling add_running_bw(). To avoid double rq_clock_update() call, we set ENQUEUE_NOCLOCK flag to activate_task(). Reported-by: Daniel Casini <daniel.casini@santannapisa.it> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it> Fixes: e0367b12674b sched/deadline: Move CPU frequency selection triggering points Link: http://lkml.kernel.org/r/ca31d073a4788acf0684a8b255f14fea775ccf20.1532077269.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16sched/core: Use PELT for scale_rt_capacity()Vincent Guittot
The utilization of the CPU by RT, DL and IRQs are now tracked with PELT so we can use these metrics instead of rt_avg to evaluate the remaining capacity available for CFS class. scale_rt_capacity() behavior has been changed and now returns the remaining capacity available for CFS instead of a scaling factor because RT, DL and IRQ provide now absolute utilization value. The same formula as schedutil is used: IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg but the implementation is different because it doesn't return the same value and doesn't benefit of the same optimization. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15sched/dl: Add dl_rq utilization trackingVincent Guittot
Similarly to what happens with RT tasks, CFS tasks can be preempted by DL tasks and the CFS's utilization might no longer describes the real utilization level. Current DL bandwidth reflects the requirements to meet deadline when tasks are enqueued but not the current utilization of the DL sched class. We track DL class utilization to estimate the system utilization. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-5-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15sched/deadline: Fix switched_from_dl() warningJuri Lelli
Mark noticed that syzkaller is able to reliably trigger the following warning: dl_rq->running_bw > dl_rq->this_bw WARNING: CPU: 1 PID: 153 at kernel/sched/deadline.c:124 switched_from_dl+0x454/0x608 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 153 Comm: syz-executor253 Not tainted 4.18.0-rc3+ #29 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x458 show_stack+0x20/0x30 dump_stack+0x180/0x250 panic+0x2dc/0x4ec __warn_printk+0x0/0x150 report_bug+0x228/0x2d8 bug_handler+0xa0/0x1a0 brk_handler+0x2f0/0x568 do_debug_exception+0x1bc/0x5d0 el1_dbg+0x18/0x78 switched_from_dl+0x454/0x608 __sched_setscheduler+0x8cc/0x2018 sys_sched_setattr+0x340/0x758 el0_svc_naked+0x30/0x34 syzkaller reproducer runs a bunch of threads that constantly switch between DEADLINE and NORMAL classes while interacting through futexes. The splat above is caused by the fact that if a DEADLINE task is setattr back to NORMAL while in non_contending state (blocked on a futex - inactive timer armed), its contribution to running_bw is not removed before sub_rq_bw() gets called (!task_on_rq_queued() branch) and the latter sees running_bw > this_bw. Fix it by removing a task contribution from running_bw if the task is not queued and in non_contending state while switched to a different class. Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20180711072948.27061-1-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-31sched/deadline: Fix missing clock updateJuri Lelli
A missing clock update is causing the following warning: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720 Call Trace: <IRQ> __hrtimer_run_queues+0x10f/0x530 hrtimer_interrupt+0xe5/0x240 smp_apic_timer_interrupt+0x79/0x2b0 apic_timer_interrupt+0xf/0x20 </IRQ> do_idle+0x203/0x280 cpu_startup_entry+0x6f/0x80 start_secondary+0x1b0/0x200 secondary_startup_64+0xa5/0xb0 hardirqs last enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360 hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0 softirqs last enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70 softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70 This happens because inactive_task_timer() calls sub_running_bw() (if TASK_DEAD and non_contending) that might trigger a schedutil update, which might access the clock. Clock is however currently updated only later in inactive_task_timer() function. Fix the problem by updating the clock right after task_rq_lock(). Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180530160809.9074-1-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-18sched/deadline: Make the grub_reclaim() function staticMathieu Malaterre
Since the grub_reclaim() function can be made static, make it so. Silences the following GCC warning (W=1): kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes] Signed-off-by: Mathieu Malaterre <malat@debian.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-18sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to ↵Mathieu Malaterre
kernel/sched/sched.h In the following commit: 6b55c9654fcc ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h") the print_cfs_rq() prototype was added to <kernel/sched/sched.h>, right next to the prototypes for print_cfs_stats(), print_rt_stats() and print_dl_stats(). Finish this previous commit and also move related prototypes for print_rt_rq() and print_dl_rq(). Remove existing extern declarations now that they not needed anymore. Silences the following GCC warning, triggered by W=1: kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes] kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes] Signed-off-by: Mathieu Malaterre <malat@debian.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05sched/core: Simplify helpers for rq clock update skip requestsDavidlohr Bueso
By renaming the functions we can get rid of the skip parameter and have better code redability. It makes zero sense to have things such as: rq_clock_skip_update(rq, false) When the skip request is in fact not going to happen. Ever. Rename things such that we end up with: rq_clock_skip_update(rq) rq_clock_cancel_skipupdate(rq) Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: matt@codeblueprint.co.uk Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20180404161539.nhadkff2aats74jh@linux-n805 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09cpufreq/schedutil: Remove unused CPUFREQ_DLPeter Zijlstra
Bitrot... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-04sched/deadline, rt: Rename queue_push_tasks/queue_pull_task to create ↵Ingo Molnar
separate namespace There are similarly named functions in both of these modules: kernel/sched/deadline.c:static inline void queue_push_tasks(struct rq *rq) kernel/sched/deadline.c:static inline void queue_pull_task(struct rq *rq) kernel/sched/deadline.c:static inline void queue_push_tasks(struct rq *rq) kernel/sched/deadline.c:static inline void queue_pull_task(struct rq *rq) kernel/sched/deadline.c: queue_push_tasks(rq); kernel/sched/deadline.c: queue_pull_task(rq); kernel/sched/deadline.c: queue_push_tasks(rq); kernel/sched/deadline.c: queue_pull_task(rq); kernel/sched/rt.c:static inline void queue_push_tasks(struct rq *rq) kernel/sched/rt.c:static inline void queue_pull_task(struct rq *rq) kernel/sched/rt.c:static inline void queue_push_tasks(struct rq *rq) kernel/sched/rt.c: queue_push_tasks(rq); kernel/sched/rt.c: queue_pull_task(rq); kernel/sched/rt.c: queue_push_tasks(rq); kernel/sched/rt.c: queue_pull_task(rq); ... which makes it harder to grep for them. Prefix them with deadline_ and rt_, respectively. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-04sched/headers: Simplify and clean up header usage in the schedulerIngo Molnar
Do the following cleanups and simplifications: - sched/sched.h already includes <asm/paravirt.h>, so no need to include it in sched/core.c again. - order the <linux/sched/*.h> headers alphabetically - add all <linux/sched/*.h> headers to kernel/sched/sched.h - remove all unnecessary includes from the .c files that are already included in kernel/sched/sched.h. Finally, make all scheduler .c files use a single common header: #include "sched.h" ... which now contains a union of the relied upon headers. This makes the various .c files easier to read and easier to handle. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-03sched: Clean up and harmonize the coding style of the scheduler code baseIngo Molnar
A good number of small style inconsistencies have accumulated in the scheduler core, so do a pass over them to harmonize all these details: - fix speling in comments, - use curly braces for multi-line statements, - remove unnecessary parentheses from integer literals, - capitalize consistently, - remove stray newlines, - add comments where necessary, - remove invalid/unnecessary comments, - align structure definitions and other data types vertically, - add missing newlines for increased readability, - fix vertical tabulation where it's misaligned, - harmonize preprocessor conditional block labeling and vertical alignment, - remove line-breaks where they uglify the code, - add newline after local variable definitions, No change in functionality: md5: 1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm 1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21sched/isolation: Offload residual 1Hz scheduler tickFrederic Weisbecker
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to keep the scheduler stats alive. However this residual tick is a burden for bare metal tasks that can't stand any interruption at all, or want to minimize them. The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now outsource these scheduler ticks to the global workqueue so that a housekeeping CPU handles those remotely. The sched_class::task_tick() implementations have been audited and look safe to be called remotely as the target runqueue and its current task are passed in parameter and don't seem to be accessed locally. Note that in the case of using isolcpus, it's still up to the user to affine the global workqueues to the housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or domains isolation "isolcpus=nohz,domain". Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13sched/deadline: Make update_curr_dl() more accurateWen Yang
rq->clock_task may be updated between the two calls of rq_clock_task() in update_curr_dl(). Calling rq_clock_task() only once makes it more accurate and efficient, taking update_curr() as reference. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: zhong.weidong@zte.com.cn Link: http://lkml.kernel.org/r/1517882148-44599-1-git-send-email-wen.yang99@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10sched/deadline: Make bandwidth enforcement scale-invariantJuri Lelli
Apply frequency and CPU scale-invariance correction factor to bandwidth enforcement (similar to what we already do to fair utilization tracking). Each delta_exec gets scaled considering current frequency and maximum CPU capacity; which means that the reservation runtime parameter (that need to be specified profiling the task execution at max frequency on biggest capacity core) gets thus scaled accordingly. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-9-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10sched/cpufreq: Change the worker kthread to SCHED_DEADLINEJuri Lelli
Worker kthread needs to be able to change frequency for all other threads. Make it special, just under STOP class. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-4-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10sched/deadline: Move CPU frequency selection triggering pointsJuri Lelli
Since SCHED_DEADLINE doesn't track utilization signal (but reserves a fraction of CPU bandwidth to tasks admitted to the system), there is no point in evaluating frequency changes during each tick event. Move frequency selection triggering points to where running_bw changes. Co-authored-by: Claudio Scordino <claudio@evidence.eu.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-3-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10sched/deadline: Implement "runtime overrun signal" supportJuri Lelli
This patch adds the possibility of getting the delivery of a SIGXCPU signal whenever there is a runtime overrun. The request is done through the sched_flags field within the sched_attr structure. Forward port of https://lkml.org/lkml/2009/10/16/170 Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Claudio Scordino <claudio@evidence.eu.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1513077024-25461-1-git-send-email-claudio@evidence.eu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-15Merge branch 'for-4.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Cgroup2 cpu controller support is finally merged. - Basic cpu statistics support to allow monitoring by default without the CPU controller enabled. - cgroup2 cpu controller support. - /sys/kernel/cgroup files to help dealing with new / optional features" * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: export list of cgroups v2 features using sysfs cgroup: export list of delegatable control files using sysfs cgroup: mark @cgrp __maybe_unused in cpu_stat_show() MAINTAINERS: relocate cpuset.c cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat sched: Implement interface for cgroup unified hierarchy sched: Misc preps for cgroup unified hierarchy interface sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE cgroup: statically initialize init_css_set->dfl_cgrp cgroup: Implement cgroup2 basic CPU usage accounting cpuacct: Introduce cgroup_account_cputime[_field]() sched/cputime: Expose cputime_adjust()
2017-11-08Merge branch 'linus' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-10sched/deadline: Rename __dl_clear() to __dl_sub()Peter Zijlstra
__dl_sub() is more meaningful as a name, and is more consistent with the naming of the dual function (__dl_add()). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1504778971-13573-4-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10sched/deadline: Fix switching to -deadlineLuca Abeni
Fix a bug introduced in: 72f9f3fdc928 ("sched/deadline: Remove dl_new from struct sched_dl_entity") After that commit, when switching to -deadline if the scheduling deadline of a task is in the past then switched_to_dl() calls setup_new_entity() to properly initialize the scheduling deadline and runtime. The problem is that the task is enqueued _before_ having its parameters initialized by setup_new_entity(), and this can cause problems. For example, a task with its out-of-date deadline in the past will potentially be enqueued as the highest priority one; however, its adjusted deadline may not be the earliest one. This patch fixes the problem by initializing the task's parameters before enqueuing it. Signed-off-by: luca abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1504778971-13573-3-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-25cpuacct: Introduce cgroup_account_cputime[_field]()Tejun Heo
Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge() and cgroup_account_field(). This doesn't introduce any functional changes and will be used to add cgroup basic resource accounting. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com>
2017-09-08sched/deadline: replace earliest dl and rq leftmost cachingDavidlohr Bueso
... with the generic rbtree flavor instead. No changes in semantics whatsoever. Link: http://lkml.kernel.org/r/20170719014603.19029-9-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-05Merge tag 'pm-4.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "This time (again) cpufreq gets the majority of changes which mostly are driver updates (including a major consolidation of intel_pstate), some schedutil governor modifications and core cleanups. There also are some changes in the system suspend area, mostly related to diagnostics and debug messages plus some renames of things related to suspend-to-idle. One major change here is that suspend-to-idle is now going to be preferred over S3 on systems where the ACPI tables indicate to do so and provide requsite support (the Low Power Idle S0 _DSM in particular). The system sleep documentation and the tools related to it are updated too. The rest is a few cpuidle changes (nothing major), devfreq updates, generic power domains (genpd) framework updates and a few assorted modifications elsewhere. Specifics: - Drop the P-state selection algorithm based on a PID controller from intel_pstate and make it use the same P-state selection method (based on the CPU load) for all types of systems in the active mode (Rafael Wysocki, Srinivas Pandruvada). - Rework the cpufreq core and governors to make it possible to take cross-CPU utilization updates into account and modify the schedutil governor to actually do so (Viresh Kumar). - Clean up the handling of transition latency information in the cpufreq core and untangle it from the information on which drivers cannot do dynamic frequency switching (Viresh Kumar). - Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek cpufreq driver and update its DT bindings (Sean Wang). - Modify the cpufreq dt-platdev driver to autimatically create cpufreq devices for the new (v2) Operating Performance Points (OPP) DT bindings and update its whitelist of supported systems (Viresh Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley Xiao). - Add support for Ux500 to the cpufreq-dt driver and drop the obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann). - Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem Nguyen). - Fix and clean up assorted issues in the cpufreq drivers and core (Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva, Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla). - Update the IO-wait boost handling in the schedutil governor to make it less aggressive (Joel Fernandes). - Rework system suspend diagnostics to make it print fewer messages to the kernel log by default, add a sysfs knob to allow more suspend-related messages to be printed and add Low Power S0 Idle constraints checks to the ACPI suspend-to-idle code (Rafael Wysocki, Srinivas Pandruvada). - Prefer suspend-to-idle over S3 on ACPI-based systems with the ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM interface present in the ACPI tables (Rafael Wysocki). - Update documentation related to system sleep and rename a number of items in the code to make it cleare that they are related to suspend-to-idle (Rafael Wysocki). - Export a variable allowing device drivers to check the target system sleep state from the core system suspend code (Florian Fainelli). - Clean up the cpuidle subsystem to handle the polling state on x86 in a more straightforward way and to use %pOF instead of full_name (Rafael Wysocki, Rob Herring). - Update the devfreq framework to fix and clean up a few minor issues (Chanwoo Choi, Rob Herring). - Extend diagnostics in the generic power domains (genpd) framework and clean it up slightly (Thara Gopinath, Rob Herring). - Fix and clean up a couple of issues in the operating performance points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz). - Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling (AVS) driver (David Wu). - Fix the usage of notifiers in CPU power management on some platforms (Alex Shi). - Update the pm-graph system suspend/hibernation and boot profiling utility (Todd Brandt). - Make it possible to run the cpupower utility without CPU0 (Prarit Bhargava)" * tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits) cpuidle: Make drivers initialize polling state cpuidle: Move polling state initialization code to separate file cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol cpufreq: imx6q: Fix imx6sx low frequency support cpufreq: speedstep-lib: make several arrays static, makes code smaller PM: docs: Delete the obsolete states.txt document PM: docs: Describe high-level PM strategies and sleep states PM / devfreq: Fix memory leak when fail to register device PM / devfreq: Add dependency on PM_OPP PM / devfreq: Move private devfreq_update_stats() into devfreq PM / devfreq: Convert to using %pOF instead of full_name PM / AVS: rockchip-io: add io selectors and supplies for RV1108 cpufreq: ti: Fix 'of_node_put' being called twice in error handling path cpufreq: dt-platdev: Drop few entries from whitelist cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2 ARM: ux500: don't select CPUFREQ_DT cpuidle: Convert to using %pOF instead of full_name cpufreq: Convert to using %pOF instead of full_name PM / Domains: Convert to using %pOF instead of full_name cpufreq: Cap the default transition delay value to 10 ms ...
2017-08-10sched/deadline: Change return value of cpudl_find()Byungchul Park
cpudl_find() users are only interested in knowing if suitable CPU(s) were found or not (and then they look at later_mask to know which). Change cpudl_find() return type accordingly. Aligns with rt code. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <bristot@redhat.com> Cc: <juri.lelli@gmail.com> Cc: <kernel-team@lge.com> Cc: <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1495504859-10960-3-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>