summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2017-05-23sched/deadline: Remove unnecessary condition in push_dl_task()Byungchul Park
pick_next_pushable_dl_task(rq) has BUG_ON(rq->cpu != task_cpu(task)) when it returns a task other than NULL, which means that task_cpu(task) must be rq->cpu. So if task == next_task, then task_cpu(next_task) must be rq->cpu as well. Remove the redundant condition and make the code simpler. This way one unnecessary branch and two LOAD operations can be avoided. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: <kernel-team@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1494551159-22367-1-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23sched/rt: Remove unnecessary condition in push_rt_task()Byungchul Park
pick_next_pushable_task(rq) has BUG_ON(rq_cpu != task_cpu(task)) when it returns a task other than NULL, which means that task_cpu(task) must be rq->cpu. So if task == next_task, then task_cpu(next_task) must be rq->cpu as well. Remove the redundant condition and make the code simpler. This way one unnecessary branch and two LOAD operations can be avoided. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: <kernel-team@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1494551143-22219-1-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23sched/core: Use the new llist_for_each_entry_safe() primitiveByungchul Park
Now that we've added llist_for_each_entry_safe(), use it to simplify an open coded version of it in sched_ttwu_pending(). Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <kernel-team@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1494549584-11730-1-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23Merge branch 'linus' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-22Merge branches 'intel_pstate', 'pm-cpufreq' and 'pm-cpufreq-sched'Rafael J. Wysocki
* intel_pstate: cpufreq: intel_pstate: Document the current behavior and user interface * pm-cpufreq: cpufreq: dbx500: add a Kconfig symbol * pm-cpufreq-sched: cpufreq: schedutil: use now as reference when aggregating shared policy requests
2017-05-15sched/fair: Fix O(nr_cgroups) in load balance pathTejun Heo
Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all live cfs_rqs which have ever been active on the CPU; unfortunately, this makes update_blocked_averages() O(# total cgroups) which isn't scalable at all. This shows up as a small CPU consumption and scheduling latency increase in the load balancing path in systems with CPU controller enabled across most cgroups. In an edge case where temporary cgroups were leaking, this caused the kernel to consume good several tens of percents of CPU cycles running update_blocked_averages(), each run taking multiple millisecs. This patch fixes the issue by taking empty and fully decayed cfs_rqs off the rq->leaf_cfs_rq_list. Signed-off-by: Tejun Heo <tj@kernel.org> [ Added cfs_rq_is_decayed() ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170426004350.GB3222@wtj.duckdns.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqsPeter Zijlstra
In order to allow leaf_cfs_rq_list to remove entries switch the bandwidth hotplug code over to the task_groups list. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170504133122.a6qjlj3hlblbjxux@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Rename sched_group_cpus()Peter Zijlstra
There's a discrepancy in naming between the sched_domain and sched_group cpumask accessor. Since we're doing changes, fix it. $ git grep sched_group_cpus | wc -l 28 $ git grep sched_domain_span | wc -l 38 Suggests changing sched_group_cpus() into sched_group_span(): for i in `git grep -l sched_group_cpus` do sed -ie 's/sched_group_cpus/sched_group_span/g' $i done Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Rename sched_group_mask()Peter Zijlstra
Since sched_group_mask() is now an independent cpumask (it no longer masks sched_group_cpus()), rename the thing. Suggested-by: Lauro Ramos Venancio <lvenanci@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Simplify sched_group_mask() usagePeter Zijlstra
While writing the comments, it occurred to me that: sg_cpus & sg_mask == sg_mask at least conceptually; the !overlap case sets the all 1s mask. If we correct that we can simplify things and directly use sg_mask. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Rewrite get_group()Peter Zijlstra
We want to attain: sg_cpus() & sg_mask() == sg_mask() for this to be so we must initialize sg_mask() to sg_cpus() for the !overlap case (its currently cpumask_setall()). Since the code makes my head hurt bad, rewrite it into a simpler form, inspired by the now fixed overlap code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Add a few commentsPeter Zijlstra
Try and describe what this code is about.. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Fix overlapping sched_group_capacityPeter Zijlstra
When building the overlapping groups we need to attach a consistent sched_group_capacity structure. That is, all 'identical' sched_group's should have the _same_ sched_group_capacity. This can (once again) be demonstrated with a topology like: node 0 1 2 3 0: 10 20 30 20 1: 20 10 20 30 2: 30 20 10 20 3: 20 30 20 10 But we need at least 2 CPUs per node for this to show up, after all, if there is only one CPU per node, our CPU @i is per definition a unique CPU that reaches this domain (aka balance-cpu). Given the above NUMA topo and 2 CPUs per node: [] CPU0 attaching sched-domain(s): [] domain-0: span=0,4 level=DIE [] groups: 0:{ span=0 }, 4:{ span=4 } [] domain-1: span=0-1,3-5,7 level=NUMA [] groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 } [] domain-2: span=0-7 level=NUMA [] groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 } [] CPU1 attaching sched-domain(s): [] domain-0: span=1,5 level=DIE [] groups: 1:{ span=1 }, 5:{ span=5 } [] domain-1: span=0-2,4-6 level=NUMA [] groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 4:{ span=0,4 mask=0,4 cap=2048 } [] domain-2: span=0-7 level=NUMA [] groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 } Observe how CPU0-domain1-group0 and CPU1-domain1-group4 are the 'same' but have a different id (0 vs 4). To fix this, use the group balance CPU to select the SGC. This means we have to compute the full mask for each CPU and require a second temporary mask to store the group mask in (it otherwise lives in the SGC). The fixed topology looks like: [] CPU0 attaching sched-domain(s): [] domain-0: span=0,4 level=DIE [] groups: 0:{ span=0 }, 4:{ span=4 } [] domain-1: span=0-1,3-5,7 level=NUMA [] groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 } [] domain-2: span=0-7 level=NUMA [] groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 } [] CPU1 attaching sched-domain(s): [] domain-0: span=1,5 level=DIE [] groups: 1:{ span=1 }, 5:{ span=5 } [] domain-1: span=0-2,4-6 level=NUMA [] groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 0:{ span=0,4 mask=0,4 cap=2048 } [] domain-2: span=0-7 level=NUMA [] groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 } Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Add sched_group_capacity debuggingPeter Zijlstra
Add sgc::id to easier spot domain construction issues. Take the opportunity to slightly rework the group printing, because adding more "(id: %d)" strings makes the entire thing very hard to read. Also the individual groups are very hard to separate, so add explicit visual grouping, which allows replacing all the "(%s: %d)" format things with shorter "%s=%d" variants. Then fix up some inconsistencies in surrounding prints for domains. The end result looks like: [] CPU0 attaching sched-domain(s): [] domain-0: span=0,4 level=DIE [] groups: 0:{ span=0 }, 4:{ span=4 } [] domain-1: span=0-1,3-5,7 level=NUMA [] groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 } [] domain-2: span=0-7 level=NUMA [] groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 } Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Small cleanupPeter Zijlstra
Move the allocation of topology specific cpumasks into the topology code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Fix overlapping sched_group_maskPeter Zijlstra
The point of sched_group_mask is to select those CPUs from sched_group_cpus that can actually arrive at this balance domain. The current code gets it wrong, as can be readily demonstrated with a topology like: node 0 1 2 3 0: 10 20 30 20 1: 20 10 20 30 2: 30 20 10 20 3: 20 30 20 10 Where (for example) domain 1 on CPU1 ends up with a mask that includes CPU0: [] CPU1 attaching sched-domain: [] domain 0: span 0-2 level NUMA [] groups: 1 (mask: 1), 2, 0 [] domain 1: span 0-3 level NUMA [] groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072) This causes sched_balance_cpu() to compute the wrong CPU and consequently should_we_balance() will terminate early resulting in missed load-balance opportunities. The fixed topology looks like: [] CPU1 attaching sched-domain: [] domain 0: span 0-2 level NUMA [] groups: 1 (mask: 1), 2, 0 [] domain 1: span 0-3 level NUMA [] groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072) (note: this relies on OVERLAP domains to always have children, this is true because the regular topology domains are still here -- this is before degenerate trimming) Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Remove FORCE_SD_OVERLAPPeter Zijlstra
Its an obsolete debug mechanism and future code wants to rely on properties this undermines. Namely, it would be good to assume that SD_OVERLAP domains have children, but if we build the entire hierarchy with SD_OVERLAP this is obviously false. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Move comment about asymmetric node setupsLauro Ramos Venancio
Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: lwang@redhat.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1492717903-5195-4-git-send-email-lvenanci@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Optimize build_group_mask()Lauro Ramos Venancio
The group mask is always used in intersection with the group CPUs. So, when building the group mask, we don't have to care about CPUs that are not part of the group. Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: lwang@redhat.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Verify the first group matches the child domainPeter Zijlstra
We want sched_groups to be sibling child domains (or individual CPUs when there are no child domains). Furthermore, since the first group of a domain should include the CPU of that domain, the first group of each domain should match the child domain. Verify this is indeed so. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/debug: Print the scheduler topology group maskPeter Zijlstra
In order to determine the balance_cpu (for should_we_balance()) we need the sched_group_mask() for overlapping domains. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Simplify build_overlap_sched_groups()Peter Zijlstra
Now that the first group will always be the previous domain of this @cpu this can be simplified. In fact, writing the code now removed should've been a big clue I was doing it wrong :/ Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Fix building of overlapping sched-groupsPeter Zijlstra
When building the overlapping groups, we very obviously should start with the previous domain of _this_ @cpu, not CPU-0. This can be readily demonstrated with a topology like: node 0 1 2 3 0: 10 20 30 20 1: 20 10 20 30 2: 30 20 10 20 3: 20 30 20 10 Where (for example) CPU1 ends up generating the following nonsensical groups: [] CPU1 attaching sched-domain: [] domain 0: span 0-2 level NUMA [] groups: 1 2 0 [] domain 1: span 0-3 level NUMA [] groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072) Where the fact that domain 1 doesn't include a group with span 0-2 is the obvious fail. With patch this looks like: [] CPU1 attaching sched-domain: [] domain 0: span 0-2 level NUMA [] groups: 1 0 2 [] domain 1: span 0-3 level NUMA [] groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072) Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/fair, cpumask: Export for_each_cpu_wrap()Peter Zijlstra
More users for for_each_cpu_wrap() have appeared. Promote the construct to generic cpumask interface. The implementation is slightly modified to reduce arguments. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Lauro Ramos Venancio <lvenanci@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: lwang@redhat.com Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/topology: Refactor function build_overlap_sched_groups()Lauro Ramos Venancio
Create functions build_group_from_child_sched_domain() and init_overlap_sched_group(). No functional change. Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1492091769-19879-2-git-send-email-lvenanci@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/clock: Print a warning recommending 'tsc=unstable'Peter Zijlstra
With our switch to stable delayed until late_initcall(), the most likely cause of hitting mark_tsc_unstable() is the watchdog. The watchdog typically only triggers when creative BIOS'es fiddle with the TSC to hide SMI latency. Since the watchdog can only detect TSC fiddling after the fact all TSC clocks (including userspace GTOD) can already have reported funny values. The only way to fully avoid this, is manually marking the TSC unstable at boot. Suggest people do this on their broken systems. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/clock: Use late_initcall() instead of sched_init_smp()Peter Zijlstra
Core2 marks its TSC unstable in ACPI Processor Idle, which is probed after sched_init_smp(). Luckily it appears both acpi_processor and intel_idle (which has a similar check) are mandatory built-in. This means we can delay switching to stable until after these drivers have ran (if they were modules, this would be impossible). Delay the stable switch to late_initcall() to allow these drivers to mark TSC unstable and avoid difficult stable->unstable transitions. Reported-by: Lofstedt, Marta <marta.lofstedt@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15cpuidle: Fix idle time trackingPeter Zijlstra
Ville reported that on his Core2, which has TSC stop in idle, we would always report very short idle durations. He tracked this down to commit: e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()") which replaces ktime_get() with local_clock(). Add a sched_clock_idle_wakeup_event() call, which will re-sync the clock with ktime_get_ns() when TSC is unstable and no-op otherwise. Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Fixes: e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/clock: Remove watchdog touchingPeter Zijlstra
Commit: 2bacec8c318c ("sched: touch softlockup watchdog after idling") introduced the touch_softlockup_watchdog_sched() call without justification and I feel sched_clock management is not the right place, it should only be concerned with producing semi coherent time. If this causes watchdog thingies, we can find a better place. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/clock: Remove unused argument to sched_clock_idle_wakeup_event()Peter Zijlstra
The argument to sched_clock_idle_wakeup_event() has not been used in a long time. Remove it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15x86/tsc, sched/clock, clocksource: Use clocksource watchdog to provide ↵Peter Zijlstra
stable sync points Currently we keep sched_clock_tick() active for stable TSC in order to keep the per-CPU state semi up-to-date. The (obvious) problem is that by the time we detect TSC is borked, our per-CPU state is also borked. So hook into the clocksource watchdog and call a method after we've found it to still be stable. There's the obvious race where the TSC goes wonky between finding it stable and us running the callback, but closing that is too much work and not really worth it, since we're already detecting TSC wobbles after the fact, so we cannot, per definition, fully avoid funny clock values. And since the watchdog runs less often than the tick, this is also an optimization. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/clock: Initialize all per-CPU state before switching (back) to unstablePeter Zijlstra
In preparation for not keeping the sched_clock_tick() active for stable TSC, we need to explicitly initialize all per-CPU state before switching back to unstable. Note: this patch looses the __gtod_offset calculation; it will be restored in the next one. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/cfs: Make util/load_avg more stableVincent Guittot
In the current implementation of load/util_avg, we assume that the ongoing time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX, even if part of the time segment still remains to run. As a consequence, this remaining part is considered as idle time and generates unexpected variations of util_avg of a busy CPU in the range [1002..1024[ whereas util_avg should stay at 1023. In order to keep the metric stable, we should not consider the ongoing time segment when computing load/util_avg but only the segments that have already fully elapsed. But to not consider the current time segment adds unwanted latency in the load/util_avg responsivness especially when the time is scaled instead of the contribution. Instead of waiting for the current time segment to have fully elapsed before accounting it in load/util_avg, we can already account the elapsed part but change the range used to compute load/util_avg accordingly. At the very beginning of a new time segment, the past segments have been decayed and the max value is LOAD_AVG_MAX*y. At the very end of the current time segment, the max value becomes: LOAD_AVG_MAX*y + 1024(us) (== LOAD_AVG_MAX) In fact, the max value is: LOAD_AVG_MAX*y + sa->period_contrib at any time in the time segment. Taking advantage of the fact that: LOAD_AVG_MAX*y == LOAD_AVG_MAX-1024 the range becomes [0..LOAD_AVG_MAX-1024+sa->period_contrib]. As the elapsed part is already accounted in load/util_sum, we update the max value according to the current position in the time segment instead of removing its contribution. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten.Rasmussen@arm.com Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: pjt@google.com Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1493188076-2767-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15sched/core: Call __schedule() from do_idle() without enabling preemptionSteven Rostedt (VMware)
I finally got around to creating trampolines for dynamically allocated ftrace_ops with using synchronize_rcu_tasks(). For users of the ftrace function hook callbacks, like perf, that allocate the ftrace_ops descriptor via kmalloc() and friends, ftrace was not able to optimize the functions being traced to use a trampoline because they would also need to be allocated dynamically. The problem is that they cannot be freed when CONFIG_PREEMPT is set, as there's no way to tell if a task was preempted on the trampoline. That was before Paul McKenney implemented synchronize_rcu_tasks() that would make sure all tasks (except idle) have scheduled out or have entered user space. While testing this, I triggered this bug: BUG: unable to handle kernel paging request at ffffffffa0230077 ... RIP: 0010:0xffffffffa0230077 ... Call Trace: schedule+0x5/0xe0 schedule_preempt_disabled+0x18/0x30 do_idle+0x172/0x220 What happened was that the idle task was preempted on the trampoline. As synchronize_rcu_tasks() ignores the idle thread, there's nothing that lets ftrace know that the idle task was preempted on a trampoline. The idle task shouldn't need to ever enable preemption. The idle task is simply a loop that calls schedule or places the cpu into idle mode. In fact, having preemption enabled is inefficient, because it can happen when idle is just about to call schedule anyway, which would cause schedule to be called twice. Once for when the interrupt came in and was returning back to normal context, and then again in the normal path that the idle loop is running in, which would be pointless, as it had already scheduled. The only reason schedule_preempt_disable() enables preemption is to be able to call sched_submit_work(), which requires preemption enabled. As this is a nop when the task is in the RUNNING state, and idle is always in the running state, there's no reason that idle needs to enable preemption. But that means it cannot use schedule_preempt_disable() as other callers of that function require calling sched_submit_work(). Adding a new function local to kernel/sched/ that allows idle to call the scheduler without enabling preemption, fixes the synchronize_rcu_tasks() issue, as well as removes the pointless spurious schedule calls caused by interrupts happening in the brief window where preemption is enabled just before it calls schedule. Reviewed: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-10Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main changes are: - Debloat RCU headers - Parallelize SRCU callback handling (plus overlapping patches) - Improve the performance of Tree SRCU on a CPU-hotplug stress test - Documentation updates - Miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) rcu: Open-code the rcu_cblist_n_lazy_cbs() function rcu: Open-code the rcu_cblist_n_cbs() function rcu: Open-code the rcu_cblist_empty() function rcu: Separately compile large rcu_segcblist functions srcu: Debloat the <linux/rcu_segcblist.h> header srcu: Adjust default auto-expediting holdoff srcu: Specify auto-expedite holdoff time srcu: Expedite first synchronize_srcu() when idle srcu: Expedited grace periods with reduced memory contention srcu: Make rcutorture writer stalls print SRCU GP state srcu: Exact tracking of srcu_data structures containing callbacks srcu: Make SRCU be built by default srcu: Fix Kconfig botch when SRCU not selected rcu: Make non-preemptive schedule be Tasks RCU quiescent state srcu: Expedite srcu_schedule_cbs_snp() callback invocation srcu: Parallelize callback handling kvm: Move srcu_struct fields to end of struct kvm rcu: Fix typo in PER_RCU_NODE_PERIOD header comment rcu: Use true/false in assignment to bool rcu: Use bool value directly ...
2017-05-05cpufreq: schedutil: use now as reference when aggregating shared policy requestsJuri Lelli
Currently, sugov_next_freq_shared() uses last_freq_update_time as a reference to decide when to start considering CPU contributions as stale. However, since last_freq_update_time is set by the last CPU that issued a frequency transition, this might cause problems in certain cases. In practice, the detection of stale utilization values fails whenever the CPU with such values was the last to update the policy. For example (and please note again that the SCHED_CPUFREQ_RT flag is not the problem here, but only the detection of after how much time that flag has to be considered stale), suppose a policy with 2 CPUs: CPU0 | CPU1 | | RT task scheduled | SCHED_CPUFREQ_RT is set | CPU1->last_update = now | freq transition to max | last_freq_update_time = now | more than TICK_NSEC nsecs | a small CFS wakes up | CPU0->last_update = now1 | delta_ns(CPU0) < TICK_NSEC* | CPU0's util is considered | delta_ns(CPU1) = | last_freq_update_time - | CPU1->last_update = 0 | < TICK_NSEC | CPU1 is still considered | CPU1->SCHED_CPUFREQ_RT is set | we stay at max (until CPU1 | exits from idle) | * delta_ns is actually negative as now1 > last_freq_update_time While last_freq_update_time is a sensible reference for rate limiting, it doesn't seem to be useful for working around stale CPU states. Fix the problem by always considering now (time) as the reference for deciding when CPUs have stale contributions. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-05-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: tty: fix comment for __tty_alloc_driver() init/main: properly align the multi-line comment init/main: Fix double "the" in comment Fix dead URLs to ftp.kernel.org drivers: Clean up duplicated email address treewide: Fix typo in xml/driver-api/basics.xml tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall" selftests/timers: Spelling s/privledges/privileges/ HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/ net: phy: dp83848: Fix Typo UBI: Fix typos Documentation: ftrace.txt: Correct nice value of 120 priority net: fec: Fix typo in error msg and comment treewide: Fix typos in printk
2017-05-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull livepatch updates from Jiri Kosina: - a per-task consistency model is being added for architectures that support reliable stack dumping (extending this, currently rather trivial set, is currently in the works). This extends the nature of the types of patches that can be applied by live patching infrastructure. The code stems from the design proposal made [1] back in November 2014. It's a hybrid of SUSE's kGraft and RH's kpatch, combining advantages of both: it uses kGraft's per-task consistency and syscall barrier switching combined with kpatch's stack trace switching. There are also a number of fallback options which make it quite flexible. Most of the heavy lifting done by Josh Poimboeuf with help from Miroslav Benes and Petr Mladek [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz - module load time patch optimization from Zhou Chengming - a few assorted small fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: add missing printk newlines livepatch: Cancel transition a safe way for immediate patches livepatch: Reduce the time of finding module symbols livepatch: make klp_mutex proper part of API livepatch: allow removal of a disabled patch livepatch: add /proc/<pid>/patch_state livepatch: change to a per-task consistency model livepatch: store function sizes livepatch: use kstrtobool() in enabled_store() livepatch: move patching functions into patch.c livepatch: remove unnecessary object loaded check livepatch: separate enabled and patched states livepatch/s390: add TIF_PATCH_PENDING thread flag livepatch/s390: reorganize TIF thread flag bits livepatch/powerpc: add TIF_PATCH_PENDING thread flag livepatch/x86: add TIF_PATCH_PENDING thread flag livepatch: create temporary klp_update_patch_state() stub x86/entry: define _TIF_ALLWORK_MASK flags explicitly stacktrace/x86: add function for detecting reliable stack traces
2017-05-01Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - a big round of FUTEX_UNLOCK_PI improvements, fixes, cleanups and general restructuring - lockdep updates such as new checks for lock_downgrade() - introduce the new atomic_try_cmpxchg() locking API and use it to optimize refcount code generation - ... plus misc fixes, updates and cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) MAINTAINERS: Add FUTEX SUBSYSTEM futex: Clarify mark_wake_futex memory barrier usage futex: Fix small (and harmless looking) inconsistencies futex: Avoid freeing an active timer rtmutex: Plug preempt count leak in rt_mutex_futex_unlock() rtmutex: Fix more prio comparisons rtmutex: Fix PI chain order integrity sched,tracing: Update trace_sched_pi_setprio() sched/rtmutex: Refactor rt_mutex_setprio() rtmutex: Clean up sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update sched/rtmutex/deadline: Fix a PI crash for deadline tasks rtmutex: Deboost before waking up the top waiter locking/ww-mutex: Limit stress test to 2 seconds locking/atomic: Fix atomic_try_cmpxchg() semantics lockdep: Fix per-cpu static objects futex: Drop hb->lock before enqueueing on the rtmutex futex: Futex_unlock_pi() determinism futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() ...
2017-05-01Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - another round of rq-clock handling debugging, robustization and fixes - PELT accounting improvements - CPU hotplug related ->cpus_allowed affinity handling fixes all around the tree - ... plus misc fixes, cleanups and updates" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) sched/x86: Update reschedule warning text crypto: N2 - Replace racy task affinity logic cpufreq/sparc-us2e: Replace racy task affinity logic cpufreq/sparc-us3: Replace racy task affinity logic cpufreq/sh: Replace racy task affinity logic cpufreq/ia64: Replace racy task affinity logic ACPI/processor: Replace racy task affinity logic ACPI/processor: Fix error handling in __acpi_processor_start() sparc/sysfs: Replace racy task affinity logic powerpc/smp: Replace open coded task affinity logic ia64/sn/hwperf: Replace racy task affinity logic ia64/salinfo: Replace racy task affinity logic workqueue: Provide work_on_cpu_safe() ia64/topology: Remove cpus_allowed manipulation sched/fair: Move the PELT constants into a generated header sched/fair: Increase PELT accuracy for small tasks sched/fair: Fix comments sched/Documentation: Add 'sched-pelt' tool sched/fair: Fix corner case in __accumulate_sum() sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags() ...
2017-05-01Merge tag 'pm-4.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "This time the majority of changes go to the cpufreq subsystem (and to the intel_pstate driver in particular) and there are some updates in the generic power domains framework, cpuidle, tools and a couple of other places. One thing worth mentioning is that the intel_pstate's sysfs interface has been reworked to be more consistent with the general expectations of the cpufreq core and less confusing, hopefully for the better. Also, we have a new cpufreq driver for Tegra186 and new hardware support in intel_pstata and the Mediatek cpufreq driver. Apart from that, the AnalyzeSuspend utility for system suspend profiling gets a companion called AnalyzeBoot for the analogous profiling of system boot and they both go into one place under tools/power/pm-graph/. The rest is mostly fixes, cleanups and code reorganization. Specifics: - Rework the intel_pstate driver's sysfs interface to make it more straightforward and more intuitive (Rafael Wysocki). - Make intel_pstate support all processors which advertise HWP (hardware-managed P-states) to the kernel in all operation modes and make it use the load-based P-state selection algorithm on a wider range of systems in the active mode (Rafael Wysocki). - Add cpufreq driver for Tegra186 (Mikko Perttunen). - Add support for Gemini Lake SoCs to intel_pstate (David Box). - Add support for MT8176 and MT817x to the Mediatek cpufreq driver and clean up that driver a bit (Daniel Kurtz). - Clean up intel_pstate and optimize it slightly (Rafael Wysocki). - Update the schedutil cpufreq governor, mostly to fix a couple of issues with it related to specific workloads, and rework its sysfs tunable and initialization a bit (Rafael Wysocki, Viresh Kumar). - Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers (Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar, YuanTian Tang). - Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert Uytterhoeven). - Add support for "always on" power domains to the genpd (generic power domains) framework and clean up that code somewhat (Ulf Hansson, Lina Iyer, Viresh Kumar). - Fix minor issues in the powernv cpuidle driver and clean it up (Anton Blanchard, Gautham Shenoy). - Move the AnalyzeSuspend utility under tools/power/pm-graph/ and add an analogous boot-profiling utility called AnalyzeBoot to it (Todd Brandt). - Add rk3328 support to the rockchip-io AVS (Adaptive Voltage Scaling) driver (David Wu). - Fix minor issues in the cpuidle core, the intel_pstate_tracer utility, the devfreq framework and the PM core documentation (Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski)" * tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (56 commits) PM / runtime: Document autosuspend-helper side effects PM / runtime: Fix autosuspend documentation tools: power: pm-graph: Package makefile and man pages tools: power: pm-graph: AnalyzeBoot v2.0 tools: power: pm-graph: AnalyzeSuspend v4.6 cpufreq: Add Tegra186 cpufreq driver cpufreq: imx6q: Fix error handling code cpufreq: imx6q: Set max suspend_freq to avoid changes during suspend cpufreq: imx6q: Fix handling EPROBE_DEFER from regulator cpuidle: powernv: Avoid a branch in the core snooze_loop() loop cpuidle: powernv: Don't continually set thread priority in snooze_loop() cpuidle: powernv: Don't bounce between low and very low thread priority cpuidle: cpuidle-cps: remove unused variable tools/power/x86/intel_pstate_tracer: Adjust directory ownership cpufreq: schedutil: Use policy-dependent transition delays cpufreq: schedutil: Reduce frequencies slower PM / devfreq: Move struct devfreq_governor to devfreq directory PM / Domains: Ignore domain-idle-states that are not compatible cpufreq: intel_pstate: Add support for Gemini Lake powernv-cpuidle: Validate DT property array size ...
2017-05-01Merge branch 'for-4.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing major. Two notable fixes are Li's second stab at fixing the long-standing race condition in the mount path and suppression of spurious warning from cgroup_get(). All other changes are trivial" * 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: mark cgroup_get() with __maybe_unused cgroup: avoid attaching a cgroup root to two different superblocks, take 2 cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc() cgroup: move cgroup_subsys_state parent field for cache locality cpuset: Remove cpuset_update_active_cpus()'s parameter. cgroup: switch to BUG_ON() cgroup: drop duplicate header nsproxy.h kernel: convert css_set.refcount from atomic_t to refcount_t kernel: convert cgroup_namespace.count from atomic_t to refcount_t
2017-04-28Merge schedutil governor updates for v4.12.Rafael J. Wysocki
2017-04-27sched/cputime: Fix ksoftirqd cputime accounting regressionFrederic Weisbecker
irq_time_read() returns the irqtime minus the ksoftirqd time. This is necessary because irq_time_read() is used to substract the IRQ time from the sum_exec_runtime of a task. If we were to include the softirq time of ksoftirqd, this task would substract its own CPU time everytime it updates ksoftirqd->sum_exec_runtime which would therefore never progress. But this behaviour got broken by: a499a5a14db ("sched/cputime: Increment kcpustat directly on irqtime account") ... which now includes ksoftirqd softirq time in the time returned by irq_time_read(). This has resulted in wrong ksoftirqd cputime reported to userspace through /proc/stat and thus "top" not showing ksoftirqd when it should after intense networking load. ksoftirqd->stime happens to be correct but it gets scaled down by sum_exec_runtime through task_cputime_adjusted(). To fix this, just account the strict IRQ time in a separate counter and use it to report the IRQ time. Reported-and-tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Documentation updates. - Miscellaneous fixes. - Parallelize SRCU callback handling (plus overlapping patches). Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-21rcu: Make non-preemptive schedule be Tasks RCU quiescent statePaul E. McKenney
Currently, a call to schedule() acts as a Tasks RCU quiescent state only if a context switch actually takes place. However, just the call to schedule() guarantees that the calling task has moved off of whatever tracing trampoline that it might have been one previously. This commit therefore plumbs schedule()'s "preempt" parameter into rcu_note_context_switch(), which then records the Tasks RCU quiescent state, but only if this call to schedule() was -not- due to a preemption. To avoid adding overhead to the common-case context-switch path, this commit hides the rcu_note_context_switch() check under an existing non-common-case check. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-17cpufreq: schedutil: Use policy-dependent transition delaysRafael J. Wysocki
Make the schedutil governor take the initial (default) value of the rate_limit_us sysfs attribute from the (new) transition_delay_us policy parameter (to be set by the scaling driver). That will allow scaling drivers to make schedutil use smaller default values of rate_limit_us and reduce the default average time interval between consecutive frequency changes. Make intel_pstate set transition_delay_us to 500. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-04-14Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14sched/fair: Move the PELT constants into a generated headerPeter Zijlstra
Now that we have a tool to generate the PELT constants in C form, use its output as a separate header. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14sched/fair: Increase PELT accuracy for small tasksPeter Zijlstra
We truncate (and loose) the lower 10 bits of runtime in ___update_load_avg(), this means there's a consistent bias to under-account tasks. This is esp. significant for small tasks. Cure this by only forwarding last_update_time to the point we've actually accounted for, leaving the remainder for the next time. Reported-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>