summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2017-08-25sched/topology: Improve commentsPeter Zijlstra
Mike provided a better comment for destroy_sched_domain() ... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25sched/topology: Fix memory leak in __sdt_alloc()Shu Wang
Found this issue by kmemleak: the 'sg' and 'sgc' pointers from __sdt_alloc() might be leaked as each domain holds many groups' ref, but in destroy_sched_domain(), it only declined the first group ref. Onlining and offlining a CPU can trigger this leak, and cause OOM. Reproducer for my 6 CPUs machine: while true do echo 0 > /sys/devices/system/cpu/cpu5/online; echo 1 > /sys/devices/system/cpu/cpu5/online; done unreferenced object 0xffff88007d772a80 (size 64): comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s) hex dump (first 32 bytes): c0 22 77 7d 00 88 ff ff 02 00 00 00 01 00 00 00 ."w}............ 40 2a 77 7d 00 88 ff ff 00 00 00 00 00 00 00 00 @*w}............ backtrace: [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280 [<ffffffff810d94a8>] build_sched_domains+0x1e8/0xf20 [<ffffffff810da674>] partition_sched_domains+0x304/0x360 [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40 [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0 [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400 [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0 [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0 [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160 [<ffffffff810af5d9>] kthread+0x109/0x140 [<ffffffff81770e45>] ret_from_fork+0x25/0x30 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff88007d772a40 (size 64): comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s) hex dump (first 32 bytes): 03 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 ................ 00 04 00 00 00 00 00 00 4f 3c fc ff 00 00 00 00 ........O<...... backtrace: [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280 [<ffffffff810da16d>] build_sched_domains+0xead/0xf20 [<ffffffff810da674>] partition_sched_domains+0x304/0x360 [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40 [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0 [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400 [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0 [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0 [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160 [<ffffffff810af5d9>] kthread+0x109/0x140 [<ffffffff81770e45>] ret_from_fork+0x25/0x30 [<ffffffffffffffff>] 0xffffffffffffffff Reported-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Shu Wang <shuwang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Chunyu Hu <chuhu@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: liwang@redhat.com Link: http://lkml.kernel.org/r/1502351536-9108-1-git-send-email-shuwang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-21Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Removal of spin_unlock_wait() - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes - CPU-hotplug fixes - Miscellaneous non-RCU fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-18cpufreq: schedutil: Always process remote callback with slow switchingViresh Kumar
The frequency update from the utilization update handlers can be divided into two parts: (A) Finding the next frequency (B) Updating the frequency While any CPU can do (A), (B) can be restricted to a group of CPUs only, depending on the current platform. For platforms where fast cpufreq switching is possible, both (A) and (B) are always done from the same CPU and that CPU should be capable of changing the frequency of the target CPU. But for platforms where fast cpufreq switching isn't possible, after doing (A) we wake up a kthread which will eventually do (B). This kthread is already bound to the right set of CPUs, i.e. only those which can change the frequency of CPUs of a cpufreq policy. And so any CPU can actually do (A) in this case, as the frequency is updated from the right set of CPUs only. Check cpufreq_can_do_remote_dvfs() only for the fast switching case. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-18cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarilyViresh Kumar
Utilization update callbacks are now processed remotely, even on the CPUs that don't share cpufreq policy with the target CPU (if dvfs_possible_from_any_cpu flag is set). But in non-fast switch paths, the frequency is changed only from one of policy->related_cpus. This happens because the kthread which does the actual update is bound to a subset of CPUs (i.e. related_cpus). Allow frequency to be remotely updated as well (i.e. call __cpufreq_driver_target()) if dvfs_possible_from_any_cpu flag is set. Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-17Merge branches 'doc.2017.08.17a', 'fixes.2017.08.17a', ↵Paul E. McKenney
'hotplug.2017.07.25b', 'misc.2017.08.17a', 'spin_unlock_wait_no.2017.08.17a', 'srcu.2017.07.27c' and 'torture.2017.07.24c' into HEAD doc.2017.08.17a: Documentation updates. fixes.2017.08.17a: RCU fixes. hotplug.2017.07.25b: CPU-hotplug updates. misc.2017.08.17a: Miscellaneous fixes outside of RCU (give or take conflicts). spin_unlock_wait_no.2017.08.17a: Remove spin_unlock_wait(). srcu.2017.07.27c: SRCU updates. torture.2017.07.24c: Torture-test updates.
2017-08-17completion: Replace spin_unlock_wait() with lock/unlock pairPaul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in completion_done() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock will be held only the wakeup happens really quickly. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-17membarrier: Provide expedited private commandMathieu Desnoyers
Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built from all runqueues for which current thread's mm is the same as the thread calling sys_membarrier. It executes faster than the non-expedited variant (no blocking). It also works on NOHZ_FULL configurations. Scheduler-wise, it requires a memory barrier before and after context switching between processes (which have different mm). The memory barrier before context switch is already present. For the barrier after context switch: * Our TSO archs can do RELEASE without being a full barrier. Look at x86 spin_unlock() being a regular STORE for example. But for those archs, all atomics imply smp_mb and all of them have atomic ops in switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full barrier. * From all weakly ordered machines, only ARM64 and PPC can do RELEASE, the rest does indeed do smp_mb(), so there the spin_unlock() is a full barrier and we're good. * ARM64 has a very heavy barrier in switch_to(), which suffices. * PPC just removed its barrier from switch_to(), but appears to be talking about adding something to switch_mm(). So add a smp_mb__after_unlock_lock() for now, until this is settled on the PPC side. Changes since v3: - Properly document the memory barriers provided by each architecture. Changes since v2: - Address comments from Peter Zijlstra, - Add smp_mb__after_unlock_lock() after finish_lock_switch() in finish_task_switch() to add the memory barrier we need after storing to rq->curr. This is much simpler than the previous approach relying on atomic_dec_and_test() in mmdrop(), which actually added a memory barrier in the common case of switching between userspace processes. - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full kernel, rather than having the whole membarrier system call returning -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full. Adapt the CMD_QUERY mask accordingly. Changes since v1: - move membarrier code under kernel/sched/ because it uses the scheduler runqueue, - only add the barrier when we switch from a kernel thread. The case where we switch from a user-space thread is already handled by the atomic_dec_and_test() in mmdrop(). - add a comment to mmdrop() documenting the requirement on the implicit memory barrier. CC: Peter Zijlstra <peterz@infradead.org> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com> CC: Boqun Feng <boqun.feng@gmail.com> CC: Andrew Hunter <ahh@google.com> CC: Maged Michael <maged.michael@gmail.com> CC: gromer@google.com CC: Avi Kivity <avi@scylladb.com> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: Paul Mackerras <paulus@samba.org> CC: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Dave Watson <davejwatson@fb.com>
2017-08-16sched/completion: Document that reinit_completion() must be called after ↵Steven Rostedt
complete_all() The complete_all() function modifies the completion's "done" variable to UINT_MAX, and no other caller (wait_for_completion(), etc) will modify it back to zero. That means that any call to complete_all() must have a reinit_completion() before that completion can be used again. Document this fact by the complete_all() function. Also document that completion_done() will always return true if complete_all() is called. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170816131202.195c2f4b@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-11sched: Replace spin_unlock_wait() with lock/unlock pairPaul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in do_task_dead() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock is this tasks ->pi_lock, and this is called only after the task exits. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> [ paulmck: Drop smp_mb() based on Peter Zijlstra's analysis: http://lkml.kernel.org/r/20170811144150.26gowhxte7ri5fpk@hirez.programming.kicks-ass.net ]
2017-08-11PM / s2idle: Rename ->enter_freeze to ->enter_s2idleRafael J. Wysocki
Rename the ->enter_freeze cpuidle driver callback to ->enter_s2idle to make it clear that it is used for entering suspend-to-idle and rename the related functions, variables and so on accordingly. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-11PM / s2idle: Rename freeze_state enum and related itemsRafael J. Wysocki
Rename the freeze_state enum representing the suspend-to-idle state machine states to s2idle_states and rename the related variables and functions accordingly. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-10sched/autogroup: Fix error reporting printk text in autogroup_create()Anshuman Khandual
Its kzalloc() not kmalloc() which has failed earlier. While here remove a redundant empty line. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170802084300.29487-1-khandual@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/fair: Fix wake_affine() for !NUMA_BALANCINGPeter Zijlstra
In commit: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()") Rik changed wake_affine to consider NUMA information when balancing between LLC domains. There are a number of problems here which this patch tries to address: - LLC < NODE; in this case we'd use the wrong information to balance - !NUMA_BALANCING: in this case, the new code doesn't do any balancing at all - re-computes the NUMA data for every wakeup, this can mean iterating up to 64 CPUs for every wakeup. - default affine wakeups inside a cache We address these by saving the load/capacity values for each sched_domain during regular load-balance and using these values in wake_affine_llc(). The obvious down-side to using cached values is that they can be too old and poorly reflect reality. But this way we can use LLC wide information and thus not rely on assuming LLC matches NODE. We also don't rely on NUMA_BALANCING nor do we have to aggegate two nodes (or even cache domains) worth of CPUs for each wakeup. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()") [ Minor readability improvements. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking/lockdep: Apply crossrelease to completionsByungchul Park
Although wait_for_completion() and its family can cause deadlock, the lock correctness validator could not be applied to them until now, because things like complete() are usually called in a different context from the waiting context, which violates lockdep's assumption. Thanks to CONFIG_LOCKDEP_CROSSRELEASE, we can now apply the lockdep detector to those completion operations. Applied it. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: boqun.feng@gmail.com Cc: kernel-team@lge.com Cc: kirill@shutemov.name Cc: npiggin@gmail.com Cc: walken@google.com Cc: willy@infradead.org Link: http://lkml.kernel.org/r/1502089981-21272-10-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking: Introduce smp_mb__after_spinlock()Peter Zijlstra
Since its inception, our understanding of ACQUIRE, esp. as applied to spinlocks, has changed somewhat. Also, I wonder if, with a simple change, we cannot make it provide more. The problem with the comment is that the STORE done by spin_lock isn't itself ordered by the ACQUIRE, and therefore a later LOAD can pass over it and cross with any prior STORE, rendering the default WMB insufficient (pointed out by Alan). Now, this is only really a problem on PowerPC and ARM64, both of which already defined smp_mb__before_spinlock() as a smp_mb(). At the same time, we can get a much stronger construct if we place that same barrier _inside_ the spin_lock(). In that case we upgrade the RCpc spinlock to an RCsc. That would make all schedule() calls fully transitive against one another. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/wait: Remove the lockless swait_active() check in swake_up*()Boqun Feng
Steven Rostedt reported a potential race in RCU core because of swake_up(): CPU0 CPU1 ---- ---- __call_rcu_core() { spin_lock(rnp_root) need_wake = __rcu_start_gp() { rcu_start_gp_advanced() { gp_flags = FLAG_INIT } } rcu_gp_kthread() { swait_event_interruptible(wq, gp_flags & FLAG_INIT) { spin_lock(q->lock) *fetch wq->task_list here! * list_add(wq->task_list, q->task_list) spin_unlock(q->lock); *fetch old value of gp_flags here * spin_unlock(rnp_root) rcu_gp_kthread_wake() { swake_up(wq) { swait_active(wq) { list_empty(wq->task_list) } * return false * if (condition) * false * schedule(); In this case, a wakeup is missed, which could cause the rcu_gp_kthread waits for a long time. The reason of this is that we do a lockless swait_active() check in swake_up(). To fix this, we can either 1) add a smp_mb() in swake_up() before swait_active() to provide the proper order or 2) simply remove the swait_active() in swake_up(). The solution 2 not only fixes this problem but also keeps the swait and wait API as close as possible, as wake_up() doesn't provide a full barrier and doesn't do a lockless check of the wait queue either. Moreover, there are users already using swait_active() to do their quick checks for the wait queues, so it make less sense that swake_up() and swake_up_all() do this on their own. This patch then removes the lockless swait_active() check in swake_up() and swake_up_all(). Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Krister Johansen <kjlx@templeofstupid.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170615041828.zk3a3sfyudm5p6nl@tardis Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/debug: Intruduce task_state_to_char() helper functionXie XiuQi
Now that we have more than one place to get the task state, intruduce the task_state_to_char() helper function to save some code. No functionality changed. Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <cj.chengjian@huawei.com> Cc: <huawei.libin@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1502095463-160172-3-git-send-email-xiexiuqi@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/debug: Show task state in /proc/sched_debugXie XiuQi
Currently we print the runnable task in /proc/sched_debug, but there is no task state information. We don't know which task is in the runqueue and which task is sleeping. Add task state in the runnable task list, like this: runnable tasks: S task PID tree-key switches prio wait-time sum-exec sum-sleep ----------------------------------------------------------------------------------------------------------- S watchdog/239 1452 -11.917445 2811 0 0.000000 8.949306 0.000000 7 0 / S migration/239 1453 20686.367740 8 0 0.000000 16215.720897 0.000000 7 0 / S ksoftirqd/239 1454 115383.841071 12 120 0.000000 0.200683 0.000000 7 0 / >R test 21287 4872.190970 407 120 0.000000 4874.911790 0.000000 7 0 /autogroup-150 R test 21288 4868.385454 401 120 0.000000 3672.341489 0.000000 7 0 /autogroup-150 R test 21289 4868.326776 384 120 0.000000 3424.934159 0.000000 7 0 /autogroup-150 Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <cj.chengjian@huawei.com> Cc: <huawei.libin@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1502095463-160172-2-git-send-email-xiexiuqi@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/debug: Use task_pid_nr_ns in /proc/$pid/schedAleksa Sarai
It appears as though the addition of the PID namespace did not update the output code for /proc/*/sched, which resulted in it providing PIDs that were not self-consistent with the /proc mount. This additionally made it trivial to detect whether a process was inside &init_pid_ns from userspace, making container detection trivial: https://github.com/jessfraz/amicontained This leads to situations such as: % unshare -pmf % mount -t proc proc /proc % head -n1 /proc/1/sched head (10047, #threads: 1) Fix this by just using task_pid_nr_ns for the output of /proc/*/sched. All of the other uses of task_pid_nr in kernel/sched/debug.c are from a sysctl context and thus don't need to be namespaced. Signed-off-by: Aleksa Sarai <asarai@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Jess Frazelle <acidburn@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: cyphar@cyphar.com Link: http://lkml.kernel.org/r/20170806044141.5093-1-asarai@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/core: Remove unnecessary initialization init_idle_bootup_task()Cheng Jian
init_idle_bootup_task( ) is called in rest_init( ) to switch the scheduling class of the boot thread to the idle class. the function only sets: idle->sched_class = &idle_sched_class; which has been set in init_idle() called by sched_init(): /* * The idle tasks have their own, simple scheduling class: */ idle->sched_class = &idle_sched_class; We've already set the boot thread to idle class in start_kernel()->sched_init()->init_idle() so it's unnecessary to set it again in start_kernel()->rest_init()->init_idle_bootup_task() Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <huawei.libin@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1501838377-109720-1-git-send-email-cj.chengjian@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/deadline: Change return value of cpudl_find()Byungchul Park
cpudl_find() users are only interested in knowing if suitable CPU(s) were found or not (and then they look at later_mask to know which). Change cpudl_find() return type accordingly. Aligns with rt code. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <bristot@redhat.com> Cc: <juri.lelli@gmail.com> Cc: <kernel-team@lge.com> Cc: <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1495504859-10960-3-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/deadline: Make find_later_rq() choose a closer CPU in topologyByungchul Park
When cpudl_find() returns any among free_cpus, the CPU might not be closer than others, considering sched domain. For example: this_cpu: 15 free_cpus: 0, 1,..., 14 (== later_mask) best_cpu: 0 topology: 0 --+ +--+ 1 --+ | +-- ... --+ 2 --+ | | +--+ | 3 --+ | ... ... 12 --+ | +--+ | 13 --+ | | +-- ... -+ 14 --+ | +--+ 15 --+ In this case, it would be best to select 14 since it's a free CPU and closest to 15 (this_cpu). However, currently the code selects 0 (best_cpu) even though that's just any among free_cpus. Fix it. This (re)aligns the deadline behaviour with the rt behaviour. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <bristot@redhat.com> Cc: <juri.lelli@gmail.com> Cc: <kernel-team@lge.com> Cc: <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1495504859-10960-2-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/numa: Scale scan period with tasks in group and shared/privateRik van Riel
Running 80 tasks in the same group, or as threads of the same process, results in the memory getting scanned 80x as fast as it would be if a single task was using the memory. This really hurts some workloads. Scale the scan period by the number of tasks in the numa group, and the shared / private ratio, so the average rate at which memory in the group is scanned corresponds roughly to the rate at which a single task would scan its memory. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: jhladky@redhat.com Cc: lvenanci@redhat.com Link: http://lkml.kernel.org/r/20170731192847.23050-3-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/numa: Slow down scan rate if shared faults dominateRik van Riel
The comment above update_task_scan_period() says the scan period should be increased (scanning slows down) if the majority of memory accesses are on the local node, or if the majority of the page accesses are shared with other tasks. However, with the current code, all a high ratio of shared accesses does is slow down the rate at which scanning is made faster. This patch changes things so either lots of shared accesses or lots of local accesses will slow down scanning, and numa scanning is sped up only when there are lots of private faults on remote memory pages. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: jhladky@redhat.com Cc: lvenanci@redhat.com Link: http://lkml.kernel.org/r/20170731192847.23050-2-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/pelt: Fix false running accountingVincent Guittot
The running state is a subset of runnable state which means that running can't be set if runnable (weight) is cleared. There are corner cases where the current sched_entity has been already dequeued but cfs_rq->curr has not been updated yet and still points to the dequeued sched_entity. If ___update_load_avg() is called at that time, weight will be 0 and running will be set which is not possible. This case happens during pick_next_task_fair() when a cfs_rq becomes idles. The current sched_entity has been dequeued so se->on_rq is cleared and cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be cleared when picking the idle thread). Because the cfs_rq becomes idle, idle_balance() is called and ends up to call update_blocked_averages() with these wrong running and runnable states. Add a test in ___update_load_avg() to correct the running state in this case. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Link: http://lkml.kernel.org/r/1498885573-18984-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched: Mark pick_next_task_dl() and build_sched_domain() as staticViresh Kumar
pick_next_task_dl() and build_sched_domain() aren't used outside deadline.c and topology.c. Make them static. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: linaro-kernel@lists.linaro.org Link: http://lkml.kernel.org/r/36e4cbb6210002cadae89920ae97e19e7e513008.1493281605.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/cpupri: Don't re-initialize 'struct cpupri'Viresh Kumar
The 'struct cpupri' passed to cpupri_init() is already initialized to zero. Don't do that again. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: linaro-kernel@lists.linaro.org Link: http://lkml.kernel.org/r/8a71d48c5a077500b6ddc1a41484c0ac8d3aad94.1492065513.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/deadline: Don't re-initialize 'struct cpudl'Viresh Kumar
The 'struct cpudl' passed to cpudl_init() is already initialized to zero. Don't do that again. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: linaro-kernel@lists.linaro.org Link: http://lkml.kernel.org/r/bd4c229806bc96694b15546207afcc221387d2f5.1492065513.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/topology: Drop memset() from init_rootdomain()Viresh Kumar
There are only two callers of init_rootdomain(). One of them passes a global to it and another one sends dynamically allocated root-domain. There is no need to memset the root-domain in the first case as the structure is already reset. Update alloc_rootdomain() to allocate the memory with kzalloc() and remove the memset() call from init_rootdomain(). Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: linaro-kernel@lists.linaro.org Link: http://lkml.kernel.org/r/fc2f6cc90b098040970c85a97046512572d765bc.1492065513.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/fair: Drop always true parameter of update_cfs_rq_load_avg()Viresh Kumar
update_freq is always true and there is no need to pass it to update_cfs_rq_load_avg(). Remove it. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: linaro-kernel@lists.linaro.org Link: http://lkml.kernel.org/r/2d28d295f3f591ede7e931462bce1bda5aaa4896.1495603536.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/fair: Avoid checking cfs_rq->nr_running twiceViresh Kumar
Rearrange pick_next_task_fair() a bit to avoid checking cfs_rq->nr_running twice for the case where FAIR_GROUP_SCHED is enabled and the previous task doesn't belong to the fair class. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: linaro-kernel@lists.linaro.org Link: http://lkml.kernel.org/r/000903ab3df3350943d3271c53615893a230dc95.1495603536.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/fair: Pass 'rq' to weighted_cpuload()Viresh Kumar
weighted_cpuload() uses the cpu number passed to it get pointer to the runqueue. Almost all callers of weighted_cpuload() already have the rq pointer with them and can send that directly to weighted_cpuload(). In some cases the callers actually get the CPU number by doing cpu_of(rq). It would be simpler to pass rq to weighted_cpuload(). Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: linaro-kernel@lists.linaro.org Link: http://lkml.kernel.org/r/b7720627e0576dc29b4ba3f9b6edbc913bb4f684.1495603536.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/core: Reuse put_prev_task()Viresh Kumar
Reuse put_prev_task() instead of copying its implementation. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: linaro-kernel@lists.linaro.org Link: http://lkml.kernel.org/r/e2e50578223d05c5e90a9feb964fe1ec5d09a052.1495603536.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10sched/fair: Call cpufreq update util handlers less frequently on UPViresh Kumar
For SMP systems, update_load_avg() calls the cpufreq update util handlers only for the top level cfs_rq (i.e. rq->cfs). But that is not the case for UP systems. update_load_avg() calls util handler for any cfs_rq for which it is called. This would result in way too many calls from the scheduler to the cpufreq governors when CONFIG_FAIR_GROUP_SCHED is enabled. Reduce the frequency of these calls by copying the behavior from the SMP case, i.e. Only call util handlers for the top level cfs_rq. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: linaro-kernel@lists.linaro.org Fixes: 536bd00cdbb7 ("sched/fair: Fix !CONFIG_SMP kernel cpufreq governor breakage") Link: http://lkml.kernel.org/r/6abf69a2107525885b616a2c1ec03d9c0946171c.1495603536.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10cpufreq: Return 0 from ->fast_switch() on errorsViresh Kumar
CPUFREQ_ENTRY_INVALID is a special symbol which is used to specify that an entry in the cpufreq table is invalid. But using it outside of the scope of the cpufreq table looks a bit incorrect. We can represent an invalid frequency by writing it as 0 instead if we need. Note that it is already done that way for the return value of the ->get() callback. Lets do the same for ->fast_switch() and not use CPUFREQ_ENTRY_INVALID outside of the scope of cpufreq table. Also update the comment over cpufreq_driver_fast_switch() to clearly mention what this returns. None of the drivers return CPUFREQ_ENTRY_INVALID as of now from ->fast_switch() callback and so we don't need to update any of those. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-01sched: cpufreq: Allow remote cpufreq callbacksViresh Kumar
With Android UI and benchmarks the latency of cpufreq response to certain scheduling events can become very critical. Currently, callbacks into cpufreq governors are only made from the scheduler if the target CPU of the event is the same as the current CPU. This means there are certain situations where a target CPU may not run the cpufreq governor for some time. One testcase to show this behavior is where a task starts running on CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the system is configured such that the new tasks should receive maximum demand initially, this should result in CPU0 increasing frequency immediately. But because of the above mentioned limitation though, this does not occur. This patch updates the scheduler core to call the cpufreq callbacks for remote CPUs as well. The schedutil, ondemand and conservative governors are updated to process cpufreq utilization update hooks called for remote CPUs where the remote CPU is managed by the cpufreq policy of the local CPU. The intel_pstate driver is updated to always reject remote callbacks. This is tested with couple of usecases (Android: hackbench, recentfling, galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit octa-core, single policy). Only galleryfling showed minor improvements, while others didn't had much deviation. The reason being that this patch only targets a corner case, where following are required to be true to improve performance and that doesn't happen too often with these tests: - Task is migrated to another CPU. - The task has high demand, and should take the target CPU to higher OPPs. - And the target CPU doesn't call into the cpufreq governor until the next tick. Based on initial work from Steve Muckle. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Saravana Kannan <skannan@codeaurora.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-28sched: Allow migrating kthreads into online but inactive CPUsTejun Heo
Per-cpu workqueues have been tripping CPU affinity sanity checks while a CPU is being offlined. A per-cpu kworker ends up running on a CPU which isn't its target CPU while the CPU is online but inactive. While the scheduler allows kthreads to wake up on an online but inactive CPU, it doesn't allow a running kthread to be migrated to such a CPU, which leads to an odd situation where setting affinity on a sleeping and running kthread leads to different results. Each mem-reclaim workqueue has one rescuer which guarantees forward progress and the rescuer needs to bind itself to the CPU which needs help in making forward progress; however, due to the above issue, while set_cpus_allowed_ptr() succeeds, the rescuer doesn't end up on the correct CPU if the CPU is in the process of going offline, tripping the sanity check and executing the work item on the wrong CPU. This patch updates __migrate_task() so that kthreads can be migrated into an inactive but online CPU. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-07-26cpufreq: schedutil: Use unsigned int for iowait boostJoel Fernandes
Make iowait_boost and iowait_boost_max as unsigned int since its unit is kHz and this is consistent with struct cpufreq_policy. Also change the local variables in sugov_iowait_boost() to match this. Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-26cpufreq: schedutil: Make iowait boost more energy efficientJoel Fernandes
Currently the iowait_boost feature in schedutil makes the frequency go to max on iowait wakeups. This feature was added to handle a case that Peter described where the throughput of operations involving continuous I/O requests [1] is reduced due to running at a lower frequency, however the lower throughput itself causes utilization to be low and hence causing frequency to be low hence its "stuck". Instead of going to max, its also possible to achieve the same effect by ramping up to max if there are repeated in_iowait wakeups happening. This patch is an attempt to do that. We start from a lower frequency (policy->min) and double the boost for every consecutive iowait update until we reach the maximum iowait boost frequency (iowait_boost_max). I ran a synthetic test (continuous O_DIRECT writes in a loop) on an x86 machine with intel_pstate in passive mode using schedutil. In this test the iowait_boost value ramped from 800MHz to 4GHz in 60ms. The patch achieves the desired improved throughput as the existing behavior. [1] https://patchwork.kernel.org/patch/9735885/ Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Joel Fernandes <joelaf@google.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-26cpufreq: schedutil: Set dynamic_switching to trueViresh Kumar
Set dynamic_switching to 'true' to disallow use of schedutil governor for platforms with transition_latency set to CPUFREQ_ETERNAL, as they may not want to do automatic dynamic frequency switching. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-25sched/core: Fix some documentation build warningsJonathan Corbet
The kerneldoc comments for try_to_wake_up_local() were out of date, leading to these documentation build warnings: ./kernel/sched/core.c:2080: warning: No description found for parameter 'rf' ./kernel/sched/core.c:2080: warning: Excess function parameter 'cookie' description in 'try_to_wake_up_local' Update the comment to reflect current reality and give us some peace and quiet. Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/20170724135628.695cecfc@lwn.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-22cpufreq: Use transition_delay_us for legacy governors as wellViresh Kumar
The policy->transition_delay_us field is used only by the schedutil governor currently, and this field describes how fast the driver wants the cpufreq governor to change CPUs frequency. It should rather be a common thing across all governors, as it doesn't have any schedutil dependency here. Create a new helper cpufreq_policy_transition_delay_us() to get the transition delay across all governors. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-21Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A cputime fix and code comments/organization fix to the deadline scheduler" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix confusing comments about selection of top pi-waiter sched/cputime: Don't use smp_processor_id() in preemptible context
2017-07-14Merge branches 'pm-cpufreq-sched' and 'intel_pstate'Rafael J. Wysocki
* pm-cpufreq-sched: cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race * intel_pstate: cpufreq: intel_pstate: Fix ratio setting for min_perf_pct
2017-07-14sched/deadline: Fix confusing comments about selection of top pi-waiterJoel Fernandes
This comment in the code is incomplete, and I believe it begs a definition of dl_boosted to make sense of the condition that follows. Rewrite the comment and also rearrange the condition that follows to reflect the first condition "we have a top pi-waiter which is a SCHED_DEADLINE task" in that order. Also fix a typo that follows. Signed-off-by: Joel Fernandes <joelaf@google.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170713022429.10307-1-joelaf@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-14sched/cputime: Don't use smp_processor_id() in preemptible contextWanpeng Li
Recent kernels trigger this warning: BUG: using smp_processor_id() in preemptible [00000000] code: 99-trinity/181 caller is debug_smp_processor_id+0x17/0x19 CPU: 0 PID: 181 Comm: 99-trinity Not tainted 4.12.0-01059-g2a42eb9 #1 Call Trace: dump_stack+0x82/0xb8 check_preemption_disabled() debug_smp_processor_id() vtime_delta() task_cputime() thread_group_cputime() thread_group_cputime_adjusted() wait_consider_task() do_wait() SYSC_wait4() do_syscall_64() entry_SYSCALL64_slow_path() As Frederic pointed out: | Although those sched_clock_cpu() things seem to only matter when the | sched_clock() is unstable. And that stability is a condition for nohz_full | to work anyway. So probably sched_clock() alone would be enough. This patch fixes it by replacing sched_clock_cpu() with sched_clock() to avoid calling smp_processor_id() in a preemptible context. Reported-by: Xiaolong Ye <xiaolong.ye@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1499586028-7402-1-git-send-email-wanpeng.li@hotmail.com [ Prettified the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-12cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() raceVikram Mulukutla
With a shared policy in place, when one of the CPUs in the policy is hotplugged out and then brought back online, sugov_stop() and sugov_start() are called in order. sugov_stop() removes utilization hooks for each CPU in the policy and does nothing else in the for_each_cpu() loop. sugov_start() on the other hand iterates through the CPUs in the policy and re-initializes the per-cpu structure _and_ adds the utilization hook. This implies that the scheduler is allowed to invoke a CPU's utilization update hook when the rest of the per-cpu structures have yet to be re-inited. Apart from some strange values in tracepoints this doesn't cause a problem, but if we do end up accessing a pointer from the per-cpu sugov_cpu structure somewhere in the sugov_update_shared() path, we will likely see crashes since the memset for another CPU in the policy is free to race with sugov_update_shared from the CPU that is ready to go. So let's fix this now to first init all per-cpu structures, and then add the per-cpu utilization update hooks all at once. Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-05sched/fair: Fix load_balance() affinity redo pathJeffrey Hugo
If load_balance() fails to migrate any tasks because all tasks were affined, load_balance() removes the source CPU from consideration and attempts to redo and balance among the new subset of CPUs. There is a bug in this code path where the algorithm considers all active CPUs in the system (minus the source that was just masked out). This is not valid for two reasons: some active CPUs may not be in the current scheduling domain and one of the active CPUs is dst_cpu. These CPUs should not be considered, as we cannot pull load from them. Instead of failing out of load_balance(), we may end up redoing the search with no valid CPUs and incorrectly concluding the domain is balanced. Additionally, if the group_imbalance flag was just set, it may also be incorrectly unset, thus the flag will not be seen by other CPUs in future load_balance() runs as that algorithm intends. Fix the check by removing CPUs not in the current domain and the dst_cpu from considertation, thus limiting the evaluation to valid remaining CPUs from which load might be migrated. Co-authored-by: Austin Christ <austinwc@codeaurora.org> Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Tyler Baicar <tbaicar@codeaurora.org> Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Austin Christ <austinwc@codeaurora.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Timur Tabi <timur@codeaurora.org> Link: http://lkml.kernel.org/r/1496863138-11322-2-git-send-email-jhugo@codeaurora.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05sched/cputime: Accumulate vtime on top of nsec clocksourceWanpeng Li
Currently the cputime source used by vtime is jiffies. When we cross a context boundary and jiffies have changed since the last snapshot, the pending cputime is accounted to the switching out context. This system works ok if the ticks are not aligned across CPUs. If they instead are aligned (ie: all fire at the same time) and the CPUs run in userspace, the jiffies change is only observed on tick exit and therefore the user cputime is accounted as system cputime. This is because the CPU that maintains timekeeping fires its tick at the same time as the others. It updates jiffies in the middle of the tick and the other CPUs see that update on IRQ exit: CPU 0 (timekeeper) CPU 1 ------------------- ------------- jiffies = N ... run in userspace for a jiffy tick entry tick entry (sees jiffies = N) set jiffies = N + 1 tick exit tick exit (sees jiffies = N + 1) account 1 jiffy as stime Fix this with using a nanosec clock source instead of jiffies. The cputime is then accumulated and flushed everytime the pending delta reaches a jiffy in order to mitigate the accounting overhead. [ fweisbec: changelog, rebase on struct vtime, field renames, add delta on cputime readers, keep idle vtime as-is (low overhead accounting), harmonize clock sources. ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Luiz Capitulino <lcapitulino@redhat.com> Tested-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1498756511-11714-6-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>