path: root/kernel/sched
Age | Commit message | Author
2020-11-10sched: Deny self-issued __set_cpus_allowed_ptr() when migrate_disable()Valentin Schneider
  migrate_disable();
  set_cpus_allowed_ptr(current, {something excluding task_cpu(current)});
  affine_move_task(); <-- never returns

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201013140116.26651-1-valentin.schneider@arm.com
2020-11-10sched: Fix migrate_disable() vs rt/dl balancingPeter Zijlstra
To minimize the interference of migrate_disable() on lower priority tasks, which can be deprived of runtime by being stuck below a higher priority task, teach the RT/DL balancers to push away these higher priority tasks when a lower priority task gets selected to run on a freshly demoted CPU (pull). This adds migration interference to the higher priority task, but restores bandwidth to the system that would otherwise be irrevocably lost. Without this it would be possible to have all tasks on the system stuck on a single CPU, each task preempted in a migrate_disable() section with a single high priority task running. This way we can still approximate running the M highest priority tasks on the system. Migrating the top task away is (of course) still subject to migrate_disable() too, which means the lower task is subject to an interference equivalent to the worst case migrate_disable() section. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102347.499155098@infradead.org
2020-11-10sched, lockdep: Annotate ->pi_lock recursionPeter Zijlstra
There's a valid ->pi_lock recursion issue where the actual PI code tries to wake up the stop task. Make lockdep aware so it doesn't complain about this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102347.406912197@infradead.org
2020-11-10sched,rt: Use the full cpumask for balancingPeter Zijlstra
We want migrate_disable() tasks to get PULLs in order for them to PUSH away the higher priority task. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102347.310519774@infradead.org
2020-11-10sched,rt: Use cpumask_any*_distribute()Peter Zijlstra
Replace a bunch of cpumask_any*() instances with cpumask_any*_distribute(). By injecting this little bit of randomness into the CPU selection, we reduce the chance that two competing balance operations working off the same lowest_mask pick the same CPU. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102347.190759694@infradead.org
2020-11-10sched/core: Make migrate disable and CPU hotplug cooperativeThomas Gleixner
On CPU unplug, tasks which are in a migrate-disabled region cannot be pushed to a different CPU until they have returned to a migratable state. Account the number of tasks on a runqueue which are in a migrate-disabled section and make the hotplug wait mechanism respect that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102347.067278757@infradead.org
2020-11-10sched: Fix migrate_disable() vs set_cpus_allowed_ptr()Peter Zijlstra
Concurrent migrate_disable() and set_cpus_allowed_ptr() have interesting interactions. We rely on set_cpus_allowed_ptr() to not return until the task runs inside the provided mask. This expectation is exported to userspace. This means that any set_cpus_allowed_ptr() caller must wait until migrate_enable() allows migrations. At the same time, we don't want migrate_enable() to schedule, due to patterns like:

  preempt_disable();
  migrate_disable();
  ...
  migrate_enable();
  preempt_enable();

and:

  raw_spin_lock(&B);
  spin_unlock(&A);

This means that when migrate_enable() must restore the affinity mask, it cannot wait for completion thereof. As luck would have it, that is exactly the case where there is a pending set_cpus_allowed_ptr(), so let that provide storage for the async stop machine. Many thanks to Valentin, who used TLA+ most effectively and found lots of 'interesting' cases. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.921768277@infradead.org
2020-11-10sched: Add migrate_disable()Peter Zijlstra
Add the base migrate_disable() support (under protest). While migrate_disable() is (currently) required for PREEMPT_RT, it is also one of the biggest flaws in the system. Notably, this is just the base implementation; it is broken vs sched_setaffinity() and hotplug, both of which are solved in additional patches for ease of review. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.818170844@infradead.org
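For illustration only (not part of the commit), a minimal sketch of the usage pattern migrate_disable() enables; the per-CPU variable and function names are hypothetical, and header locations are approximate:

  #include <linux/preempt.h>    /* migrate_disable() / migrate_enable() */
  #include <linux/percpu.h>

  struct my_stats { unsigned long events; };
  static DEFINE_PER_CPU(struct my_stats, stats);

  static void record_event(void)
  {
          struct my_stats *s;

          migrate_disable();              /* the task stays on this CPU, but remains preemptible */
          s = this_cpu_ptr(&stats);       /* pointer stays valid: no migration can occur */
          s->events++;                    /* note: does not protect against other tasks on this CPU */
          migrate_enable();
  }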
2020-11-10sched: Massage set_cpus_allowed()Peter Zijlstra
Thread a u32 flags word through the *set_cpus_allowed*() callchain. This will allow adding behavioural tweaks for future users. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.729082820@infradead.org
2020-11-10sched: Fix hotplug vs CPU bandwidth controlPeter Zijlstra
Since we now migrate tasks away before DYING, we should also move the bandwidth unthrottle; otherwise we can gain tasks from unthrottle after we expect all tasks to be gone already. Also, it looks like the RT balancers don't respect cpu_active() and instead partly rely on rq->online; complete this. This too requires that we do set_rq_offline() earlier to match the cpu_active() semantics. (The bigger patch is to convert RT to cpu_active() entirely.) Since set_rq_online() is called from sched_cpu_activate(), place set_rq_offline() in sched_cpu_deactivate(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.639538965@infradead.org
2020-11-10sched/hotplug: Consolidate task migration on CPU unplugThomas Gleixner
With the new mechanism which kicks tasks off the outgoing CPU at the end of schedule(), the situation on an outgoing CPU right before the stopper thread brings it down completely is: - All user tasks and all unbound kernel threads have either been migrated away or are not running, and the next wakeup will move them to an online CPU. - All per-CPU kernel threads, except the CPU hotplug thread and the stopper thread, have either been unbound or parked by the responsible CPU hotplug callback. That means that at the last step before the stopper thread is invoked, the CPU hotplug thread is the last legitimate running task on the outgoing CPU. Add a final wait step right before the stopper thread is kicked which ensures that any still running tasks on the way to park or on the way to kick themselves off the CPU are either sleeping or gone. This allows removing the migrate_tasks() crutch in sched_cpu_dying(). If sched_cpu_dying() detects that there is still another running task aside from the stopper thread then it will explode with the appropriate fireworks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.547163969@infradead.org
2020-11-10sched/core: Wait for tasks being pushed away on hotplugThomas Gleixner
RT kernels need to ensure that all tasks which are not per CPU kthreads have left the outgoing CPU to guarantee that no tasks are force migrated within a migrate disabled section. There is also some desire to (ab)use fine grained CPU hotplug control to clear a CPU from active state to force migrate tasks which are not per CPU kthreads away for power control purposes. Add a mechanism which waits until all tasks which should leave the CPU after the CPU active flag is cleared have moved to a different online CPU. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.377836842@infradead.org
2020-11-10sched/hotplug: Ensure only per-cpu kthreads run during hotplugPeter Zijlstra
In preparation for migrate_disable(), make sure only per-cpu kthreads are allowed to run on !active CPUs. This is run (as one of the very first steps) from the cpu-hotplug task, which is a per-cpu kthread, and completion of the hotplug operation only requires such tasks. This constraint enables the migrate_disable() implementation to wait for completion of all migrate_disable regions on this CPU at hotplug time without fear of any new ones starting. This replaces the unlikely(rq->balance_callbacks) test at the tail of context_switch with an unlikely(rq->balance_work); the fast path is not affected. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.292709163@infradead.org
2020-11-10sched: Fix balance_callback()Peter Zijlstra
The intent of balance_callback() has always been to delay executing balancing operations until the end of the current rq->lock section. This is because balance operations must often drop rq->lock, and that isn't safe in general. However, as noted by Scott, there were a few holes in that scheme; balance_callback() was called after rq->lock was dropped, which means another CPU could interleave and touch the callback list. Rework the code to call the balance callbacks before dropping rq->lock where possible, and otherwise splice the balance list onto a local stack. This guarantees that the balance list must be empty when we take rq->lock. IOW, we'll only ever run our own balance callbacks. Reported-by: Scott Wood <swood@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.203901269@infradead.org
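A simplified sketch of the "splice onto a local stack" idea described above (not the exact kernel code; the helper name and details are illustrative):

  /* Detach the pending callbacks while rq->lock is still held. */
  static struct callback_head *splice_balance_callbacks(struct rq *rq)
  {
          struct callback_head *head = rq->balance_callback;

          lockdep_assert_held(&rq->lock);
          rq->balance_callback = NULL;    /* anyone re-taking rq->lock now sees an empty list */
          return head;
  }

The caller then re-acquires rq->lock and runs only the entries it detached itself, so callbacks queued by other CPUs are never executed on their behalf.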
2020-11-10stop_machine: Add function and caller debug infoPeter Zijlstra
Crashes in stop-machine are hard to connect to the calling code, add a little something to help with that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.116513635@infradead.org
2020-11-10sched/debug: Fix memory corruption caused by multiple small reads of flagsColin Ian King
Reading /proc/sys/kernel/sched_domain/cpu*/domain0/flags multiple times with small reads causes oopses with SLUB corruption issues, because kfree() is freeing an offset from a previous allocation. Fix this by adding a new pointer 'buf' for the allocation and kfree, and use the temporary pointer 'tmp' to handle memory copies of the buf offsets. Fixes: 5b9f8ff7b320 ("sched/debug: Output SD flag names rather than their values") Reported-by: Jeff Bastian <jbastian@redhat.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20201029151103.373410-1-colin.king@canonical.com
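A hedged sketch of the bug class being fixed (simplified and hypothetical; this is not the actual sched_domain flags code, and 'nr_flags'/'flag_name' are made-up names):

  static int show_flags(struct seq_file *m, const char **flag_name, int nr_flags, size_t len)
  {
          char *buf, *tmp;
          int i;

          buf = kmalloc(len, GFP_KERNEL);
          if (!buf)
                  return -ENOMEM;
          buf[0] = '\0';

          tmp = buf;                      /* cursor advanced while formatting */
          for (i = 0; i < nr_flags; i++)
                  tmp += scnprintf(tmp, buf + len - tmp, "%s ", flag_name[i]);

          seq_puts(m, buf);
          kfree(buf);                     /* free the original pointer; kfree(tmp) would corrupt SLUB */
          return 0;
  }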
2020-11-10sched/fair: Prefer prev cpu in asymmetric wakeup pathVincent Guittot
During the fast wakeup path, the scheduler always checks whether the local or prev CPUs are good candidates for the task before looking at other CPUs in the domain. With commit b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan") heterogeneous systems gained a dedicated path, but it doesn't try to reuse the prev CPU whenever possible. If the previous CPU is idle and belongs to the LLC domain, we should check it first before looking for another CPU, because it remains one of the best candidates and this also stabilizes task placement on the system. This change aligns the asymmetric path behavior with the symmetric one and reduces the cases where the task migrates across all CPUs of the sd_asym_cpucapacity domains at wakeup. This change does not impact the normal EAS mode, only the overloaded case or when EAS is not used.

- On hikey960 with the performance governor (EAS disabled)

  ./perf bench sched pipe -T -l 50000
                mainline            w/ patch
  # migrations  999364              0
  ops/sec       149313 (+/-0.28%)   182587 (+/-0.40%)   +22%

- On hikey with the performance governor

  ./perf bench sched pipe -T -l 50000
                mainline            w/ patch
  # migrations  0                   0
  ops/sec       47721 (+/-0.76%)    47899 (+/-0.56%)    +0.4%

According to the tests on hikey, the patch doesn't impact symmetric systems compared to the current implementation (only tested on arm64). Also read the uclamped value of the task's utilization at most twice, instead of each time we compare the task's utilization with the CPU's capacity. Fixes: b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20201029161824.26389-1-vincent.guittot@linaro.org
2020-11-10sched/fair: Ensure tasks spreading in LLC during LBVincent Guittot
schbench shows a latency increase for the 95th percentile and above since commit 0b0695f2b34a ("sched/fair: Rework load_balance()"). Align the behavior of the load balancer with the wakeup path, which tries to select an idle CPU which belongs to the LLC for a waking task. calculate_imbalance() will use nr_running instead of the spare capacity when CPUs share resources (i.e. cache) at the domain level. This will ensure a better spread of tasks on idle CPUs. Running schbench on a hikey (8-core arm64) shows the problem:

tip/sched/core: schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10

  Latency percentiles (usec)
    50.0th: 33
    75.0th: 45
    90.0th: 51
    95.0th: 4152
   *99.0th: 14288
    99.5th: 14288
    99.9th: 14288
    min=0, max=14276

tip/sched/core + patch: schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10

  Latency percentiles (usec)
    50.0th: 34
    75.0th: 47
    90.0th: 52
    95.0th: 78
   *99.0th: 94
    99.5th: 94
    99.9th: 94
    min=0, max=94

Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()") Reported-by: Chris Mason <clm@fb.com> Suggested-by: Rik van Riel <riel@surriel.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Rik van Riel <riel@surriel.com> Link: https://lkml.kernel.org/r/20201102102457.28808-1-vincent.guittot@linaro.org
2020-11-10cpufreq: Introduce governor flagsRafael J. Wysocki
A new cpufreq governor flag will be added subsequently, so replace the bool dynamic_switching field in struct cpufreq_governor with a flags field and introduce CPUFREQ_GOV_DYNAMIC_SWITCHING to be set for the "dynamic switching" governors instead. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-11-10sched/fair: Reorder throttle_cfs_rq() pathPeng Wang
As commit: 39f23ce07b93 ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list") does in unthrottle_cfs_rq(), throttle_cfs_rq() can also use the same pattern as dequeue_task_fair(). No functional changes. Signed-off-by: Peng Wang <rocking@linux.alibaba.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Phil Auld <pauld@redhat.com> Cc: Ben Segall <bsegall@google.com> Link: https://lore.kernel.org/r/f11dd2e3ab35cc538e2eb57bf0c99b6eaffce127.1604973978.git.rocking@linux.alibaba.com
2020-11-02cpufreq: schedutil: Don't skip freq update if need_freq_update is setViresh Kumar
The cpufreq policy's frequency limits (min/max) can get changed at any point of time, while schedutil is trying to update the next frequency. Though the schedutil governor has the necessary locking and support in place to make sure we don't miss any of those updates, there is a corner case where the governor will find that the CPU is already running at the desired frequency and so may skip an update. For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz and is running at 1 GHz currently. Schedutil tries to update the frequency to 1.2 GHz; during this time the policy limits get changed to policy->min = 1.4 GHz. As schedutil (and the cpufreq core) does clamp the frequency at various instances, we will eventually set the frequency to 1.4 GHz, while we will save 1.2 GHz in sg_policy->next_freq. Now let's say the policy limits get changed back at this time with policy->min as 1 GHz. The next time schedutil is invoked by the scheduler, we will reevaluate the next frequency (because need_freq_update will get set due to the limits change event) and let's say we want to set the frequency to 1.2 GHz again. At this point sugov_update_next_freq() will find that next_freq == current_freq and will abort the update, while the CPU actually runs at 1.4 GHz. Until now need_freq_update was used as a flag to indicate that the policy's frequency limits have changed, and that we should consider the new limits while reevaluating the next frequency. This patch fixes the above-mentioned issue by extending the purpose of the need_freq_update flag. If this flag is set now, the schedutil governor will not try to abort a frequency change even if next_freq == current_freq. As similar behavior is required in the case of the CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be set to false if that flag is set for the driver. We also don't need to consider the need_freq_update flag in sugov_update_single() anymore to handle the special case of a busy CPU, as we won't abort a frequency update anymore. Reported-by: zhuguangqing <zhuguangqing@xiaomi.com> Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> [ rjw: Rearrange code to avoid a branch ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
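Roughly, the resulting check in sugov_update_next_freq() looks like the sketch below (reconstructed from the description above; treat it as an approximation rather than the exact code):

  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
                                     unsigned int next_freq)
  {
          if (sg_policy->need_freq_update)
                  /* Stay "dirty" only for drivers that always need limit updates. */
                  sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
          else if (sg_policy->next_freq == next_freq)
                  return false;           /* nothing changed and no pending limits update */

          sg_policy->next_freq = next_freq;
          sg_policy->last_freq_update_time = time;
          return true;
  }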
2020-10-29cpufreq: schedutil: Always call driver if CPUFREQ_NEED_UPDATE_LIMITS is setRafael J. Wysocki
Because sugov_update_next_freq() may skip a frequency update even if the need_freq_update flag has been set for the policy at hand, policy limits updates may not take effect as expected. For example, if the intel_pstate driver operates in the passive mode with HWP enabled, it needs to update the HWP min and max limits when the policy min and max limits change, respectively, but that may not happen if the target frequency does not change along with the limit at hand. In particular, if the policy min is changed first, causing the target frequency to be adjusted to it, and the policy max limit is changed later to the same value, the HWP max limit will not be updated to follow it as expected, because the target frequency is still equal to the policy min limit and it will not change until that limit is updated. To address this issue, modify get_next_freq() to let the driver callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag is set regardless of whether or not the new frequency to set is equal to the previous one. Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled") Reported-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ... Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: a62f68f5ca53 cpufreq: Introduce cpufreq_driver_test_flags() Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-29sched/fair: Check for idle core in wake_affineJulia Lawall
In the case of a thread wakeup, wake_affine determines whether a core will be chosen for the thread on the socket where the thread ran previously or on the socket of the waker. This is done primarily by comparing the load of the core where the thread ran previously (prev) and the load of the waker (this). Commit 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path") changed the load computation from the runnable load to the load average, where the latter includes the load of threads that have already blocked on the core. When a short-running daemon process happens to run on prev, this change raises the possibility that prev appears to have a greater load than this, even when prev is actually idle. When prev and this are on the same socket, the idle prev is detected later, in select_idle_sibling. But if that does not hold, prev is completely ignored, causing the waking thread to move to the socket of the waker. In the case of N mostly active threads on N cores, this triggers other migrations and hurts performance. In contrast, before commit 11f10e5420f6, the load on an idle core was 0, and in the case of a non-idle waker core, the effect of wake_affine was to select prev as the target for searching for a core for the waking thread. To avoid unnecessary migrations, extend wake_affine_idle to check whether the core where the thread previously ran is currently idle, and if so simply return that core as the target. [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable load in wakeup path") This particularly has an impact when using the ondemand power manager, where kworkers run every 0.004 seconds on all cores, increasing the likelihood that an idle core will be considered to have a load. The following numbers were obtained with the benchmarking tool hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel benchmarks (https://www.nas.nasa.gov/publications/npb.html). The tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz. Active (intel_pstate) and passive (intel_cpufreq) power management were used. Times are in seconds. All experiments use all 160 hardware threads.

          v5.9/intel-pstate      v5.9+patch/intel-pstate
  bt.C.c  24.725724+-0.962340    23.349608+-1.607214
  lu.C.x  29.105952+-4.804203    25.249052+-5.561617
  sp.C.x  31.220696+-1.831335    30.227760+-2.429792
  ua.C.x  26.606118+-1.767384    25.778367+-1.263850

          v5.9/ondemand          v5.9+patch/ondemand
  bt.C.c  25.330360+-1.028316    23.544036+-1.020189
  lu.C.x  35.872659+-4.872090    23.719295+-3.883848
  sp.C.x  32.141310+-2.289541    29.125363+-0.872300
  ua.C.x  29.024597+-1.667049    25.728888+-1.539772

On the smaller data sets (A and B) and on the other NAS benchmarks there is no impact on performance. This also has a major impact on the splash2x.volrend benchmark of the parsec benchmark suite, which goes from 1m25 without this patch to 0m45, in active (intel_pstate) mode. Fixes: 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path") Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/1603372550-14680-1-git-send-email-Julia.Lawall@inria.fr
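The essence of the change, as a hedged sketch (simplified; not the full wake_affine_idle()):

  /* In wake_affine_idle(), before giving up and letting the waker's socket win: */
  if (available_idle_cpu(prev_cpu))
          return prev_cpu;        /* prev is idle: keep the task on its previous core */

  return nr_cpumask_bits;         /* no preference: fall through to the normal selection */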
2020-10-29sched: Remove relyance on STRUCT_ALIGNMENTPeter Zijlstra
Florian reported that all of kernel/sched/ is rebuilt when CONFIG_BLK_DEV_INITRD is changed, which, while not a bug, is unexpected. This is due to us including vmlinux.lds.h. Jakub explained that the problem is that we put the alignment requirement on the type instead of on a variable. Type alignment is a minimum; the compiler is free to pick any larger alignment for a specific instance of the type (e.g. the variable). So force the type alignment on all individual variable definitions and remove the undesired dependency on vmlinux.lds.h. Fixes: 85c2ce9104eb ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9") Reported-by: Florian Fainelli <f.fainelli@gmail.com> Suggested-by: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
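An illustration of the type-vs-variable alignment distinction described above (hypothetical struct, not the kernel's actual definitions):

  struct item {
          int x;
  } __attribute__((aligned(64)));                /* minimum alignment of the *type* */

  struct item a;                                 /* compiler may pick a larger alignment here */
  struct item b __attribute__((aligned(64)));    /* alignment pinned on this *variable* */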
2020-10-29sched: Reenable interrupts in do_sched_yield()Thomas Gleixner
do_sched_yield() invokes schedule() with interrupts disabled, which is not allowed. This goes back to the pre-git era, to commit a6efb709806c ("[PATCH] irqlock patch 2.5.27-H6") in the history tree. Reenable interrupts and remove the misleading comment which "explains" it. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
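Roughly how the fixed function looks (a sketch based on the description above; the exact code may differ):

  static long do_sched_yield(void)
  {
          struct rq_flags rf;
          struct rq *rq;

          rq = this_rq_lock_irq(&rf);

          schedstat_inc(rq->yld_count);
          current->sched_class->yield_task(rq);

          preempt_disable();
          rq_unlock_irq(rq, &rf);                 /* reenable interrupts before schedule() */
          sched_preempt_enable_no_resched();

          schedule();
          return 0;
  }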
2020-10-29sched: membarrier: document memory ordering scenariosMathieu Desnoyers
Document membarrier ordering scenarios in membarrier.c. Thanks to Alan Stern for refreshing my memory. Now that I have those in mind, it seems appropriate to serialize them to comments for posterity. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201020134715.13909-4-mathieu.desnoyers@efficios.com
2020-10-29sched: membarrier: cover kthread_use_mm (v4)Mathieu Desnoyers
Add comments and a memory barrier to kthread_use_mm and kthread_unuse_mm to allow the effect of membarrier(2) to apply to kthreads accessing user-space memory as well. Given that no prior kthread uses this guarantee and that it only affects kthreads, adding this guarantee does not affect user-space ABI. Refine the check in membarrier_global_expedited to exclude runqueues running the idle thread rather than all kthreads from the IPI cpumask. Now that membarrier_global_expedited can IPI kthreads, the scheduler also needs to update the runqueue's membarrier_state when entering lazy TLB state. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoyers@efficios.com
2020-10-29sched: fix exit_mm vs membarrier (v4)Mathieu Desnoyers
exit_mm should issue memory barriers after user-space memory accesses, before clearing current->mm, to order user-space memory accesses performed prior to exit_mm before clearing tsk->mm, which has the effect of skipping the membarrier private expedited IPIs. exit_mm should also update the runqueue's membarrier_state so membarrier global expedited IPIs are not sent when they are not needed. The membarrier system call can be issued concurrently with do_exit if we have thread groups created with CLONE_VM but not CLONE_THREAD. Here is the scenario I have in mind: Two thread groups are created, A and B. Thread group B is created by issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD. Let's assume we have a single thread within each thread group (Thread A and Thread B). AFAIU we can have:

Userspace variables:

  int x = 0, y = 0;

  CPU 0                       CPU 1
  Thread A                    Thread B
  (in thread group A)         (in thread group B)

  x = 1
  barrier()
  y = 1
  exit()
  exit_mm()
    current->mm = NULL;
                              r1 = load y
                              membarrier()
                                skips CPU 0 (no IPI)
                                because its current mm is NULL
                              r2 = load x
                              BUG_ON(r1 == 1 && r2 == 0)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoyers@efficios.com
2020-10-29sched/fair: Exclude the current CPU from find_new_ilb()Peter Zijlstra
It is possible for find_new_ilb() to select the current CPU, however, this only happens from newidle balancing, in which case need_resched() will be true, and consequently nohz_csd_func() will not trigger the softirq. Exclude the current CPU from becoming an ILB target. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29sched/cpupri: Add CPUPRI_HIGHERPeter Zijlstra
Add CPUPRI_HIGHER above the RT99 priority to denote the CPU is in use by higher priority tasks (specifically deadline). XXX: we should probably drive PUSH-PULL from cpupri, that would automagically result in an RT-PUSH when DL sets cpupri to CPUPRI_HIGHER. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2020-10-29sched/cpupri: Remap CPUPRI_NORMAL to MAX_RT_PRIO-1Peter Zijlstra
This makes the mapping continuous and frees up 100 for other usage.

Prev mapping:

  p->rt_priority   p->prio   newpri   cpupri
                                 -1       -1 (CPUPRI_INVALID)
                                100        0 (CPUPRI_NORMAL)
               1        98       98        1
             ...
              49        50       50       49
              50        49       49       50
             ...
              99         0        0       99

New mapping:

  p->rt_priority   p->prio   newpri   cpupri
                                 -1       -1 (CPUPRI_INVALID)
                                 99        0 (CPUPRI_NORMAL)
               1        98       98        1
             ...
              49        50       50       49
              50        49       49       50
             ...
              99         0        0       99

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2020-10-29sched/cpupri: Remove pri_to_cpu[1]Dietmar Eggemann
pri_to_cpu[1] isn't used since cpupri_set(..., newpri) is never called with newpri = 99. The valid RT priorities RT1..RT99 (p->rt_priority = [1..99]) map into cpupri (idx of pri_to_cpu[]) = [2..100].

Current mapping:

  p->rt_priority   p->prio   newpri   cpupri
                                 -1       -1 (CPUPRI_INVALID)
                                100        0 (CPUPRI_NORMAL)
               1        98       98        2
             ...
              49        50       50       50
              50        49       49       51
             ...
              99         0        0      100

So cpupri = 1 isn't used. Reduce the size of pri_to_cpu[] by 1 and adapt the cpupri implementation accordingly. This will save a useless for loop with an atomic_read in cpupri_find_fitness() calling __cpupri_find().

New mapping:

  p->rt_priority   p->prio   newpri   cpupri
                                 -1       -1 (CPUPRI_INVALID)
                                100        0 (CPUPRI_NORMAL)
               1        98       98        1
             ...
              49        50       50       49
              50        49       49       50
             ...
              99         0        0       99

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200922083934.19275-3-dietmar.eggemann@arm.com
2020-10-29sched/cpupri: Remove pri_to_cpu[CPUPRI_IDLE]Dietmar Eggemann
pri_to_cpu[CPUPRI_IDLE=0] isn't used since cpupri_set(..., newpri) is never called with newpri = MAX_PRIO (140).

Current mapping:

  p->rt_priority   p->prio   newpri   cpupri
                                 -1       -1 (CPUPRI_INVALID)
                                140        0 (CPUPRI_IDLE)
                                100        1 (CPUPRI_NORMAL)
               1        98       98        3
             ...
              49        50       50       51
              50        49       49       52
             ...
              99         0        0      101

Even when cpupri was introduced with commit 6e0534f27819 ("sched: use a 2-d bitmap for searching lowest-pri CPU") in v2.6.27, only (1) CPUPRI_INVALID (-1), (2) MAX_RT_PRIO (100), (3) an RT prio (RT1..RT99) were used as newpri in cpupri_set(..., newpri) -> convert_prio(newpri). MAX_RT_PRIO is used only in dec_rt_tasks() -> dec_rt_prio() -> dec_rt_prio_smp() -> cpupri_set() in case of !rt_rq->rt_nr_running, i.e. it stands for a non-rt task, including the IDLE task. Commit 57785df5ac53 ("sched: Fix task priority bug") removed code in v2.6.33 which did set the priority of the IDLE task to MAX_PRIO. Although this happened after the introduction of cpupri, it didn't have an effect on the values used for cpupri_set(..., newpri). Remove CPUPRI_IDLE and adapt the cpupri implementation accordingly. This will save a useless for loop with an atomic_read in cpupri_find_fitness() calling __cpupri_find().

New mapping:

  p->rt_priority   p->prio   newpri   cpupri
                                 -1       -1 (CPUPRI_INVALID)
                                100        0 (CPUPRI_NORMAL)
               1        98       98        2
             ...
              49        50       50       50
              50        49       49       51
             ...
              99         0        0      100

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200922083934.19275-2-dietmar.eggemann@arm.com
2020-10-29sched/deadline: Fix sched_dl_global_validate()Peng Liu
When changing sched_rt_{runtime, period}_us, we validate that the new settings should at least accommodate the currently allocated -dl bandwidth:

  sched_rt_handler()
    --> sched_dl_bandwidth_validate()
    {
        new_bw = global_rt_runtime()/global_rt_period();

        for_each_possible_cpu(cpu) {
                dl_b = dl_bw_of(cpu);
                if (new_bw < dl_b->total_bw)    <-------
                        ret = -EBUSY;
        }
    }

But under CONFIG_SMP, dl_bw is per root domain, not per CPU, and dl_b->total_bw is the allocated bandwidth of the whole root domain. Instead, we should compare dl_b->total_bw against "cpus*new_bw", where 'cpus' is the number of CPUs of the root domain. Also, the annotation below (in kernel/sched/sched.h) describes an implementation that only appeared in SCHED_DEADLINE v2 [1]; the deadline scheduler kept evolving until it got merged (v9), but the annotation remained unchanged, meaningless and misleading. Update it.

  * With respect to SMP, the bandwidth is given on a per-CPU basis,
  * meaning that:
  *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
  *  - dl_total_bw array contains, in the i-eth element, the currently
  *    allocated bandwidth on the i-eth CPU.

[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/ Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks") Signed-off-by: Peng Liu <iwtbavbm@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com
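A hedged sketch of the corrected check (simplified; the real function also handles locking and, after the follow-up patch below, avoids revisiting a root domain it has already checked):

  /* Validate per root domain, not per CPU. */
  cpus = dl_bw_cpus(cpu);                 /* number of CPUs in this CPU's root domain */
  if (new_bw * cpus < dl_b->total_bw)     /* compare against the whole domain's allocation */
          ret = -EBUSY;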
2020-10-29sched/deadline: Optimize sched_dl_global_validate()Peng Liu
Under CONFIG_SMP, dl_bw is per root domain, not per CPU. When checking or updating dl_bw, iterating over every CPU is overkill; we just need to iterate over each root domain once. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Peng Liu <iwtbavbm@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/78d21ee792cc48ff79e8cd62a5f26208463684d6.1602171061.git.iwtbavbm@gmail.com
2020-10-29sched/fair: Improve the accuracy of sched_stat_wait statisticsjun qian
When sched_schedstats changes from 0 to 1, some sched entities may already be in the runqueue with se->statistics.wait_start == 0, which makes the computed wait time (rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start) wrong. We need to avoid this scenario. Signed-off-by: jun qian <qianjun.kernel@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lkml.kernel.org/r/20201015064846.19809-1-qianjun.kernel@gmail.com
2020-10-25treewide: Convert macro and uses of __section(foo) to __section("foo")Joe Perches
Use a more generic form for __section that requires quotes to avoid complications with clang and gcc differences. Remove the quote operator # from compiler_attributes.h __section macro. Convert all unquoted __section(foo) uses to quoted __section("foo"). Also convert __attribute__((section("foo"))) uses to __section("foo") even if the __attribute__ has multiple list entry forms. Conversion done using the script at: https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
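An illustration of the conversion (the variable and section names are hypothetical; the "before" and "after" forms would not normally coexist in one file):

  /* Before: the macro stringified its argument with '#', so callers wrote it unquoted. */
  static int early_done __section(.mydata);

  /* After: the section name is passed as a string literal. */
  static int early_done __section(".mydata");

  /* Open-coded attributes were converted the same way: */
  static int early_flag __attribute__((__section__(".mydata")));   /* old */
  static int early_flag2 __section(".mydata");                     /* new */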
2020-10-25Merge tag 'sched-urgent-2020-10-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Two scheduler fixes: - A trivial build fix for sched_feat() to compile correctly with CONFIG_JUMP_LABEL=n - Replace a zero-length array with a flexible array" * tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/features: Fix !CONFIG_JUMP_LABEL case sched: Replace zero-length array with flexible-array
2020-10-23Merge tag 'pm-5.10-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "First of all, the adaptive voltage scaling (AVS) drivers go to new platform-specific locations as planned (this part was reported to have merge conflicts against the new arm-soc updates in linux-next). In addition to that, there are some fixes (intel_idle, intel_pstate, RAPL, acpi_cpufreq), the addition of on/off notifiers and idle state accounting support to the generic power domains (genpd) code and some janitorial changes all over. Specifics: - Move the AVS drivers to new platform-specific locations and get rid of the drivers/power/avs directory (Ulf Hansson). - Add on/off notifiers and idle state accounting support to the generic power domains (genpd) framework (Ulf Hansson, Lina Iyer). - Ulf will maintain the PM domain part of cpuidle-psci (Ulf Hansson). - Make intel_idle disregard ACPI _CST if it cannot use the data returned by that method (Mel Gorman). - Modify intel_pstate to avoid leaving useless sysfs directory structure behind if it cannot be registered (Chen Yu). - Fix domain detection in the RAPL power capping driver and prevent it from failing to enumerate the Psys RAPL domain (Zhang Rui). - Allow acpi-cpufreq to use ACPI _PSD information with Family 19 and later AMD chips (Wei Huang). - Update the driver assumptions comment in intel_idle and fix a kerneldoc comment in the runtime PM framework (Alexander Monakov, Bean Huo). - Avoid unnecessary resets of the cached frequency in the schedutil cpufreq governor to reduce overhead (Wei Wang). - Clean up the cpufreq core a bit (Viresh Kumar). - Make assorted minor janitorial changes (Daniel Lezcano, Geert Uytterhoeven, Hubert Jasudowicz, Tom Rix). - Clean up and optimize the cpupower utility somewhat (Colin Ian King, Martin Kaistra)" * tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits) PM: sleep: remove unreachable break PM: AVS: Drop the avs directory and the corresponding Kconfig PM: AVS: qcom-cpr: Move the driver to the qcom specific drivers PM: runtime: Fix typo in pm_runtime_set_active() helper comment PM: domains: Fix build error for genpd notifiers powercap: Fix typo in Kconfig "Plance" -> "Plane" cpufreq: schedutil: restore cached freq when next_f is not changed acpi-cpufreq: Honor _PSD table setting on new AMD CPUs PM: AVS: smartreflex Move driver to soc specific drivers PM: AVS: rockchip-io: Move the driver to the rockchip specific drivers PM: domains: enable domain idle state accounting PM: domains: Add curly braces to delimit comment + statement block PM: domains: Add support for PM domain on/off notifiers for genpd powercap/intel_rapl: enumerate Psys RAPL domain together with package RAPL domain powercap/intel_rapl: Fix domain detection intel_idle: Ignore _CST if control cannot be taken from the platform cpuidle: Remove pointless stub intel_idle: mention assumption that WBINVD is not needed MAINTAINERS: Add section for cpuidle-psci PM domain cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver ...
2020-10-19cpufreq: schedutil: restore cached freq when next_f is not changedWei Wang
We have the raw cached freq to reduce the chance of calling into the cpufreq driver, which can be costly on some archs/SoCs. Currently, the raw cached freq is reset in sugov_update_single() when it avoids a frequency reduction (which is not desirable sometimes), but it is better to restore the previous value of it in that case, because it may not change in the next cycle and it is not necessary to change the CPU frequency then. Adapted from https://android-review.googlesource.com/1352810/ Signed-off-by: Wei Wang <wvw@google.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> [ rjw: Subject edit and changelog rewrite ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
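A hedged sketch of the idea in sugov_update_single() (reconstructed from the description; the actual code may differ slightly):

  unsigned int cached_freq = sg_policy->cached_raw_freq;   /* snapshot before evaluation */
  ...
  if (busy && next_f < sg_policy->next_freq) {
          next_f = sg_policy->next_freq;          /* refuse to lower the frequency of a busy CPU */

          /* Restore the cached raw freq, since next_freq is not going to change. */
          sg_policy->cached_raw_freq = cached_freq;
  }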
2020-10-17task_work: cleanup notification modesJens Axboe
A previous commit changed the notification mode from true/false to an int, allowing notify-no, notify-yes, or signal-notify. This was backwards compatible in the sense that any existing true/false user would translate to either 0 (no notification sent) or 1, the latter of which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2. Clean this up properly, and define a proper enum for the notification mode. Now we have: - TWA_NONE. This is 0, same as before the original change, meaning no notification requested. - TWA_RESUME. This is 1, same as before the original change, meaning that we use TIF_NOTIFY_RESUME. - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the notification. Clean up all the callers, switching their 0/1/false/true to using the appropriate TWA_* mode for notifications. Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
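The resulting modes as an enum, plus a representative caller (a sketch based on the description above; 'work' is a hypothetical structure with an embedded callback_head named 'cb'):

  enum task_work_notify_mode {
          TWA_NONE,       /* 0: queue the work, send no notification        */
          TWA_RESUME,     /* 1: notify via TIF_NOTIFY_RESUME                */
          TWA_SIGNAL,     /* 2: notify via TIF_SIGPENDING/JOBCTL_TASK_WORK  */
  };

  /* A caller that previously passed 'true' now passes TWA_RESUME: */
  ret = task_work_add(task, &work->cb, TWA_RESUME);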
2020-10-14sched/features: Fix !CONFIG_JUMP_LABEL caseJuri Lelli
Commit 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds") made sched features static for !CONFIG_SCHED_DEBUG configurations, but overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL case. For the latter, echoing changes to /sys/kernel/debug/sched_features has the nasty effect of effectively changing what sched_features reports, but without actually changing the scheduler behaviour (since different translation units get different sysctl_sched_features). Fix the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configuration by properly restructuring the ifdefs. Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds") Co-developed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Patrick Bellasi <patrick.bellasi@matbug.net> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com
2020-10-14sched: Replace zero-length array with flexible-arrayzhuguangqing
In the following commit: 04f5c362ec6d: ("sched/fair: Replace zero-length array with flexible-array") a zero-length array cpumask[0] has been replaced with cpumask[]. But there is still a cpumask[0] in 'struct sched_group_capacity' which was missed. The point of using [] instead of [0] is that with [] the compiler will generate a build warning if it isn't the last member of a struct. [ mingo: Rewrote the changelog. ] Signed-off-by: zhuguangqing <zhuguangqing@xiaomi.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20201014140220.11384-1-zhuguangqing83@gmail.com
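An illustration of the difference (hypothetical struct names; the real change touches 'struct sched_group_capacity'):

  struct sgc_old {
          unsigned long id;
          unsigned long cpumask[0];   /* zero-length array: no diagnostics if misplaced */
  };

  struct sgc_new {
          unsigned long id;
          unsigned long cpumask[];    /* flexible array member: the compiler warns if it
                                       * is not the last member of the struct */
  };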
2020-10-14Merge tag 'pm-5.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These rework the collection of cpufreq statistics to allow it to take place if fast frequency switching is enabled in the governor, rework the frequency invariance handling in the cpufreq core and drivers, add new hardware support to a couple of cpufreq drivers, fix a number of assorted issues and clean up the code all over. Specifics: - Rework cpufreq statistics collection to allow it to take place when fast frequency switching is enabled in the governor (Viresh Kumar). - Make the cpufreq core set the frequency scale on behalf of the driver and update several cpufreq drivers accordingly (Ionela Voinescu, Valentin Schneider). - Add new hardware support to the STI and qcom cpufreq drivers and improve them (Alain Volmat, Manivannan Sadhasivam). - Fix multiple assorted issues in cpufreq drivers (Jon Hunter, Krzysztof Kozlowski, Matthias Kaehlcke, Pali Rohár, Stephan Gerhold, Viresh Kumar). - Fix several assorted issues in the operating performance points (OPP) framework (Stephan Gerhold, Viresh Kumar). - Allow devfreq drivers to fetch devfreq instances by DT enumeration instead of using explicit phandles and modify the devfreq core code to support driver-specific devfreq DT bindings (Leonard Crestez, Chanwoo Choi). - Improve initial hardware resetting in the tegra30 devfreq driver and clean up the tegra cpuidle driver (Dmitry Osipenko). - Update the cpuidle core to collect state entry rejection statistics and expose them via sysfs (Lina Iyer). - Improve the ACPI _CST code handling diagnostics (Chen Yu). - Update the PSCI cpuidle driver to allow the PM domain initialization to occur in the OSI mode as well as in the PC mode (Ulf Hansson). - Rework the generic power domains (genpd) core code to allow domain power off transition to be aborted in the absence of the "power off" domain callback (Ulf Hansson). - Fix two suspend-to-idle issues in the ACPI EC driver (Rafael Wysocki). - Fix the handling of timer_expires in the PM-runtime framework on 32-bit systems and the handling of device links in it (Grygorii Strashko, Xiang Chen). - Add IO requests batching support to the hibernate image saving and reading code and drop a bogus get_gendisk() from there (Xiaoyi Chen, Christoph Hellwig). - Allow PCIe ports to be put into the D3cold power state if they are power-manageable via ACPI (Lukas Wunner). - Add missing header file include to a power capping driver (Pujin Shi). - Clean up the qcom-cpr AVS driver a bit (Liu Shixin). 
- Kevin Hilman steps down as designated reviewer of adaptive voltage scaling (AVS) drivers (Kevin Hilman)" * tag 'pm-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits) cpufreq: stats: Fix string format specifier mismatch arm: disable frequency invariance for CONFIG_BL_SWITCHER cpufreq,arm,arm64: restructure definitions of arch_set_freq_scale() cpufreq: stats: Add memory barrier to store_reset() cpufreq: schedutil: Simplify sugov_fast_switch() ACPI: EC: PM: Drop ec_no_wakeup check from acpi_ec_dispatch_gpe() ACPI: EC: PM: Flush EC work unconditionally after wakeup PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI PM: hibernate: remove the bogus call to get_gendisk() in software_resume() cpufreq: Move traces and update to policy->cur to cpufreq core cpufreq: stats: Enable stats for fast-switch as well cpufreq: stats: Mark few conditionals with unlikely() cpufreq: stats: Remove locking cpufreq: stats: Defer stats update to cpufreq_stats_record_transition() PM: domains: Allow to abort power off when no ->power_off() callback PM: domains: Rename power state enums for genpd PM / devfreq: tegra30: Improve initial hardware resetting PM / devfreq: event: Change prototype of devfreq_event_get_edev_by_phandle function PM / devfreq: Change prototype of devfreq_get_devfreq_by_phandle function PM / devfreq: Add devfreq_get_devfreq_by_node function ...
2020-10-12Merge tag 'sched-core-2020-10-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - reorganize & clean up the SD* flags definitions and add a bunch of sanity checks. These new checks caught quite a few bugs or at least inconsistencies, resulting in another set of patches. - rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ - add a new tracepoint to improve CPU capacity tracking - improve overloaded SMP system load-balancing behavior - tweak SMT balancing - energy-aware scheduling updates - NUMA balancing improvements - deadline scheduler fixes and improvements - CPU isolation fixes - misc cleanups, simplifications and smaller optimizations * tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits) sched/deadline: Unthrottle PI boosted threads while enqueuing sched/debug: Add new tracepoint to track cpu_capacity sched/fair: Tweak pick_next_entity() rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ rseq/selftests,x86_64: Add rseq_offset_deref_addv() rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ sched/fair: Use dst group while checking imbalance for NUMA balancer sched/fair: Reduce busy load balance interval sched/fair: Minimize concurrent LBs between domain level sched/fair: Reduce minimal imbalance threshold sched/fair: Relax constraint on task's load during load balance sched/fair: Remove the force parameter of update_tg_load_avg() sched/fair: Fix wrong cpu selecting from isolated domain sched: Remove unused inline function uclamp_bucket_base_value() sched/rt: Disable RT_RUNTIME_SHARE by default sched/deadline: Fix stale throttling on de-/boosted tasks sched/numa: Use runnable_avg to classify node sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL MAINTAINERS: Add myself as SCHED_DEADLINE reviewer sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h ...
2020-10-07cpufreq: schedutil: Simplify sugov_fast_switch()Rafael J. Wysocki
Drop a redundant local variable definition from sugov_fast_switch() and rearrange the code in there to avoid the redundant logical negation. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-10-05cpufreq: Move traces and update to policy->cur to cpufreq coreViresh Kumar
The cpufreq core handles the updates to policy->cur and recording of cpufreq trace events for all the governors except schedutil's fast switch case. Move that as well to cpufreq core for consistency and readability. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-03sched/deadline: Unthrottle PI boosted threads while enqueuingDaniel Bristot de Oliveira
stress-ng has a test (stress-ng --cyclic) that creates a set of threads under SCHED_DEADLINE with the following parameters:

  dl_runtime  =  10000 (10 us)
  dl_deadline = 100000 (100 us)
  dl_period   = 100000 (100 us)

These parameters are very aggressive. When using a system without HRTICK set, these threads can easily execute longer than the dl_runtime because the throttling happens with 1/HZ resolution. During the main part of the test, the system works just fine because the workload does not try to run over the 10 us. The problem happens at the end of the test, on the exit() path. During exit(), the threads need to do some cleanups that require real-time mutex locks, mainly those related to memory management, resulting in this scenario:

Note: locks are rt_mutexes...

 ------------------------------------------------------------------------
    TASK A:              TASK B:                      TASK C:
    activation
                         activation
                                                      activation

    lock(a): OK!         lock(b): OK!
                         <overrun runtime>
                         lock(a)
                         -> block (task A owns it)
                           -> self notice/set throttled
 +--<                      -> arm replenished timer
 |                       switch-out
 |                                                    lock(b)
 |                                                    -> <C prio > B prio>
 |                                                    -> boost TASK B
 |  unlock(a)                                         switch-out
 |  -> handle lock a to B
 |  -> wakeup(B)
 |     -> B is throttled:
 |        -> do not enqueue
 |     switch-out
 |
 |
 +---------------------> replenishment timer
                         -> TASK B is boosted:
                            -> do not enqueue
 ------------------------------------------------------------------------

BOOM: TASK B is runnable but !enqueued, holding TASK C: the system crashes with hung task C. This problem is avoided by removing the throttle state from the boosted thread while boosting it (by TASK A in the example above), allowing it to be queued and run boosted. The next replenishment will take care of the runtime overrun, pushing the deadline further away. See the "while (dl_se->runtime <= 0)" on replenish_dl_entity() for more information. Reported-by: Mark Simmons <msimmons@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Mark Simmons <msimmons@redhat.com> Link: https://lkml.kernel.org/r/5076e003450835ec74e6fa5917d02c4fa41687e6.1600170294.git.bristot@redhat.com
2020-10-03sched/debug: Add new tracepoint to track cpu_capacityVincent Donnefort
rq->cpu_capacity is a key element in several scheduler parts, such as EAS task placement and load balancing. Tracking this value enables testing and/or debugging by a toolkit. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com
2020-10-03sched/fair: Tweak pick_next_entity()Peter Oskolkov
Currently, pick_next_entity(...) has the following structure (simplified):

  [...]
  if (last_buddy_ok())
          result = last_buddy;
  if (next_buddy_ok())
          result = next_buddy;
  [...]

The intended behavior is to prefer next buddy over last buddy; the current code somewhat obfuscates this, and also wastes cycles checking the last buddy when eventually the next buddy is picked up. So this patch refactors the two 'ifs' above into:

  [...]
  if (next_buddy_ok())
          result = next_buddy;
  else if (last_buddy_ok())
          result = last_buddy;
  [...]

Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guitttot@linaro.org> Link: https://lkml.kernel.org/r/20200930173532.1069092-1-posk@google.com