summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2021-10-07sched,rcu: Rework try_invoke_on_locked_down_task()Peter Zijlstra
Give try_invoke_on_locked_down_task() a saner name and have it return an int so that the caller might distinguish between different reasons of failure. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390 Link: https://lkml.kernel.org/r/20210929152428.649944917@infradead.org
2021-10-07sched: Improve try_invoke_on_locked_down_task()Peter Zijlstra
Clarify and tighten try_invoke_on_locked_down_task(). Basically the function calls @func under task_rq_lock(), except it avoids taking rq->lock when possible. This makes calling @func unconditional (the function will get renamed in a later patch to remove the try). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390 Link: https://lkml.kernel.org/r/20210929152428.589323576@infradead.org
2021-10-06sched: Fix DEBUG && !SCHEDSTATS warnPeter Zijlstra
When !SCHEDSTATS schedstat_enabled() is an unconditional 0 and the whole block doesn't exist, however GCC figures the scoped variable 'stats' is unused and complains about it. Upgrade the warning from -Wunused-variable to -Wunused-but-set-variable by writing it in two statements. This fixes the build because the new warning is in W=1. Given that whole if(0) {} thing, I don't feel motivated to change things overly much and quite strongly feel this is the compiler being daft. Fixes: cb3e971c435d ("sched: Make struct sched_statistics independent of fair sched class") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-10-05sched/fair: Removed useless update of p->recent_used_cpuVincent Guittot
Since commit 89aafd67f28c ("sched/fair: Use prev instead of new target as recent_used_cpu"), p->recent_used_cpu is unconditionnaly set with prev. Fixes: 89aafd67f28c ("sched/fair: Use prev instead of new target as recent_used_cpu") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20210928103544.27489-1-vincent.guittot@linaro.org
2021-10-05sched: Remove pointless preemption disable in sched_submit_work()Thomas Gleixner
Neither wq_worker_sleeping() nor io_wq_worker_sleeping() require to be invoked with preemption disabled: - The worker flag checks operations only need to be serialized against the worker thread itself. - The accounting and worker pool operations are serialized with locks. which means that disabling preemption has neither a reason nor a value. Remove it and update the stale comment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lkml.kernel.org/r/8735pnafj7.ffs@tglx
2021-10-05sched: Move kprobes cleanup out of finish_task_switch()Thomas Gleixner
Doing cleanups in the tail of schedule() is a latency punishment for the incoming task. The point of invoking kprobes_task_flush() for a dead task is that the instances are returned and cannot leak when __schedule() is kprobed. Move it into the delayed cleanup. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210928122411.537994026@linutronix.de
2021-10-05sched: Disable TTWU_QUEUE on RTThomas Gleixner
The queued remote wakeup mechanism has turned out to be suboptimal for RT enabled kernels. The maximum latencies go up by a factor of > 5x in certain scenarious. This is caused by either long wake lists or by a large number of TTWU IPIs which are processed back to back. Disable it for RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210928122411.482262764@linutronix.de
2021-10-05sched: Limit the number of task migrations per batch on RTThomas Gleixner
Batched task migrations are a source for large latencies as they keep the scheduler from running while processing the migrations. Limit the batch size to 8 instead of 32 when running on a RT enabled kernel. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210928122411.425097596@linutronix.de
2021-10-05sched: Move mmdrop to RCU on RTThomas Gleixner
mmdrop() is invoked from finish_task_switch() by the incoming task to drop the mm which was handed over by the previous task. mmdrop() can be quite expensive which prevents an incoming real-time task from getting useful work done. Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT kernels it delagates the eventually required invocation of __mmdrop() to RCU. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210928122411.648582026@linutronix.de
2021-10-05sched: Make cookie functions staticShaokun Zhang
Make cookie functions static as these are no longer invoked directly by other code. No functional change intended. Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210922085735.52812-1-zhangshaokun@hisilicon.com
2021-10-05sched/fair: Consider SMT in ASYM_PACKING load balanceRicardo Neri
When deciding to pull tasks in ASYM_PACKING, it is necessary not only to check for the idle state of the destination CPU, dst_cpu, but also of its SMT siblings. If dst_cpu is idle but its SMT siblings are busy, performance suffers if it pulls tasks from a medium priority CPU that does not have SMT siblings. Implement asym_smt_can_pull_tasks() to inspect the state of the SMT siblings of both dst_cpu and the CPUs in the candidate busiest group. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-7-ricardo.neri-calderon@linux.intel.com
2021-10-05sched/fair: Carve out logic to mark a group for asymmetric packingRicardo Neri
Create a separate function, sched_asym(). A subsequent changeset will introduce logic to deal with SMT in conjunction with asmymmetric packing. Such logic will need the statistics of the scheduling group provided as argument. Update them before calling sched_asym(). Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-6-ricardo.neri-calderon@linux.intel.com
2021-10-05sched/fair: Provide update_sg_lb_stats() with sched domain statisticsRicardo Neri
Before deciding to pull tasks when using asymmetric packing of tasks, on some architectures (e.g., x86) it is necessary to know not only the state of dst_cpu but also of its SMT siblings. The decision to classify a candidate busiest group as group_asym_packing is done in update_sg_lb_stats(). Give this function access to the scheduling domain statistics, which contains the statistics of the local group. Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-5-ricardo.neri-calderon@linux.intel.com
2021-10-05sched/fair: Optimize checking for group_asym_packingRicardo Neri
sched_asmy_prefer() always returns false when called on the local group. By checking local_group, we can avoid additional checks and invoking sched_asmy_prefer() when it is not needed. No functional changes are introduced. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-4-ricardo.neri-calderon@linux.intel.com
2021-10-05sched/topology: Introduce sched_group::flagsRicardo Neri
There exist situations in which the load balance needs to know the properties of the CPUs in a scheduling group. When using asymmetric packing, for instance, the load balancer needs to know not only the state of dst_cpu but also of its SMT siblings, if any. Use the flags of the child scheduling domains to initialize scheduling group flags. This will reflect the properties of the CPUs in the group. A subsequent changeset will make use of these new flags. No functional changes are introduced. Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-3-ricardo.neri-calderon@linux.intel.com
2021-10-05sched: Provide Kconfig support for default dynamic preempt modeFrederic Weisbecker
Currently the boot defined preempt behaviour (aka dynamic preempt) selects full preemption by default when the "preempt=" boot parameter is omitted. However distros may rather want to default to either no preemption or voluntary preemption. To provide with this flexibility, make dynamic preemption a visible Kconfig option and adapt the preemption behaviour selected by the user to either static or dynamic preemption. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210914103134.11309-1-frederic@kernel.org
2021-10-05sched: Remove unused inline function __rq_clock_broken()YueHaibing
These is no caller in tree since commit 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210914095244.52780-1-yuehaibing@huawei.com
2021-10-05sched/dl: Support schedstats for deadline sched classYafang Shao
After we make the struct sched_statistics and the helpers of it independent of fair sched class, we can easily use the schedstats facility for deadline sched class. The schedstat usage in DL sched class is similar with fair sched class, for example, fair deadline enqueue update_stats_enqueue_fair update_stats_enqueue_dl dequeue update_stats_dequeue_fair update_stats_dequeue_dl put_prev_task update_stats_wait_start update_stats_wait_start_dl set_next_task update_stats_wait_end update_stats_wait_end_dl The user can get the schedstats information in the same way in fair sched class. For example, fair deadline /proc/[pid]/sched /proc/[pid]/sched The output of a deadline task's schedstats as follows, $ cat /proc/69662/sched ... se.sum_exec_runtime : 3067.696449 se.nr_migrations : 0 sum_sleep_runtime : 720144.029661 sum_block_runtime : 0.547853 wait_start : 0.000000 sleep_start : 14131540.828955 block_start : 0.000000 sleep_max : 2999.974045 block_max : 0.283637 exec_max : 1.000269 slice_max : 0.000000 wait_max : 0.002217 wait_sum : 0.762179 wait_count : 733 iowait_sum : 0.547853 iowait_count : 3 nr_migrations_cold : 0 nr_failed_migrations_affine : 0 nr_failed_migrations_running : 0 nr_failed_migrations_hot : 0 nr_forced_migrations : 0 nr_wakeups : 246 nr_wakeups_sync : 2 nr_wakeups_migrate : 0 nr_wakeups_local : 244 nr_wakeups_remote : 2 nr_wakeups_affine : 0 nr_wakeups_affine_attempts : 0 nr_wakeups_passive : 0 nr_wakeups_idle : 0 ... The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can be used to trace deadlline tasks as well. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210905143547.4668-9-laoar.shao@gmail.com
2021-10-05sched/dl: Support sched_stat_runtime tracepoint for deadline sched classYafang Shao
The runtime of a DL task has already been there, so we only need to add a tracepoint. One difference between fair task and DL task is that there is no vruntime in dl task. To reuse the sched_stat_runtime tracepoint, '0' is passed as vruntime for DL task. The output of this tracepoint for DL task as follows, top-36462 [047] d.h. 6083.452103: sched_stat_runtime: comm=top pid=36462 runtime=409898 [ns] vruntime=0 [ns] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210905143547.4668-8-laoar.shao@gmail.com
2021-10-05sched/rt: Support schedstats for RT sched classYafang Shao
We want to measure the latency of RT tasks in our production environment with schedstats facility, but currently schedstats is only supported for fair sched class. This patch enable it for RT sched class as well. After we make the struct sched_statistics and the helpers of it independent of fair sched class, we can easily use the schedstats facility for RT sched class. The schedstat usage in RT sched class is similar with fair sched class, for example, fair RT enqueue update_stats_enqueue_fair update_stats_enqueue_rt dequeue update_stats_dequeue_fair update_stats_dequeue_rt put_prev_task update_stats_wait_start update_stats_wait_start_rt set_next_task update_stats_wait_end update_stats_wait_end_rt The user can get the schedstats information in the same way in fair sched class. For example, fair RT /proc/[pid]/sched /proc/[pid]/sched schedstats is not supported for RT group. The output of a RT task's schedstats as follows, $ cat /proc/10349/sched ... sum_sleep_runtime : 972.434535 sum_block_runtime : 960.433522 wait_start : 188510.871584 sleep_start : 0.000000 block_start : 0.000000 sleep_max : 12.001013 block_max : 952.660622 exec_max : 0.049629 slice_max : 0.000000 wait_max : 0.018538 wait_sum : 0.424340 wait_count : 49 iowait_sum : 956.495640 iowait_count : 24 nr_migrations_cold : 0 nr_failed_migrations_affine : 0 nr_failed_migrations_running : 0 nr_failed_migrations_hot : 0 nr_forced_migrations : 0 nr_wakeups : 49 nr_wakeups_sync : 0 nr_wakeups_migrate : 0 nr_wakeups_local : 49 nr_wakeups_remote : 0 nr_wakeups_affine : 0 nr_wakeups_affine_attempts : 0 nr_wakeups_passive : 0 nr_wakeups_idle : 0 ... The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can be used to trace RT tasks as well. The output of these tracepoints for a RT tasks as follows, - runtime stress-10352 [004] d.h. 1035.382286: sched_stat_runtime: comm=stress pid=10352 runtime=995769 [ns] vruntime=0 [ns] [vruntime=0 means it is a RT task] - wait <idle>-0 [004] dN.. 1227.688544: sched_stat_wait: comm=stress pid=10352 delay=46849882 [ns] - blocked kworker/4:1-465 [004] dN.. 1585.676371: sched_stat_blocked: comm=stress pid=17194 delay=189963 [ns] - iowait kworker/4:1-465 [004] dN.. 1585.675330: sched_stat_iowait: comm=stress pid=17189 delay=182848 [ns] - sleep sleep-18194 [023] dN.. 1780.891840: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001160770 [ns] sleep-18196 [023] dN.. 1781.893208: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001161970 [ns] sleep-18197 [023] dN.. 1782.894544: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001128840 [ns] [ In sleep.sh, it sleeps 1 sec each time. ] [lkp@intel.com: reported build failure in earlier version] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210905143547.4668-7-laoar.shao@gmail.com
2021-10-05sched/rt: Support sched_stat_runtime tracepoint for RT sched classYafang Shao
The runtime of a RT task has already been there, so we only need to add a tracepoint. One difference between fair task and RT task is that there is no vruntime in RT task. To reuse the sched_stat_runtime tracepoint, '0' is passed as vruntime for RT task. The output of this tracepoint for RT task as follows, stress-9748 [039] d.h. 113.519352: sched_stat_runtime: comm=stress pid=9748 runtime=997573 [ns] vruntime=0 [ns] stress-9748 [039] d.h. 113.520352: sched_stat_runtime: comm=stress pid=9748 runtime=997627 [ns] vruntime=0 [ns] stress-9748 [039] d.h. 113.521352: sched_stat_runtime: comm=stress pid=9748 runtime=998203 [ns] vruntime=0 [ns] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210905143547.4668-6-laoar.shao@gmail.com
2021-10-05sched: Introduce task block time in schedstatsYafang Shao
Currently in schedstats we have sum_sleep_runtime and iowait_sum, but there's no metric to show how long the task is in D state. Once a task in D state, it means the task is blocked in the kernel, for example the task may be waiting for a mutex. The D state is more frequent than iowait, and it is more critital than S state. So it is worth to add a metric to measure it. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210905143547.4668-5-laoar.shao@gmail.com
2021-10-05sched: Make schedstats helpers independent of fair sched classYafang Shao
The original prototype of the schedstats helpers are update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se) The cfs_rq in these helpers is used to get the rq_clock, and the se is used to get the struct sched_statistics and the struct task_struct. In order to make these helpers available by all sched classes, we can pass the rq, sched_statistics and task_struct directly. Then the new helpers are update_stats_wait_*(struct rq *rq, struct task_struct *p, struct sched_statistics *stats) which are independent of fair sched class. To avoid vmlinux growing too large or introducing ovehead when !schedstat_enabled(), some new helpers after schedstat_enabled() are also introduced, Suggested by Mel. These helpers are in sched/stats.c, __update_stats_wait_*(struct rq *rq, struct task_struct *p, struct sched_statistics *stats) The size of vmlinux as follows, Before After Size of vmlinux 826308552 826304640 The size is a litte smaller as some functions are not inlined again after the change. I also compared the sched performance with 'perf bench sched pipe', suggested by Mel. The result as followsi (in usecs/op), Before After kernel.sched_schedstats=0 5.2~5.4 5.2~5.4 kernel.sched_schedstats=1 5.3~5.5 5.3~5.5 [These data is a little difference with the prev version, that is because my old test machine is destroyed so I have to use a new different test machine.] Almost no difference. No functional change. [lkp@intel.com: reported build failure in prev version] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-4-laoar.shao@gmail.com
2021-10-05sched: Make struct sched_statistics independent of fair sched classYafang Shao
If we want to use the schedstats facility to trace other sched classes, we should make it independent of fair sched class. The struct sched_statistics is the schedular statistics of a task_struct or a task_group. So we can move it into struct task_struct and struct task_group to achieve the goal. After the patch, schestats are orgnized as follows, struct task_struct { ... struct sched_entity se; struct sched_rt_entity rt; struct sched_dl_entity dl; ... struct sched_statistics stats; ... }; Regarding the task group, schedstats is only supported for fair group sched, and a new struct sched_entity_stats is introduced, suggested by Peter - struct sched_entity_stats { struct sched_entity se; struct sched_statistics stats; } __no_randomize_layout; Then with the se in a task_group, we can easily get the stats. The sched_statistics members may be frequently modified when schedstats is enabled, in order to avoid impacting on random data which may in the same cacheline with them, the struct sched_statistics is defined as cacheline aligned. As this patch changes the core struct of scheduler, so I verified the performance it may impact on the scheduler with 'perf bench sched pipe', suggested by Mel. Below is the result, in which all the values are in usecs/op. Before After kernel.sched_schedstats=0 5.2~5.4 5.2~5.4 kernel.sched_schedstats=1 5.3~5.5 5.3~5.5 [These data is a little difference with the earlier version, that is because my old test machine is destroyed so I have to use a new different test machine.] Almost no impact on the sched performance. No functional change. [lkp@intel.com: reported build failure in earlier version] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
2021-10-05sched/fair: Use __schedstat_set() in set_next_entity()Yafang Shao
schedstat_enabled() has been already checked, so we can use __schedstat_set() directly. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-2-laoar.shao@gmail.com
2021-10-05sched/fair: Add cfs bandwidth burst statisticsHuaixin Chang
Two new statistics are introduced to show the internal of burst feature and explain why burst helps or not. nr_bursts: number of periods bandwidth burst occurs burst_time: cumulative wall-time (in nanoseconds) that any cpus has used above quota in respective periods Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210830032215.16302-2-changhuaixin@linux.alibaba.com
2021-10-05sched: adjust sleeper credit for SCHED_IDLE entitiesJosh Don
Give reduced sleeper credit to SCHED_IDLE entities. As a result, woken SCHED_IDLE entities will take longer to preempt normal entities. The benefit of this change is to make it less likely that a newly woken SCHED_IDLE entity will preempt a short-running normal entity before it blocks. We still give a small sleeper credit to SCHED_IDLE entities, so that idle<->idle competition retains some fairness. Example: With HZ=1000, spawned four threads affined to one cpu, one of which was set to SCHED_IDLE. Without this patch, wakeup latency for the SCHED_IDLE thread was ~1-2ms, with the patch the wakeup latency was ~5ms. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Link: https://lore.kernel.org/r/20210820010403.946838-5-joshdon@google.com
2021-10-05sched: reduce sched slice for SCHED_IDLE entitiesJosh Don
Use a small, non-scaled min granularity for SCHED_IDLE entities, when competing with normal entities. This reduces the latency of getting a normal entity back on cpu, at the expense of increased context switch frequency of SCHED_IDLE entities. The benefit of this change is to reduce the round-robin latency for normal entities when competing with a SCHED_IDLE entity. Example: on a machine with HZ=1000, spawned two threads, one of which is SCHED_IDLE, and affined to one cpu. Without this patch, the SCHED_IDLE thread runs for 4ms then waits for 1.4s. With this patch, it runs for 1ms and waits 340ms (as it round-robins with the other thread). Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210820010403.946838-4-joshdon@google.com
2021-10-05sched: Account number of SCHED_IDLE entities on each cfs_rqJosh Don
Adds cfs_rq->idle_nr_running, which accounts the number of idle entities directly enqueued on the cfs_rq. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210820010403.946838-3-joshdon@google.com
2021-10-05sched/core: Simplify core-wide task selectionPeter Zijlstra
Tao suggested a two-pass task selection to avoid the retry loop. Not only does it avoid the retry loop, it results in *much* simpler code. This also fixes an issue spotted by Josh Don where, for SMT3+, we can forget to update max on the first pass and get to do an extra round. Suggested-by: Tao Zhou <tao.zhou@linux.dev> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Reviewed-by: Vineeth Pillai (Microsoft) <vineethrp@gmail.com> Link: https://lkml.kernel.org/r/YSS9+k1teA9oPEKl@hirez.programming.kicks-ass.net
2021-10-05sched: Switch wait_task_inactive to HRTIMER_MODE_REL_HARDSebastian Andrzej Siewior
With PREEMPT_RT enabled all hrtimers callbacks will be invoked in softirq mode unless they are explicitly marked as HRTIMER_MODE_HARD. During boot kthread_bind() is used for the creation of per-CPU threads and then hangs in wait_task_inactive() if the ksoftirqd is not yet up and running. The hang disappeared since commit 26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()") but enabling function trace on boot reliably leads to the freeze on boot behaviour again. The timer in wait_task_inactive() can not be directly used by a user interface to abuse it and create a mass wake up of several tasks at the same time leading to long sections with disabled interrupts. Therefore it is safe to make the timer HRTIMER_MODE_REL_HARD. Switch the timer to HRTIMER_MODE_REL_HARD. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210826170408.vm7rlj7odslshwch@linutronix.de
2021-10-05sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idleValentin Schneider
Consider a system with some NOHZ-idle CPUs, such that nohz.idle_cpus_mask = S nohz.next_balance = T When a new CPU k goes NOHZ idle (nohz_balance_enter_idle()), we end up with: nohz.idle_cpus_mask = S \U {k} nohz.next_balance = T Note that the nohz.next_balance hasn't changed - it won't be updated until a NOHZ balance is triggered. This is problematic if the newly NOHZ idle CPU has an earlier rq.next_balance than the other NOHZ idle CPUs, IOW if: cpu_rq(k).next_balance < nohz.next_balance In such scenarios, the existing nohz.next_balance will prevent any NOHZ balance from happening, which itself will prevent nohz.next_balance from being updated to this new cpu_rq(k).next_balance. Unnecessary load balance delays of over 12ms caused by this were observed on an arm64 RB5 board. Use the new nohz.needs_update flag to mark the presence of newly-idle CPUs that need their rq->next_balance to be collated into nohz.next_balance. Trigger a NOHZ_NEXT_KICK when the flag is set. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210823111700.2842997-3-valentin.schneider@arm.com
2021-10-05sched/fair: Add NOHZ balancer flag for nohz.next_balance updatesValentin Schneider
A following patch will trigger NOHZ idle balances as a means to update nohz.next_balance. Vincent noted that blocked load updates can have non-negligible overhead, which should be avoided if the intent is to only update nohz.next_balance. Add a new NOHZ balance kick flag, NOHZ_NEXT_KICK. Gate NOHZ blocked load update by the presence of NOHZ_STATS_KICK - currently all NOHZ balance kicks will have the NOHZ_STATS_KICK flag set, so no change in behaviour is expected. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210823111700.2842997-2-valentin.schneider@arm.com
2021-10-01sched/fair: Null terminate buffer when updating tunable_scalingMel Gorman
This patch null-terminates the temporary buffer in sched_scaling_write() so kstrtouint() does not return failure and checks the value is valid. Before: $ cat /sys/kernel/debug/sched/tunable_scaling 1 $ echo 0 > /sys/kernel/debug/sched/tunable_scaling -bash: echo: write error: Invalid argument $ cat /sys/kernel/debug/sched/tunable_scaling 1 After: $ cat /sys/kernel/debug/sched/tunable_scaling 1 $ echo 0 > /sys/kernel/debug/sched/tunable_scaling $ cat /sys/kernel/debug/sched/tunable_scaling 0 $ echo 3 > /sys/kernel/debug/sched/tunable_scaling -bash: echo: write error: Invalid argument Fixes: 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210927114635.GH3959@techsingularity.net
2021-10-01sched/fair: Add ancestors of unthrottled undecayed cfs_rqMichal Koutný
Since commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") we add cfs_rqs with no runnable tasks but not fully decayed into the load (leaf) list. We may ignore adding some ancestors and therefore breaking tmp_alone_branch invariant. This broke LTP test cfs_bandwidth01 and it was partially fixed in commit fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling"). I noticed the named test still fails even with the fix (but with low probability, 1 in ~1000 executions of the test). The reason is when bailing out of unthrottle_cfs_rq early, we may miss adding ancestors of the unthrottled cfs_rq, thus, not joining tmp_alone_branch properly. Fix this by adding ancestors if we notice the unthrottled cfs_rq was added to the load list. Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210917153037.11176-1-mkoutny@suse.com
2021-10-01sched: Make RCU nest depth distinct in __might_resched()Thomas Gleixner
For !RT kernels RCU nest depth in __might_resched() is always expected to be 0, but on RT kernels it can be non zero while the preempt count is expected to be always 0. Instead of playing magic games in interpreting the 'preempt_offset' argument, rename it to 'offsets' and use the lower 8 bits for the expected preempt count, allow to hand in the expected RCU nest depth in the upper bits and adopt the __might_resched() code and related checks and printks. The affected call sites are updated in subsequent steps. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.243232823@linutronix.de
2021-10-01sched: Make might_sleep() output less confusingThomas Gleixner
might_sleep() output is pretty informative, but can be confusing at times especially with PREEMPT_RCU when the check triggers due to a voluntary sleep inside a RCU read side critical section: BUG: sleeping function called from invalid context at kernel/test.c:110 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 Preemption disabled at: migrate_disable+0x33/0xa0 in_atomic() is 0, but it still tells that preemption was disabled at migrate_disable(), which is completely useless because preemption is not disabled. But the interesting information to decode the above, i.e. the RCU nesting depth, is not printed. That becomes even more confusing when might_sleep() is invoked from cond_resched_lock() within a RCU read side critical section. Here the expected preemption count is 1 and not 0. BUG: sleeping function called from invalid context at kernel/test.c:131 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 Preemption disabled at: test_cond_lock+0xf3/0x1c0 So in_atomic() is set, which is expected as the caller holds a spinlock, but it's unclear why this is broken and the preempt disable IP is just pointing at the correct place, i.e. spin_lock(), which is obviously not helpful either. Make that more useful in general: - Print preempt_count() and the expected value and for the CONFIG_PREEMPT_RCU case: - Print the RCU read side critical section nesting depth - Print the preempt disable IP only when preempt count does not have the expected value. So the might_sleep() dump from a within a preemptible RCU read side critical section becomes: BUG: sleeping function called from invalid context at kernel/test.c:110 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 and the cond_resched_lock() case becomes: BUG: sleeping function called from invalid context at kernel/test.c:141 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 preempt_count: 1, expected: 1 RCU nest depth: 1, expected: 0 which makes is pretty obvious what's going on. For all other cases the preempt disable IP is still printed as before: BUG: sleeping function called from invalid context at kernel/test.c: 156 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 Preemption disabled at: [<ffffffff82b48326>] test_might_sleep+0xbe/0xf8 BUG: sleeping function called from invalid context at kernel/test.c: 163 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 1, expected: 0 RCU nest depth: 1, expected: 0 Preemption disabled at: [<ffffffff82b48326>] test_might_sleep+0x1e4/0x280 This also prepares to provide a better debugging output for RT enabled kernels and their spinlock substitutions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.181022656@linutronix.de
2021-10-01sched: Cleanup might_sleep() printksThomas Gleixner
Convert them to pr_*(). No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.117496067@linutronix.de
2021-10-01sched: Remove preempt_offset argument from __might_sleep()Thomas Gleixner
All callers hand in 0 and never will hand in anything else. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.054321586@linutronix.de
2021-10-01sched: Clean up the might_sleep() underscore zooThomas Gleixner
__might_sleep() vs. ___might_sleep() is hard to distinguish. Aside of that the three underscore variant is exposed to provide a checkpoint for rescheduling points which are distinct from blocking points. They are semantically a preemption point which means that scheduling is state preserving. A real blocking operation, e.g. mutex_lock(), wait*(), which cannot preserve a task state which is not equal to RUNNING. While technically blocking on a "sleeping" spinlock in RT enabled kernels falls into the voluntary scheduling category because it has to wait until the contended spin/rw lock becomes available, the RT lock substitution code can semantically be mapped to a voluntary preemption because the RT lock substitution code and the scheduler are providing mechanisms to preserve the task state and to take regular non-lock related wakeups into account. Rename ___might_sleep() to __might_resched() to make the distinction of these functions clear. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165357.928693482@linutronix.de
2021-09-30sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=yArd Biesheuvel
THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this causes some issues on architectures that define raw_smp_processor_id() in terms of this field, due to the fact that #include'ing linux/sched.h to get at struct task_struct is problematic in terms of circular dependencies. Given that thread_info and task_struct are the same data structure anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having access to the type definition of struct thread_info is sufficient to reference the CPU number of the current task. Note that this requires THREAD_INFO_IN_TASK's definition of the task_thread_info() helper to be updated, as task_cpu() takes a pointer-to-const, whereas task_thread_info() (which is used to generate lvalues as well), needs a non-const pointer. So make it a macro instead. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
2021-09-29uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argumentEugene Syromiatnikov
Commit 7ac592aa35a684ff ("sched: prctl() core-scheduling interface") made use of enum pid_type in prctl's arg4; this type and the associated enumeration definitions are not exposed to userspace. Christian has suggested to provide additional macro definitions that convey the meaning of the type argument more in alignment with its actual usage, and this patch does exactly that. Link: https://lore.kernel.org/r/20210825170613.GA3884@asgard.redhat.com Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Complements: 7ac592aa35a684ff ("sched: prctl() core-scheduling interface") Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-09-12Merge tag 'sched_urgent_for_v5.15_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Make sure the idle timer expires in hardirq context, on PREEMPT_RT - Make sure the run-queue balance callback is invoked only on the outgoing CPU * tag 'sched_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Prevent balance_push() on remote runqueues sched/idle: Make the idle timer expire in hard interrupt context
2021-09-09sched: Prevent balance_push() on remote runqueuesThomas Gleixner
sched_setscheduler() and rt_mutex_setprio() invoke the run-queue balance callback after changing priorities or the scheduling class of a task. The run-queue for which the callback is invoked can be local or remote. That's not a problem for the regular rq::push_work which is serialized with a busy flag in the run-queue struct, but for the balance_push() work which is only valid to be invoked on the outgoing CPU that's wrong. It not only triggers the debug warning, but also leaves the per CPU variable push_work unprotected, which can result in double enqueues on the stop machine list. Remove the warning and validate that the function is invoked on the outgoing CPU. Fixes: ae7927023243 ("sched: Optimize finish_lock_switch()") Reported-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/87zgt1hdw7.ffs@tglx
2021-09-09sched/idle: Make the idle timer expire in hard interrupt contextSebastian Andrzej Siewior
The intel powerclamp driver will setup a per-CPU worker with RT priority. The worker will then invoke play_idle() in which it remains in the idle poll loop until it is stopped by the timer it started earlier. That timer needs to expire in hard interrupt context on PREEMPT_RT. Otherwise the timer will expire in ksoftirqd as a SOFT timer but that task won't be scheduled on the CPU because its priority is lower than the priority of the worker which is in the idle loop. Always expire the idle timer in hard interrupt context. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210906113034.jgfxrjdvxnjqgtmc@linutronix.de
2021-08-31Merge tag 'pm-5.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These address some PCI device power management issues, add new hardware support to the RAPL power capping driver, add HWP guaranteed performance change notification support to the intel_pstate driver, replace deprecated CPU-hotplug functions in a few places, update CPU PM notifiers to use raw spinlocks, update the PM domains framework (new DT property support, Kconfig fix), do a couple of cleanups in code related to system sleep, and improve the energy model and the schedutil cpufreq governor. Specifics: - Address 3 PCI device power management issues (Rafael Wysocki). - Add Power Limit4 support for Alder Lake to the Intel RAPL power capping driver (Sumeet Pawnikar). - Add HWP guaranteed performance change notification support to the intel_pstate driver (Srinivas Pandruvada). - Replace deprecated CPU-hotplug functions in code related to power management (Sebastian Andrzej Siewior). - Update CPU PM notifiers to use raw spinlocks (Valentin Schneider). - Add support for 'required-opps' DT property to the generic power domains (genpd) framework and use this property for I2C on ARM64 sc7180 (Rajendra Nayak). - Fix Kconfig issue related to genpd (Geert Uytterhoeven). - Increase energy calculation precision in the Energy Model (Lukasz Luba). - Fix kobject deletion in the exit code of the schedutil cpufreq governor (Kevin Hao). - Unmark some functions as kernel-doc in the PM core to avoid false-positive documentation build warnings (Randy Dunlap). - Check RTC features instead of ops in suspend_test Alexandre Belloni)" * tag 'pm-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: domains: Fix domain attach for CONFIG_PM_OPP=n powercap: Add Power Limit4 support for Alder Lake SoC cpufreq: intel_pstate: Process HWP Guaranteed change notification thermal: intel: Allow processing of HWP interrupt notifier: Remove atomic_notifier_call_chain_robust() PM: cpu: Make notifier chain use a raw_spinlock_t PM: sleep: unmark 'state' functions as kernel-doc arm64: dts: sc7180: Add required-opps for i2c PM: domains: Add support for 'required-opps' to set default perf state opp: Don't print an error if required-opps is missing cpufreq: schedutil: Use kobject release() method to free sugov_tunables PM: EM: Increase energy calculation precision PM: sleep: check RTC features instead of ops in suspend_test PM: sleep: s2idle: Replace deprecated CPU-hotplug functions cpufreq: Replace deprecated CPU-hotplug functions powercap: intel_rapl: Replace deprecated CPU-hotplug functions PCI: PM: Enable PME if it can be signaled from D3cold PCI: PM: Avoid forcing PCI_D0 for wakeup reasons inconsistently PCI: Use pci_update_current_state() in pci_enable_device_flags()
2021-08-30Merge tag 'locking-core-2021-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and atomics updates from Thomas Gleixner: "The regular pile: - A few improvements to the mutex code - Documentation updates for atomics to clarify the difference between cmpxchg() and try_cmpxchg() and to explain the forward progress expectations. - Simplification of the atomics fallback generator - The addition of arch_atomic_long*() variants and generic arch_*() bitops based on them. - Add the missing might_sleep() invocations to the down*() operations of semaphores. The PREEMPT_RT locking core: - Scheduler updates to support the state preserving mechanism for 'sleeping' spin- and rwlocks on RT. This mechanism is carefully preserving the state of the task when blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups targeted at the same task into account. The preserved or updated (via a regular wakeup) state is restored when the lock has been acquired. - Restructuring of the rtmutex code so it can be utilized and extended for the RT specific lock variants. - Restructuring of the ww_mutex code to allow sharing of the ww_mutex specific functionality for rtmutex based ww_mutexes. - Header file disentangling to allow substitution of the regular lock implementations with the PREEMPT_RT variants without creating an unmaintainable #ifdef mess. - Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock implementations. Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT implementation is writer unfair because it is infeasible to do priority inheritance on multiple readers. Experience over the years has shown that real-time workloads are not the typical workloads which are sensitive to writer starvation. The alternative solution would be to allow only a single reader which has been tried and discarded as it is a major bottleneck especially for mmap_sem. Aside of that many of the writer starvation critical usage sites have been converted to a writer side mutex/spinlock and RCU read side protections in the past decade so that the issue is less prominent than it used to be. - The actual rtmutex based lock substitutions for PREEMPT_RT enabled kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and rwlock_t. The spin/rw_lock*() functions disable migration across the critical section to preserve the existing semantics vs per-CPU variables. - Rework of the futex REQUEUE_PI mechanism to handle the case of early wake-ups which interleave with a re-queue operation to prevent the situation that a task would be blocked on both the rtmutex associated to the outer futex and the rtmutex based hash bucket spinlock. While this situation cannot happen on !RT enabled kernels the changes make the underlying concurrency problems easier to understand in general. As a result the difference between !RT and RT kernels is reduced to the handling of waiting for the critical section. !RT kernels simply spin-wait as before and RT kernels utilize rcu_wait(). - The substitution of local_lock for PREEMPT_RT with a spinlock which protects the critical section while staying preemptible. The CPU locality is established by disabling migration. The underlying concepts of this code have been in use in PREEMPT_RT for way more than a decade. The code has been refactored several times over the years and this final incarnation has been optimized once again to be as non-intrusive as possible, i.e. the RT specific parts are mostly isolated. It has been extensively tested in the 5.14-rt patch series and it has been verified that !RT kernels are not affected by these changes" * tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits) locking/rtmutex: Return success on deadlock for ww_mutex waiters locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes locking/rtmutex: Dequeue waiter on ww_mutex deadlock locking/rtmutex: Dont dereference waiter lockless locking/semaphore: Add might_sleep() to down_*() family locking/ww_mutex: Initialize waiter.ww_ctx properly static_call: Update API documentation locking/local_lock: Add PREEMPT_RT support locking/spinlock/rt: Prepare for RT local_lock locking/rtmutex: Add adaptive spinwait mechanism locking/rtmutex: Implement equal priority lock stealing preempt: Adjust PREEMPT_LOCK_OFFSET for RT locking/rtmutex: Prevent lockdep false positive with PI futexes futex: Prevent requeue_pi() lock nesting issue on RT futex: Simplify handle_early_requeue_pi_wakeup() futex: Reorder sanity checks in futex_requeue() futex: Clarify comment in futex_requeue() futex: Restructure futex_requeue() futex: Correct the number of requeued waiters for PI futex: Remove bogus condition for requeue PI ...
2021-08-30Merge tag 'sched-core-2021-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - The biggest change in this cycle is scheduler support for asymmetric scheduling affinity, to support the execution of legacy 32-bit tasks on AArch32 systems that also have 64-bit-only CPUs. Architectures can fill in this functionality by defining their own task_cpu_possible_mask(p). When this is done, the scheduler will make sure the task will only be scheduled on CPUs that support it. (The actual arm64 specific changes are not part of this tree.) For other architectures there will be no change in functionality. - Add cgroup SCHED_IDLE support - Increase node-distance flexibility & delay determining it until a CPU is brought online. (This enables platforms where node distance isn't final until the CPU is only.) - Deadline scheduler enhancements & fixes - Misc fixes & cleanups. * tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) eventfd: Make signal recursion protection a task bit sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case sched: Introduce dl_task_check_affinity() to check proposed affinity sched: Allow task CPU affinity to be restricted on asymmetric systems sched: Split the guts of sched_setaffinity() into a helper function sched: Introduce task_struct::user_cpus_ptr to track requested affinity sched: Reject CPU affinity changes based on task_cpu_possible_mask() cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq() cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 sched: Introduce task_cpu_possible_mask() to limit fallback rq selection sched: Cgroup SCHED_IDLE support sched/topology: Skip updating masks for non-online nodes sched: Replace deprecated CPU-hotplug functions. sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS sched: Fix UCLAMP_FLAG_IDLE setting sched/deadline: Fix missing clock update in migrate_task_rq_dl() sched/fair: Avoid a second scan of target in select_idle_cpu sched/fair: Use prev instead of new target as recent_used_cpu sched: Don't report SCHED_FLAG_SUGOV in sched_getattr() ...
2021-08-30Merge branch 'core-rcu.2021.08.28a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: "RCU changes for this cycle were: - Documentation updates - Miscellaneous fixes - Offloaded-callbacks updates - Updates to the nolibc library - Tasks-RCU updates - In-kernel torture-test updates - Torture-test scripting, perhaps most notably the pinning of torture-test guest OSes so as to force differences in memory latency. For example, in a two-socket system, a four-CPU guest OS will have one pair of its CPUs pinned to threads in a single core on one socket and the other pair pinned to threads in a single core on the other socket. This approach proved able to force race conditions that earlier testing missed. Some of these race conditions are still being tracked down" * 'core-rcu.2021.08.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (61 commits) torture: Replace deprecated CPU-hotplug functions. rcu: Replace deprecated CPU-hotplug functions rcu: Print human-readable message for schedule() in RCU reader rcu: Explain why rcu_all_qs() is a stub in preemptible TREE RCU rcu: Use per_cpu_ptr to get the pointer of per_cpu variable rcu: Remove useless "ret" update in rcu_gp_fqs_loop() rcu: Mark accesses in tree_stall.h rcu: Make rcu_gp_init() and rcu_gp_fqs_loop noinline to conserve stack rcu: Mark lockless ->qsmask read in rcu_check_boost_fail() srcutiny: Mark read-side data races rcu: Start timing stall repetitions after warning complete rcu: Do not disable GP stall detection in rcu_cpu_stall_reset() rcu/tree: Handle VM stoppage in stall detection rculist: Unify documentation about missing list_empty_rcu() rcu: Mark accesses to ->rcu_read_lock_nesting rcu: Weaken ->dynticks accesses and updates rcu: Remove special bit at the bottom of the ->dynticks counter rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock rcu: Fix to include first blocked task in stall warning torture: Make kvm-test-1-run-qemu.sh check for reboot loops ...
2021-08-26sched: Fix get_push_task() vs migrate_disable()Sebastian Andrzej Siewior
push_rt_task() attempts to move the currently running task away if the next runnable task has migration disabled and therefore is pinned on the current CPU. The current task is retrieved via get_push_task() which only checks for nr_cpus_allowed == 1, but does not check whether the task has migration disabled and therefore cannot be moved either. The consequence is a pointless invocation of the migration thread which correctly observes that the task cannot be moved. Return NULL if the task has migration disabled and cannot be moved to another CPU. Fixes: a7c81556ec4d3 ("sched: Fix migrate_disable() vs rt/dl balancing") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210826133738.yiotqbtdaxzjsnfj@linutronix.de