summaryrefslogtreecommitdiff
path: root/kernel/sched/fair.c
AgeCommit message (Collapse)Author
2021-11-11sched/fair: Prevent dead task groups from regaining cfs_rq'sMathias Krause
Kevin is reporting crashes which point to a use-after-free of a cfs_rq in update_blocked_averages(). Initial debugging revealed that we've live cfs_rq's (on_list=1) in an about to be kfree()'d task group in free_fair_sched_group(). However, it was unclear how that can happen. His kernel config happened to lead to a layout of struct sched_entity that put the 'my_q' member directly into the middle of the object which makes it incidentally overlap with SLUB's freelist pointer. That, in combination with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a reliable access violation in form of a #GP which made the UAF fail fast. Michal seems to have run into the same issue[1]. He already correctly diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") is causing the preconditions for the UAF to happen by re-adding cfs_rq's also to task groups that have no more running tasks, i.e. also to dead ones. His analysis, however, misses the real root cause and it cannot be seen from the crash backtrace only, as the real offender is tg_unthrottle_up() getting called via sched_cfs_period_timer() via the timer interrupt at an inconvenient time. When unregister_fair_sched_group() unlinks all cfs_rq's from the dying task group, it doesn't protect itself from getting interrupted. If the timer interrupt triggers while we iterate over all CPUs or after unregister_fair_sched_group() has finished but prior to unlinking the task group, sched_cfs_period_timer() will execute and walk the list of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the dying task group. These will later -- in free_fair_sched_group() -- be kfree()'ed while still being linked, leading to the fireworks Kevin and Michal are seeing. To fix this race, ensure the dying task group gets unlinked first. However, simply switching the order of unregistering and unlinking the task group isn't sufficient, as concurrent RCU walkers might still see it, as can be seen below: CPU1: CPU2: : timer IRQ: : do_sched_cfs_period_timer(): : : : distribute_cfs_runtime(): : rcu_read_lock(); : : : unthrottle_cfs_rq(): sched_offline_group(): : : walk_tg_tree_from(…,tg_unthrottle_up,…): list_del_rcu(&tg->list); : (1) : list_for_each_entry_rcu(child, &parent->children, siblings) : : (2) list_del_rcu(&tg->siblings); : : tg_unthrottle_up(): unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; : : list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); : : : : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running) (3) : list_add_leaf_cfs_rq(cfs_rq); : : : : : : : : : : (4) : rcu_read_unlock(); CPU 2 walks the task group list in parallel to sched_offline_group(), specifically, it'll read the soon to be unlinked task group entry at (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of these at (3), which is the cause of the UAF later on. To prevent this additional race from happening, we need to wait until walk_tg_tree_from() has finished traversing the task groups, i.e. after the RCU read critical section ends in (4). Afterwards we're safe to call unregister_fair_sched_group(), as each new walk won't see the dying task group any more. On top of that, we need to wait yet another RCU grace period after unregister_fair_sched_group() to ensure print_cfs_stats(), which might run concurrently, always sees valid objects, i.e. not already free'd ones. This patch survives Michal's reproducer[2] for 8h+ now, which used to trigger within minutes before. [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/ [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/ Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") [peterz: shuffle code around a bit] Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-10-31sched/fair: Cleanup newidle_balanceVincent Guittot
update_next_balance() uses sd->last_balance which is not modified by load_balance() so we can merge the 2 calls in one place. No functional change Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-6-vincent.guittot@linaro.org
2021-10-31sched/fair: Remove sysctl_sched_migration_cost conditionVincent Guittot
With a default value of 500us, sysctl_sched_migration_cost is significanlty higher than the cost of load_balance. Remove the condition and rely on the sd->max_newidle_lb_cost to abort newidle_balance. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-5-vincent.guittot@linaro.org
2021-10-31sched/fair: Wait before decaying max_newidle_lb_costVincent Guittot
Decay max_newidle_lb_cost only when it has not been updated for a while and ensure to not decay a recently changed value. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org
2021-10-31sched/fair: Skip update_blocked_averages if we are defering load balanceVincent Guittot
In newidle_balance(), the scheduler skips load balance to the new idle cpu when the 1st sd of this_rq is: this_rq->avg_idle < sd->max_newidle_lb_cost Doing a costly call to update_blocked_averages() will not be useful and simply adds overhead when this condition is true. Check the condition early in newidle_balance() to skip update_blocked_averages() when possible. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-3-vincent.guittot@linaro.org
2021-10-31sched/fair: Account update_blocked_averages in newidle_balance costVincent Guittot
The time spent to update the blocked load can be significant depending of the complexity fo the cgroup hierarchy. Take this time into account in the cost of the 1st load balance of a newly idle cpu. Also reduce the number of call to sched_clock_cpu() and track more actual work. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-2-vincent.guittot@linaro.org
2021-10-14sched/numa: Fix a few commentsBharata B Rao
Fix a few comments to help understand them better. Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20211004105706.3669-4-bharata@amd.com
2021-10-14sched/numa: Remove the redundant member numa_group::fault_cpusBharata B Rao
numa_group::fault_cpus is actually a pointer to the region in numa_group::faults[] where NUMA_CPU stats are located. Remove this redundant member and use numa_group::faults[NUMA_CPU] directly like it is done for similar per-process numa fault stats. There is no functionality change due to this commit. Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20211004105706.3669-3-bharata@amd.com
2021-10-14sched/numa: Replace hard-coded number by a define in numa_task_group()Bharata B Rao
While allocating group fault stats, task_numa_group() is using a hard coded number 4. Replace this by NR_NUMA_HINT_FAULT_STATS. No functionality change in this commit. Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20211004105706.3669-2-bharata@amd.com
2021-10-05sched/fair: Removed useless update of p->recent_used_cpuVincent Guittot
Since commit 89aafd67f28c ("sched/fair: Use prev instead of new target as recent_used_cpu"), p->recent_used_cpu is unconditionnaly set with prev. Fixes: 89aafd67f28c ("sched/fair: Use prev instead of new target as recent_used_cpu") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20210928103544.27489-1-vincent.guittot@linaro.org
2021-10-05sched/fair: Consider SMT in ASYM_PACKING load balanceRicardo Neri
When deciding to pull tasks in ASYM_PACKING, it is necessary not only to check for the idle state of the destination CPU, dst_cpu, but also of its SMT siblings. If dst_cpu is idle but its SMT siblings are busy, performance suffers if it pulls tasks from a medium priority CPU that does not have SMT siblings. Implement asym_smt_can_pull_tasks() to inspect the state of the SMT siblings of both dst_cpu and the CPUs in the candidate busiest group. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-7-ricardo.neri-calderon@linux.intel.com
2021-10-05sched/fair: Carve out logic to mark a group for asymmetric packingRicardo Neri
Create a separate function, sched_asym(). A subsequent changeset will introduce logic to deal with SMT in conjunction with asmymmetric packing. Such logic will need the statistics of the scheduling group provided as argument. Update them before calling sched_asym(). Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-6-ricardo.neri-calderon@linux.intel.com
2021-10-05sched/fair: Provide update_sg_lb_stats() with sched domain statisticsRicardo Neri
Before deciding to pull tasks when using asymmetric packing of tasks, on some architectures (e.g., x86) it is necessary to know not only the state of dst_cpu but also of its SMT siblings. The decision to classify a candidate busiest group as group_asym_packing is done in update_sg_lb_stats(). Give this function access to the scheduling domain statistics, which contains the statistics of the local group. Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-5-ricardo.neri-calderon@linux.intel.com
2021-10-05sched/fair: Optimize checking for group_asym_packingRicardo Neri
sched_asmy_prefer() always returns false when called on the local group. By checking local_group, we can avoid additional checks and invoking sched_asmy_prefer() when it is not needed. No functional changes are introduced. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-4-ricardo.neri-calderon@linux.intel.com
2021-10-05sched: Make schedstats helpers independent of fair sched classYafang Shao
The original prototype of the schedstats helpers are update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se) The cfs_rq in these helpers is used to get the rq_clock, and the se is used to get the struct sched_statistics and the struct task_struct. In order to make these helpers available by all sched classes, we can pass the rq, sched_statistics and task_struct directly. Then the new helpers are update_stats_wait_*(struct rq *rq, struct task_struct *p, struct sched_statistics *stats) which are independent of fair sched class. To avoid vmlinux growing too large or introducing ovehead when !schedstat_enabled(), some new helpers after schedstat_enabled() are also introduced, Suggested by Mel. These helpers are in sched/stats.c, __update_stats_wait_*(struct rq *rq, struct task_struct *p, struct sched_statistics *stats) The size of vmlinux as follows, Before After Size of vmlinux 826308552 826304640 The size is a litte smaller as some functions are not inlined again after the change. I also compared the sched performance with 'perf bench sched pipe', suggested by Mel. The result as followsi (in usecs/op), Before After kernel.sched_schedstats=0 5.2~5.4 5.2~5.4 kernel.sched_schedstats=1 5.3~5.5 5.3~5.5 [These data is a little difference with the prev version, that is because my old test machine is destroyed so I have to use a new different test machine.] Almost no difference. No functional change. [lkp@intel.com: reported build failure in prev version] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-4-laoar.shao@gmail.com
2021-10-05sched: Make struct sched_statistics independent of fair sched classYafang Shao
If we want to use the schedstats facility to trace other sched classes, we should make it independent of fair sched class. The struct sched_statistics is the schedular statistics of a task_struct or a task_group. So we can move it into struct task_struct and struct task_group to achieve the goal. After the patch, schestats are orgnized as follows, struct task_struct { ... struct sched_entity se; struct sched_rt_entity rt; struct sched_dl_entity dl; ... struct sched_statistics stats; ... }; Regarding the task group, schedstats is only supported for fair group sched, and a new struct sched_entity_stats is introduced, suggested by Peter - struct sched_entity_stats { struct sched_entity se; struct sched_statistics stats; } __no_randomize_layout; Then with the se in a task_group, we can easily get the stats. The sched_statistics members may be frequently modified when schedstats is enabled, in order to avoid impacting on random data which may in the same cacheline with them, the struct sched_statistics is defined as cacheline aligned. As this patch changes the core struct of scheduler, so I verified the performance it may impact on the scheduler with 'perf bench sched pipe', suggested by Mel. Below is the result, in which all the values are in usecs/op. Before After kernel.sched_schedstats=0 5.2~5.4 5.2~5.4 kernel.sched_schedstats=1 5.3~5.5 5.3~5.5 [These data is a little difference with the earlier version, that is because my old test machine is destroyed so I have to use a new different test machine.] Almost no impact on the sched performance. No functional change. [lkp@intel.com: reported build failure in earlier version] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
2021-10-05sched/fair: Use __schedstat_set() in set_next_entity()Yafang Shao
schedstat_enabled() has been already checked, so we can use __schedstat_set() directly. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-2-laoar.shao@gmail.com
2021-10-05sched/fair: Add cfs bandwidth burst statisticsHuaixin Chang
Two new statistics are introduced to show the internal of burst feature and explain why burst helps or not. nr_bursts: number of periods bandwidth burst occurs burst_time: cumulative wall-time (in nanoseconds) that any cpus has used above quota in respective periods Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210830032215.16302-2-changhuaixin@linux.alibaba.com
2021-10-05sched: adjust sleeper credit for SCHED_IDLE entitiesJosh Don
Give reduced sleeper credit to SCHED_IDLE entities. As a result, woken SCHED_IDLE entities will take longer to preempt normal entities. The benefit of this change is to make it less likely that a newly woken SCHED_IDLE entity will preempt a short-running normal entity before it blocks. We still give a small sleeper credit to SCHED_IDLE entities, so that idle<->idle competition retains some fairness. Example: With HZ=1000, spawned four threads affined to one cpu, one of which was set to SCHED_IDLE. Without this patch, wakeup latency for the SCHED_IDLE thread was ~1-2ms, with the patch the wakeup latency was ~5ms. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Link: https://lore.kernel.org/r/20210820010403.946838-5-joshdon@google.com
2021-10-05sched: reduce sched slice for SCHED_IDLE entitiesJosh Don
Use a small, non-scaled min granularity for SCHED_IDLE entities, when competing with normal entities. This reduces the latency of getting a normal entity back on cpu, at the expense of increased context switch frequency of SCHED_IDLE entities. The benefit of this change is to reduce the round-robin latency for normal entities when competing with a SCHED_IDLE entity. Example: on a machine with HZ=1000, spawned two threads, one of which is SCHED_IDLE, and affined to one cpu. Without this patch, the SCHED_IDLE thread runs for 4ms then waits for 1.4s. With this patch, it runs for 1ms and waits 340ms (as it round-robins with the other thread). Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210820010403.946838-4-joshdon@google.com
2021-10-05sched: Account number of SCHED_IDLE entities on each cfs_rqJosh Don
Adds cfs_rq->idle_nr_running, which accounts the number of idle entities directly enqueued on the cfs_rq. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210820010403.946838-3-joshdon@google.com
2021-10-05sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idleValentin Schneider
Consider a system with some NOHZ-idle CPUs, such that nohz.idle_cpus_mask = S nohz.next_balance = T When a new CPU k goes NOHZ idle (nohz_balance_enter_idle()), we end up with: nohz.idle_cpus_mask = S \U {k} nohz.next_balance = T Note that the nohz.next_balance hasn't changed - it won't be updated until a NOHZ balance is triggered. This is problematic if the newly NOHZ idle CPU has an earlier rq.next_balance than the other NOHZ idle CPUs, IOW if: cpu_rq(k).next_balance < nohz.next_balance In such scenarios, the existing nohz.next_balance will prevent any NOHZ balance from happening, which itself will prevent nohz.next_balance from being updated to this new cpu_rq(k).next_balance. Unnecessary load balance delays of over 12ms caused by this were observed on an arm64 RB5 board. Use the new nohz.needs_update flag to mark the presence of newly-idle CPUs that need their rq->next_balance to be collated into nohz.next_balance. Trigger a NOHZ_NEXT_KICK when the flag is set. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210823111700.2842997-3-valentin.schneider@arm.com
2021-10-05sched/fair: Add NOHZ balancer flag for nohz.next_balance updatesValentin Schneider
A following patch will trigger NOHZ idle balances as a means to update nohz.next_balance. Vincent noted that blocked load updates can have non-negligible overhead, which should be avoided if the intent is to only update nohz.next_balance. Add a new NOHZ balance kick flag, NOHZ_NEXT_KICK. Gate NOHZ blocked load update by the presence of NOHZ_STATS_KICK - currently all NOHZ balance kicks will have the NOHZ_STATS_KICK flag set, so no change in behaviour is expected. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210823111700.2842997-2-valentin.schneider@arm.com
2021-10-01sched/fair: Add ancestors of unthrottled undecayed cfs_rqMichal Koutný
Since commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") we add cfs_rqs with no runnable tasks but not fully decayed into the load (leaf) list. We may ignore adding some ancestors and therefore breaking tmp_alone_branch invariant. This broke LTP test cfs_bandwidth01 and it was partially fixed in commit fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling"). I noticed the named test still fails even with the fix (but with low probability, 1 in ~1000 executions of the test). The reason is when bailing out of unthrottle_cfs_rq early, we may miss adding ancestors of the unthrottled cfs_rq, thus, not joining tmp_alone_branch properly. Fix this by adding ancestors if we notice the unthrottled cfs_rq was added to the load list. Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210917153037.11176-1-mkoutny@suse.com
2021-08-30Merge tag 'sched-core-2021-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - The biggest change in this cycle is scheduler support for asymmetric scheduling affinity, to support the execution of legacy 32-bit tasks on AArch32 systems that also have 64-bit-only CPUs. Architectures can fill in this functionality by defining their own task_cpu_possible_mask(p). When this is done, the scheduler will make sure the task will only be scheduled on CPUs that support it. (The actual arm64 specific changes are not part of this tree.) For other architectures there will be no change in functionality. - Add cgroup SCHED_IDLE support - Increase node-distance flexibility & delay determining it until a CPU is brought online. (This enables platforms where node distance isn't final until the CPU is only.) - Deadline scheduler enhancements & fixes - Misc fixes & cleanups. * tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) eventfd: Make signal recursion protection a task bit sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case sched: Introduce dl_task_check_affinity() to check proposed affinity sched: Allow task CPU affinity to be restricted on asymmetric systems sched: Split the guts of sched_setaffinity() into a helper function sched: Introduce task_struct::user_cpus_ptr to track requested affinity sched: Reject CPU affinity changes based on task_cpu_possible_mask() cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq() cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 sched: Introduce task_cpu_possible_mask() to limit fallback rq selection sched: Cgroup SCHED_IDLE support sched/topology: Skip updating masks for non-online nodes sched: Replace deprecated CPU-hotplug functions. sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS sched: Fix UCLAMP_FLAG_IDLE setting sched/deadline: Fix missing clock update in migrate_task_rq_dl() sched/fair: Avoid a second scan of target in select_idle_cpu sched/fair: Use prev instead of new target as recent_used_cpu sched: Don't report SCHED_FLAG_SUGOV in sched_getattr() ...
2021-08-26sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED caseIngo Molnar
It's not actually used in the !CONFIG_FAIR_GROUP_SCHED case: kernel/sched/fair.c:488:12: warning: ‘tg_is_idle’ defined but not used [-Wunused-function] Keep around a placeholder nevertheless, for API completeness. Mark it inline, so the compiler doesn't think it must be used. Fixes: 304000390f88: ("sched: Cgroup SCHED_IDLE support") Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Josh Don <joshdon@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org>
2021-08-20sched: Cgroup SCHED_IDLE supportJosh Don
This extends SCHED_IDLE to cgroups. Interface: cgroup/cpu.idle. 0: default behavior 1: SCHED_IDLE Extending SCHED_IDLE to cgroups means that we incorporate the existing aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its descendant threads towards the idle_h_nr_running count of all of its ancestor cgroups. Thus, sched_idle_rq() will work properly. Additionally, SCHED_IDLE cgroups are configured with minimum weight. There are two key differences between the per-task and per-cgroup SCHED_IDLE interface: - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to maintain their relative weights. The entity that is "idle" is the cgroup, not the tasks themselves. - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption decision is not made by comparing the current task with the woken task, but rather by comparing their matching sched_entity. A typical use-case for this is a user that creates an idle and a non-idle subtree. The non-idle subtree will dominate competition vs the idle subtree, but the idle subtree will still be high priority vs other users on the system. The latter is accomplished via comparing matching sched_entity in the waken preemption path (this could also be improved by making the sched_idle_rq() decision dependent on the perspective of a specific task). For now, we maintain the existing SCHED_IDLE semantics. Future patches may make improvements that extend how we treat SCHED_IDLE entities. The per-task_group idle field is an integer that currently only holds either a 0 or a 1. This is explicitly typed as an integer to allow for further extensions to this API. For example, a negative value may indicate a highly latency-sensitive cgroup that should be preferred for preemption/placement/etc. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210730020019.1487127-2-joshdon@google.com
2021-08-04sched/fair: Avoid a second scan of target in select_idle_cpuMel Gorman
When select_idle_cpu starts scanning for an idle CPU, it starts with a target CPU that has already been checked by select_idle_sibling. This patch starts with the next CPU instead. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210804115857.6253-3-mgorman@techsingularity.net
2021-08-04sched/fair: Use prev instead of new target as recent_used_cpuMel Gorman
After select_idle_sibling, p->recent_used_cpu is set to the new target. However on the next wakeup, prev will be the same as recent_used_cpu unless the load balancer has moved the task since the last wakeup. It still works, but is less efficient than it could be. This patch preserves recent_used_cpu for longer. The impact on SIS efficiency is tiny so the SIS statistic patches were used to track the hit rate for using recent_used_cpu. With perf bench pipe on a 2-socket Cascadelake machine, the hit rate went from 57.14% to 85.32%. For more intensive wakeup loads like hackbench, the hit rate is almost negligible but rose from 0.21% to 6.64%. For scaling loads like tbench, the hit rate goes from almost 0% to 25.42% overall. Broadly speaking, on tbench, the success rate is much higher for lower thread counts and drops to almost 0 as the workload scales to towards saturation. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210804115857.6253-2-mgorman@techsingularity.net
2021-08-04sched/numa: Fix is_core_idle()Mika Penttilä
Use the loop variable instead of the function argument to test the other SMT siblings for idle. Fixes: ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") Signed-off-by: Mika Penttilä <mika.penttila@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Pankaj Gupta <pankaj.gupta@ionos.com> Link: https://lkml.kernel.org/r/20210722063946.28951-1-mika.penttila@gmail.com
2021-07-11Merge tag 'sched-urgent-2021-07-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Three fixes: - Fix load tracking bug/inconsistency - Fix a sporadic CFS bandwidth constraints enforcement bug - Fix a uclamp utilization tracking bug for newly woken tasks" * tag 'sched-urgent-2021-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/uclamp: Ignore max aggregation if rq is idle sched/fair: Fix CFS bandwidth hrtimer expiry type sched/fair: Sync load_sum with load_avg after dequeue
2021-07-02sched/fair: Fix CFS bandwidth hrtimer expiry typeOdin Ugedal
The time remaining until expiry of the refresh_timer can be negative. Casting the type to an unsigned 64-bit value will cause integer underflow, making the runtime_refresh_within return false instead of true. These situations are rare, but they do happen. This does not cause user-facing issues or errors; other than possibly unthrottling cfs_rq's using runtime from the previous period(s), making the CFS bandwidth enforcement less strict in those (special) situations. Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lore.kernel.org/r/20210629121452.18429-1-odin@uged.al
2021-07-02sched/fair: Sync load_sum with load_avg after dequeueVincent Guittot
commit 9e077b52d86a ("sched/pelt: Check that *_avg are null when *_sum are") reported some inconsitencies between *_avg and *_sum. commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") fixed some but one remains when dequeuing load. sync the cfs's load_sum with its load_avg after dequeuing the load of a sched_entity. Fixes: 9e077b52d86a ("sched/pelt: Check that *_avg are null when *_sum are") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Odin Ugedal <odin@uged.al> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Link: https://lore.kernel.org/r/20210701171837.32156-1-vincent.guittot@linaro.org
2021-06-30Merge tag 'sched-urgent-2021-06-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix a small inconsistency (bug) in load tracking, caught by a new warning that several people reported. - Flip CONFIG_SCHED_CORE to default-disabled, and update the Kconfig help text. * tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Disable CONFIG_SCHED_CORE by default sched/fair: Ensure _sum and _avg values stay consistent
2021-06-28Merge tag 'sched-core-2021-06-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler udpates from Ingo Molnar: - Changes to core scheduling facilities: - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables coordinated scheduling across SMT siblings. This is a much requested feature for cloud computing platforms, to allow the flexible utilization of SMT siblings, without exposing untrusted domains to information leaks & side channels, plus to ensure more deterministic computing performance on SMT systems used by heterogenous workloads. There are new prctls to set core scheduling groups, which allows more flexible management of workloads that can share siblings. - Fix task->state access anti-patterns that may result in missed wakeups and rename it to ->__state in the process to catch new abuses. - Load-balancing changes: - Tweak newidle_balance for fair-sched, to improve 'memcache'-like workloads. - "Age" (decay) average idle time, to better track & improve workloads such as 'tbench'. - Fix & improve energy-aware (EAS) balancing logic & metrics. - Fix & improve the uclamp metrics. - Fix task migration (taskset) corner case on !CONFIG_CPUSET. - Fix RT and deadline utilization tracking across policy changes - Introduce a "burstable" CFS controller via cgroups, which allows bursty CPU-bound workloads to borrow a bit against their future quota to improve overall latencies & batching. Can be tweaked via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us. - Rework assymetric topology/capacity detection & handling. - Scheduler statistics & tooling: - Disable delayacct by default, but add a sysctl to enable it at runtime if tooling needs it. Use static keys and other optimizations to make it more palatable. - Use sched_clock() in delayacct, instead of ktime_get_ns(). - Misc cleanups and fixes. * tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) sched/doc: Update the CPU capacity asymmetry bits sched/topology: Rework CPU capacity asymmetry detection sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag psi: Fix race between psi_trigger_create/destroy sched/fair: Introduce the burstable CFS controller sched/uclamp: Fix uclamp_tg_restrict() sched/rt: Fix Deadline utilization tracking during policy change sched/rt: Fix RT utilization tracking during policy change sched: Change task_struct::state sched,arch: Remove unused TASK_STATE offsets sched,timer: Use __set_current_state() sched: Add get_current_state() sched,perf,kvm: Fix preemption condition sched: Introduce task_is_running() sched: Unbreak wakeups sched/fair: Age the average idle time sched/cpufreq: Consider reduced CPU capacity in energy calculation sched/fair: Take thermal pressure into account while estimating energy thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure sched/fair: Return early from update_tg_cfs_load() if delta == 0 ...
2021-06-28sched: Optimize housekeeping_cpumask() in for_each_cpu_and()Yuan ZhaoXiong
On a 128 cores AMD machine, there are 8 cores in nohz_full mode, and the others are used for housekeeping. When many housekeeping cpus are in idle state, we can observe huge time burn in the loop for searching nearest busy housekeeper cpu by ftrace. 9) | get_nohz_timer_target() { 9) | housekeeping_test_cpu() { 9) 0.390 us | housekeeping_get_mask.part.1(); 9) 0.561 us | } 9) 0.090 us | __rcu_read_lock(); 9) 0.090 us | housekeeping_cpumask(); 9) 0.521 us | housekeeping_cpumask(); 9) 0.140 us | housekeeping_cpumask(); ... 9) 0.500 us | housekeeping_cpumask(); 9) | housekeeping_any_cpu() { 9) 0.090 us | housekeeping_get_mask.part.1(); 9) 0.100 us | sched_numa_find_closest(); 9) 0.491 us | } 9) 0.100 us | __rcu_read_unlock(); 9) + 76.163 us | } for_each_cpu_and() is a micro function, so in get_nohz_timer_target() function the for_each_cpu_and(i, sched_domain_span(sd), housekeeping_cpumask(HK_FLAG_TIMER)) equals to below: for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd), housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;) That will cause that housekeeping_cpumask() will be invoked many times. The housekeeping_cpumask() function returns a const value, so it is unnecessary to invoke it every time. This patch can minimize the worst searching time from ~76us to ~16us in my testing. Similarly, the find_new_ilb() function has the same problem. Co-developed-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1622985115-51007-1-git-send-email-yuanzhaoxiong@baidu.com
2021-06-28sched/fair: Ensure _sum and _avg values stay consistentOdin Ugedal
The _sum and _avg values are in general sync together with the PELT divider. They are however not always completely in perfect sync, resulting in situations where _sum gets to zero while _avg stays positive. Such situations are undesirable. This comes from the fact that PELT will increase period_contrib, also increasing the PELT divider, without updating _sum and _avg values to stay in perfect sync where (_sum == _avg * divider). However, such PELT change will never lower _sum, making it impossible to end up in a situation where _sum is zero and _avg is not. Therefore, we need to ensure that when subtracting load outside PELT, that when _sum is zero, _avg is also set to zero. This occurs when (_sum < _avg * divider), and the subtracted (_avg * divider) is bigger or equal to the current _sum, while the subtracted _avg is smaller than the current _avg. Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Link: https://lore.kernel.org/r/20210624111815.57937-1-odin@uged.al
2021-06-24sched/fair: Introduce the burstable CFS controllerHuaixin Chang
The CFS bandwidth controller limits CPU requests of a task group to quota during each period. However, parallel workloads might be bursty so that they get throttled even when their average utilization is under quota. And they are latency sensitive at the same time so that throttling them is undesired. We borrow time now against our future underrun, at the cost of increased interference against the other system users. All nicely bounded. Traditional (UP-EDF) bandwidth control is something like: (U = \Sum u_i) <= 1 This guaranteeds both that every deadline is met and that the system is stable. After all, if U were > 1, then for every second of walltime, we'd have to run more than a second of program time, and obviously miss our deadline, but the next deadline will be further out still, there is never time to catch up, unbounded fail. This work observes that a workload doesn't always executes the full quota; this enables one to describe u_i as a statistical distribution. For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100) (the traditional WCET). This effectively allows u to be smaller, increasing the efficiency (we can pack more tasks in the system), but at the cost of missing deadlines when all the odds line up. However, it does maintain stability, since every overrun must be paired with an underrun as long as our x is above the average. That is, suppose we have 2 tasks, both specify a p(95) value, then we have a p(95)*p(95) = 90.25% chance both tasks are within their quota and everything is good. At the same time we have a p(5)p(5) = 0.25% chance both tasks will exceed their quota at the same time (guaranteed deadline fail). Somewhere in between there's a threshold where one exceeds and the other doesn't underrun enough to compensate; this depends on the specific CDFs. At the same time, we can say that the worst case deadline miss, will be \Sum e_i; that is, there is a bounded tardiness (under the assumption that x+e is indeed WCET). The benefit of burst is seen when testing with schbench. Default value of kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used. mkdir /sys/fs/cgroup/cpu/test echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us ./schbench -m 1 -t 3 -r 20 -c 80000 -R 10 The average CPU usage is at 80%. I run this for 10 times, and got long tail latency for 6 times and got throttled for 8 times. Tail latencies are shown below, and it wasn't the worst case. Latency percentiles (usec) 50.0000th: 19872 75.0000th: 21344 90.0000th: 22176 95.0000th: 22496 *99.0000th: 22752 99.5000th: 22752 99.9000th: 22752 min=0, max=22727 rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44% The interferenece when using burst is valued by the possibilities for missing the deadline and the average WCET. Test results showed that when there many cgroups or CPU is under utilized, the interference is limited. More details are shown in: https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/ Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
2021-06-22sched/fair: Ensure that the CFS parent is added after unthrottlingRik van Riel
Ensure that a CFS parent will be in the list whenever one of its children is also in the list. A warning on rq->tmp_alone_branch != &rq->leaf_cfs_rq_list has been reported while running LTP test cfs_bandwidth01. Odin Ugedal found the root cause: $ tree /sys/fs/cgroup/ltp/ -d --charset=ascii /sys/fs/cgroup/ltp/ |-- drain `-- test-6851 `-- level2 |-- level3a | |-- worker1 | `-- worker2 `-- level3b `-- worker3 Timeline (ish): - worker3 gets throttled - level3b is decayed, since it has no more load - level2 get throttled - worker3 get unthrottled - level2 get unthrottled - worker3 is added to list - level3b is not added to list, since nr_running==0 and is decayed [ Vincent Guittot: Rebased and updated to fix for the reported warning. ] Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Acked-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210621174330.11258-1-vincent.guittot@linaro.org
2021-06-18sched: Change task_struct::statePeter Zijlstra
Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18Merge branch 'sched/urgent' into sched/core, to resolve conflictsIngo Molnar
This commit in sched/urgent moved the cfs_rq_is_decayed() function: a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and this fresh commit in sched/core modified it in the old location: 9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are") Merge the two variants. Conflicts: kernel/sched/fair.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-17sched/fair: Age the average idle timePeter Zijlstra
This is a partial forward-port of Peter Ziljstra's work first posted at: https://lore.kernel.org/lkml/20180530142236.667774973@infradead.org/ Currently select_idle_cpu()'s proportional scheme uses the average idle time *for when we are idle*, that is temporally challenged. When a CPU is not at all idle, we'll happily continue using whatever value we did see when the CPU goes idle. To fix this, introduce a separate average idle and age it (the existing value still makes sense for things like new-idle balancing, which happens when we do go idle). The overall goal is to not spend more time scanning for idle CPUs than we're idle for. Otherwise we're inhibiting work. This means that we need to consider the cost over all the wake-ups between consecutive idle periods. To track this, the scan cost is subtracted from the estimated average idle time. The impact of this patch is related to workloads that have domains that are fully busy or overloaded. Without the patch, the scan depth may be too high because a CPU is not reaching idle. Due to the nature of the patch, this is a regression magnet. It potentially wins when domains are almost fully busy or overloaded -- at that point searches are likely to fail but idle is not being aged as CPUs are active so search depth is too large and useless. It will potentially show regressions when there are idle CPUs and a deep search is beneficial. This tbench result on a 2-socket broadwell machine partially illustates the problem 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Hmean 1 445.02 ( 0.00%) 451.36 * 1.42%* Hmean 2 830.69 ( 0.00%) 846.03 * 1.85%* Hmean 4 1350.80 ( 0.00%) 1505.56 * 11.46%* Hmean 8 2888.88 ( 0.00%) 2586.40 * -10.47%* Hmean 16 5248.18 ( 0.00%) 5305.26 * 1.09%* Hmean 32 8914.03 ( 0.00%) 9191.35 * 3.11%* Hmean 64 10663.10 ( 0.00%) 10192.65 * -4.41%* Hmean 128 18043.89 ( 0.00%) 18478.92 * 2.41%* Hmean 256 16530.89 ( 0.00%) 17637.16 * 6.69%* Hmean 320 16451.13 ( 0.00%) 17270.97 * 4.98%* Note that 8 was a regression point where a deeper search would have helped but it gains for high thread counts when searches are useless. Hackbench is a more extreme example although not perfect as the tasks idle rapidly hackbench-process-pipes 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Amean 1 0.3950 ( 0.00%) 0.3887 ( 1.60%) Amean 4 0.9450 ( 0.00%) 0.9677 ( -2.40%) Amean 7 1.4737 ( 0.00%) 1.4890 ( -1.04%) Amean 12 2.3507 ( 0.00%) 2.3360 * 0.62%* Amean 21 4.0807 ( 0.00%) 4.0993 * -0.46%* Amean 30 5.6820 ( 0.00%) 5.7510 * -1.21%* Amean 48 8.7913 ( 0.00%) 8.7383 ( 0.60%) Amean 79 14.3880 ( 0.00%) 13.9343 * 3.15%* Amean 110 21.2233 ( 0.00%) 19.4263 * 8.47%* Amean 141 28.2930 ( 0.00%) 25.1003 * 11.28%* Amean 172 34.7570 ( 0.00%) 30.7527 * 11.52%* Amean 203 41.0083 ( 0.00%) 36.4267 * 11.17%* Amean 234 47.7133 ( 0.00%) 42.0623 * 11.84%* Amean 265 53.0353 ( 0.00%) 47.7720 * 9.92%* Amean 296 60.0170 ( 0.00%) 53.4273 * 10.98%* Stddev 1 0.0052 ( 0.00%) 0.0025 ( 51.57%) Stddev 4 0.0357 ( 0.00%) 0.0370 ( -3.75%) Stddev 7 0.0190 ( 0.00%) 0.0298 ( -56.64%) Stddev 12 0.0064 ( 0.00%) 0.0095 ( -48.38%) Stddev 21 0.0065 ( 0.00%) 0.0097 ( -49.28%) Stddev 30 0.0185 ( 0.00%) 0.0295 ( -59.54%) Stddev 48 0.0559 ( 0.00%) 0.0168 ( 69.92%) Stddev 79 0.1559 ( 0.00%) 0.0278 ( 82.17%) Stddev 110 1.1728 ( 0.00%) 0.0532 ( 95.47%) Stddev 141 0.7867 ( 0.00%) 0.0968 ( 87.69%) Stddev 172 1.0255 ( 0.00%) 0.0420 ( 95.91%) Stddev 203 0.8106 ( 0.00%) 0.1384 ( 82.92%) Stddev 234 1.1949 ( 0.00%) 0.1328 ( 88.89%) Stddev 265 0.9231 ( 0.00%) 0.0820 ( 91.11%) Stddev 296 1.0456 ( 0.00%) 0.1327 ( 87.31%) Again, higher thread counts benefit and the standard deviation shows that results are also a lot more stable when the idle time is aged. The patch potentially matters when a socket was multiple LLCs as the maximum search depth is lower. However, some of the test results were suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and other results were not dramatically different to other mcahines. Given the nature of the patch, Peter's full series is not being forward ported as each part should stand on its own. Preferably they would be merged at different times to reduce the risk of false bisections. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210615111611.GH30378@techsingularity.net
2021-06-17sched/cpufreq: Consider reduced CPU capacity in energy calculationLukasz Luba
Energy Aware Scheduling (EAS) needs to predict the decisions made by SchedUtil. The map_util_freq() exists to do that. There are corner cases where the max allowed frequency might be reduced (due to thermal). SchedUtil as a CPUFreq governor, is aware of that but EAS is not. This patch aims to address it. SchedUtil stores the maximum allowed frequency in 'sugov_policy::next_freq' field. EAS has to predict that value, which is the real used frequency. That value is made after a call to cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits. In the existing code EAS is not able to predict that real frequency. This leads to energy estimation errors. To avoid wrong energy estimation in EAS (due to frequency miss prediction) make sure that the step which calculates Performance Domain frequency, is also aware of the allowed CPU capacity. Furthermore, modify map_util_freq() to not extend the frequency value. Instead, use map_util_perf() to extend the util value in both places: SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity. In the end, we achieve the same desirable behavior for both subsystems and alignment in regards to the real CPU frequency. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part) Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
2021-06-17sched/fair: Take thermal pressure into account while estimating energyLukasz Luba
Energy Aware Scheduling (EAS) needs to be able to predict the frequency requests made by the SchedUtil governor to properly estimate energy used in the future. It has to take into account CPUs utilization and forecast Performance Domain (PD) frequency. There is a corner case when the max allowed frequency might be reduced due to thermal. SchedUtil is aware of that reduced frequency, so it should be taken into account also in EAS estimations. SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping to 'policy::max'. SchedUtil is responsible to respect that upper limit while setting the frequency through CPUFreq drivers. This effective frequency is stored internally in 'sugov_policy::next_freq' and EAS has to predict that value. In the existing code the raw value of arch_scale_cpu_capacity() is used for clamping the returned CPU utilization from effective_cpu_util(). This patch fixes issue with too big single CPU utilization, by introducing clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU capacity reduced by thermal pressure raw value. Thanks to knowledge about allowed CPU capacity, we don't get too big value for a single CPU utilization, which is then added to the util sum. The util sum is used as a source of information for estimating whole PD energy. To avoid wrong energy estimation in EAS (due to capped frequency), make sure that the calculation of util sum is aware of allowed CPU capacity. This thermal pressure might be visible in scenarios where the CPUs are not heavily loaded, but some other component (like GPU) drastically reduced available power budget and increased the SoC temperature. Thus, we still use EAS for task placement and CPUs are not over-utilized. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210614191128.22735-1-lukasz.luba@arm.com
2021-06-17sched/fair: Return early from update_tg_cfs_load() if delta == 0Dietmar Eggemann
In case the _avg delta is 0 there is no need to update se's _avg (level n) nor cfs_rq's _avg (level n-1). These values stay the same. Since cfs_rq's _avg isn't changed, i.e. no load is propagated down, cfs_rq's _sum should stay the same as well. So bail out after se's _sum has been updated. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210601083616.804229-1-dietmar.eggemann@arm.com
2021-06-17sched/pelt: Check that *_avg are null when *_sum areVincent Guittot
Check that we never break the rule that pelt's avg values are null if pelt's sum are. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210601155328.19487-1-vincent.guittot@linaro.org
2021-06-14sched/fair: Correctly insert cfs_rq's to list on unthrottleOdin Ugedal
Fix an issue where fairness is decreased since cfs_rq's can end up not being decayed properly. For two sibling control groups with the same priority, this can often lead to a load ratio of 99/1 (!!). This happens because when a cfs_rq is throttled, all the descendant cfs_rq's will be removed from the leaf list. When they initial cfs_rq is unthrottled, it will currently only re add descendant cfs_rq's if they have one or more entities enqueued. This is not a perfect heuristic. Instead, we insert all cfs_rq's that contain one or more enqueued entities, or it its load is not completely decayed. Can often lead to situations like this for equally weighted control groups: $ ps u -C stress USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 10009 88.8 0.0 3676 100 pts/1 R+ 11:04 0:13 stress --cpu 1 root 10023 3.0 0.0 3676 104 pts/1 R+ 11:04 0:00 stress --cpu 1 Fixes: 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()") [vingo: !SMP build fix] Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210612112815.61678-1-odin@uged.al
2021-06-03Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-03sched/fair: Fix util_est UTIL_AVG_UNCHANGED handlingDietmar Eggemann
The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent unnecessary util_est updates uses the LSB of util_est.enqueued. It is exposed via _task_util_est() (and task_util_est()). Commit 92a801e5d5b7 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages") mentions that the LSB is lost for util_est resolution but find_energy_efficient_cpu() checks if task_util_est() returns 0 to return prev_cpu early. _task_util_est() returns the max value of util_est.ewma and util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED. So task_util_est() returning the max of task_util() and _task_util_est() will never return 0 under the default SCHED_FEAT(UTIL_EST, true). To fix this use the MSB of util_est.enqueued instead and keep the flag util_est internal, i.e. don't export it via _task_util_est(). The maximal possible util_avg value for a task is 1024 so the MSB of 'unsigned int util_est.enqueued' isn't used to store a util value. As a caveat the code behind the util_est_se trace point has to filter UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should be easy to do. This also fixes an issue report by Xuewen Yan that util_est_update() only used UTIL_AVG_UNCHANGED for the subtrahend of the equation: last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED) Fixes: b89997aa88f0b sched/pelt: Fix task util_est update filtering Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
2021-06-03sched/pelt: Ensure that *_sum is always synced with *_avgVincent Guittot
Rounding in PELT calculation happening when entities are attached/detached of a cfs_rq can result into situations where util/runnable_avg is not null but util/runnable_sum is. This is normally not possible so we need to ensure that util/runnable_sum stays synced with util/runnable_avg. detach_entity_load_avg() is the last place where we don't sync util/runnable_sum with util/runnbale_avg when moving some sched_entities Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210601085832.12626-1-vincent.guittot@linaro.org