path: root/kernel/sched
Age    Commit message    Author
2024-06-18sched_ext: Implement BPF extensible scheduler classTejun Heo
Implement a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to achieve the following:

1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies.

2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments.

sched_ext leverages BPF's struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the simpler and more ergonomic struct sched_ext_ops callbacks.

For more detailed discussion on the motivations and overview, please refer to the cover letter. Later patches will also add several example schedulers and documentation.

This patch implements the minimum core framework to enable implementation of BPF schedulers. Subsequent patches will gradually add functionalities including safety guarantee mechanisms, nohz and cgroup support.

include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on top, each operation should be self-explanatory. The following are worth noting:

- Both "sched_ext" and its shorthand "scx" are used. If the identifier already has "sched" in it, "ext" is used; otherwise, "scx".

- In sched_ext_ops, only .name is mandatory. Every operation is optional and if omitted a simple but functional default behavior is provided.

- A new policy constant SCHED_EXT is added and a task can select sched_ext by invoking sched_setscheduler(2) with the new policy constant. However, if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS. When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy are switched to sched_ext.

- To bridge the workflow imbalance between the scheduler core and sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be used by a scheduler that doesn't require it. SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting the next task on the CPU. The BPF scheduler can manage an arbitrary number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

- sched_ext guarantees system integrity no matter what the BPF scheduler does. To enable this, each task's ownership is tracked through p->scx.ops_state and all tasks are put on the scx_tasks list. The disable path can always recover and revert all tasks back to CFS. See p->scx.ops_state and scx_tasks.

- A task is not tied to its rq while enqueued. This decouples CPU selection from queueing and allows sharing a scheduling queue across an arbitrary subset of CPUs. This adds some complexities as a task may need to be bounced between rq's right before it starts executing. See dispatch_to_local_dsq() and move_task_to_local_dsq().

- One complication that arises from the above weak association between task and rq is that synchronizing with dequeue() gets complicated as dequeue() may happen anytime while the task is enqueued and the dispatch path might need to release the rq lock to transfer the task. Solving this requires a bit of complexity. See the logic around p->scx.sticky_cpu and p->scx.ops_qseq.

- Both enable and disable paths are a bit complicated. The enable path switches all tasks without blocking to avoid issues which can arise from partially switched states (e.g. the switching task itself being starved). The disable path can't trust the BPF scheduler at all, so it also has to guarantee forward progress without blocking. See scx_ops_enable() and scx_ops_disable_workfn().

- When sched_ext is disabled, static_branches are used to shut down the entry points from hot paths.

v7:
- scx_ops_bypass() was incorrectly and unnecessarily trying to grab scx_ops_enable_mutex which can lead to deadlocks in the disable path. Fixed.
- Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead to use-after-free.
- Consolidated per-cpu variable usages and other cleanups.

v6:
- SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple groups can be expressed. Later CPU hotplug operations are put into their own group.
- SCX_OPS_DISABLING state is replaced with the new bypass mechanism which allows temporarily putting the system into simple FIFO scheduling mode bypassing the BPF scheduler. In addition to the shut down path, this will also be used to isolate the BPF scheduler across PM events. Enabling and disabling the bypass mode requires iterating all runnable tasks. rq->scx.runnable_list addition is moved from the later watchdog patch.
- ops.prep_enable() is replaced with ops.init_task() and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext.
- scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage. SCX_KF_SELECT_CPU is added to support the change.
- SCX_TASK_ENQ_LOCAL which told the BPF scheduler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param.
- scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up queueing many tasks on a local DSQ which makes tasks execute in order while other CPUs stay idle which made some hackbench numbers really bad. Fixed.
- The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs.
- sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may be NULL and a BPF scheduler which derefs the pointer without checking could crash the kernel. Tell BPF. This is currently a bit ugly. A better way to annotate this is expected in the future.
- scx_exit_info updated to carry pointers to message buffers instead of embedding them directly. This decouples buffer sizes from API so that they can be changed without breaking compatibility.
- exit_code added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and will be used to handle e.g. CPU hotplugs.
- The patch "sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext" is folded in and the interface is changed so that partial switching is indicated with a new ops flag %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessary and in turn SCX_KF_INIT. ops.init() is now called with SCX_KF_SLEEPABLE.
- Code reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files.
- Changes to reflect the BPF and other kernel changes including the addition of bpf_sched_ext_ops.cfi_stubs.

v5:
- To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t instead of atomic64_t and scx_dsp_buf_ent.qseq which uses load_acquire/store_release is now unsigned long instead of u64.
- Fix the bug where bpf_scx_btf_struct_access() was allowing write access to arbitrary fields.
- Distinguish kfuncs which can be called from any sched_ext ops and from anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from sched_ext ops.
- Rename "type" to "kind" in scx_exit_info to make it easier to use in languages in which "type" is a reserved keyword.
- Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup"), PF_IDLE is not set on idle tasks which haven't been online yet which made scx_task_iter_next_filtered() include those idle tasks in iterations leading to oopses. Update scx_task_iter_next_filtered() to directly test p->sched_class against idle_sched_class instead of using is_idle_task() which tests PF_IDLE.
- Other updates to match upstream changes such as adding const to set_cpumask() param and renaming check_preempt_curr() to wakeup_preempt().

v4:
- SCHED_CHANGE_BLOCK replaced with the previous sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is because upstream is adopting a different generic cleanup mechanism. Once that lands, the code will be adapted accordingly.
- task_on_scx() used to test whether a task should be switched into SCX, which is confusing. Renamed to task_should_scx(). task_on_scx() now tests whether a task is currently on SCX.
- scx_has_idle_cpus is barely used anymore and replaced with direct check on the idle cpumask.
- SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer fully idle cores.
- ops.enable() now sees up-to-date p->scx.weight value.
- ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF schedulers expecting ->select_cpu() call.
- Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest of the scheduler.

v3:
- ops.set_weight() added to allow BPF schedulers to track weight changes without polling p->scx.weight.
- move_task_to_local_dsq() was losing SCX-specific enq_flags when enqueueing the task on the target dsq because it goes through activate_task() which loses the upper 32bit of the flags. Carry the flags through rq->scx.extra_enq_flags.
- scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running() and scx_bpf_task_cpu() now use the new KF_RCU instead of KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.
- The kfunc helper access control mechanism implemented through sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always used when invoking scx_ops operations.

v2:
- balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is called from put_prev_task_scx() and pick_next_task_scx() as necessary. To determine whether balance_scx() should be called from put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the comment in put_prev_task_scx() for details.
- sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced with SCHED_CHANGE_BLOCK().
- Unused all_dsqs list removed. This was a left-over from previous iterations.
- p->scx.kf_mask is added to track and enforce which kfunc helpers are allowed. Also, init/exit sequences are updated to make some kfuncs always safe to call regardless of the current BPF scheduler state. Combined, this should make all the kfuncs safe.
- BPF now supports sleepable struct_ops operations. Hacky workaround removed and operations and kfunc helpers are tagged appropriately.
- BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask() and friends are added so that BPF schedulers can use the idle masks with the generic helpers. This replaces the hacky kfunc helpers added by a separate patch in V1.
- CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is enabled. This restriction will be removed by a later patch which adds core-sched support.
- Add MAINTAINERS entries and other misc changes.

Signed-off-by: Tejun Heo <tj@kernel.org> Co-authored-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Andrea Righi <andrea.righi@canonical.com>
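To make the dispatch queue flow concrete, here is a rough sketch of a minimal BPF scheduler in the style of the example schedulers added later in the series; the callback and kfunc names follow the conventions described above, but treat the exact signatures as approximations rather than the authoritative interface:

    /* minimal_sched.bpf.c -- illustrative sketch only */
    #include <scx/common.bpf.h>

    char _license[] SEC("license") = "GPL";

    /* Put every runnable task on the shared global DSQ ... */
    void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
    {
            scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
    }

    /* ... and let each CPU pull from it when its local DSQ runs dry. */
    void BPF_STRUCT_OPS(minimal_dispatch, s32 cpu, struct task_struct *prev)
    {
            scx_bpf_consume(SCX_DSQ_GLOBAL);
    }

    SEC(".struct_ops.link")
    struct sched_ext_ops minimal_ops = {
            .enqueue  = (void *)minimal_enqueue,
            .dispatch = (void *)minimal_dispatch,
            .name     = "minimal",     /* .name is the only mandatory field */
    };

All omitted operations fall back to the simple default behavior mentioned above, which is what makes such a tiny scheduler functional at all.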
2024-06-18sched_ext: Add boilerplate for extensible scheduler classTejun Heo
This adds dummy implementations of sched_ext interfaces which interact with the scheduler core and hook them in the correct places. As they're all dummies, this doesn't cause any behavior changes. This is split out to help reviewing. v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18sched: Add normal_policy()Tejun Heo
A new BPF extensible sched_class will need to dynamically change how a task picks its sched_class. For example, if the loaded BPF scheduler progs fail, the tasks will be forced back on CFS even if the task's policy is set to the new sched_class. To support such mapping, add normal_policy() which wraps testing for %SCHED_NORMAL. This doesn't cause any behavior changes. v2: Update the description with more details on the expected use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
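For illustration, the wrapper amounts to roughly the following (a sketch based on the description above; the authoritative definition lives in kernel/sched/sched.h):

    static inline int normal_policy(int policy)
    {
            return policy == SCHED_NORMAL;
    }

    static inline int fair_policy(int policy)
    {
            /* A BPF extensible class can later hook into this test. */
            return normal_policy(policy) || policy == SCHED_BATCH;
    }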
2024-06-18sched: Factor out update_other_load_avgs() from __update_blocked_others()Tejun Heo
RT, DL, thermal and irq load and utilization metrics need to be decayed and updated periodically and before consumption to keep the numbers reasonable. This is currently done from __update_blocked_others() as a part of the fair class load balance path. Let's factor it out to update_other_load_avgs(). Pure refactor. No functional changes. This will be used by the new BPF extensible scheduling class to ensure that the above metrics are properly maintained. v2: Refreshed on top of tip:sched/core. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>
2024-06-18sched: Factor out cgroup weight conversion functionsTejun Heo
Factor out sched_weight_from/to_cgroup() which convert between scheduler shares and cgroup weight. No functional change. The factored out functions will be used by a new BPF extensible sched_class so that the weights can be exposed to the BPF programs in a way which is consistent with cgroup weights and easier to interpret. The weight conversions will be used regardless of cgroup usage. It's just borrowing the cgroup weight range as it's more intuitive. CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that the conversion helpers can always be defined. v2: The helpers are now defined regardless of CONFIG_CGROUPS. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
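A sketch of what such conversion helpers look like, assuming the usual 1024-based scheduler weight scale and CGROUP_WEIGHT_DFL == 100 (the rounding details here are illustrative):

    static inline unsigned long sched_weight_from_cgroup(unsigned long cgrp_weight)
    {
            /* cgroup weight 100 (the default) maps to scheduler weight 1024 */
            return DIV_ROUND_CLOSEST_ULL(cgrp_weight * 1024, CGROUP_WEIGHT_DFL);
    }

    static inline unsigned long sched_weight_to_cgroup(unsigned long weight)
    {
            return clamp_t(unsigned long,
                           DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024),
                           CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_MAX);
    }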
2024-06-18sched: Add sched_class->switching_to() and expose check_class_changing/changed()Tejun Heo
When a task switches to a new sched_class, the prev and new classes are notified through ->switched_from() and ->switched_to(), respectively, after the switching is done. A new BPF extensible sched_class will have callbacks that allow the BPF scheduler to keep track of relevant task states (like priority and cpumask). Those callbacks aren't called while a task is on a different sched_class. When a task comes back, we want to tell the BPF progs the up-to-date state before the task gets enqueued, so we need a hook which is called before the switching is committed. This patch adds ->switching_to() which is called during sched_class switch through check_class_changing() before the task is restored. Also, this patch exposes check_class_changing/changed() in kernel/sched/sched.h. They will be used by the new BPF extensible sched_class to implement implicit sched_class switching which is used e.g. when falling back to CFS when the BPF scheduler fails or unloads. This is a prep patch and doesn't cause any behavior changes. The new operation and exposed functions aren't used yet. v3: Refreshed on top of tip:sched/core. v2: Improve patch description w/ details on planned use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18sched: Add sched_class->reweight_task()Tejun Heo
Currently, during a task weight change, sched core directly calls reweight_task() defined in fair.c if @p is on CFS. Let's make it a proper sched_class operation instead. CFS's reweight_task() is renamed to reweight_task_fair() and now called through sched_class. While it turns a direct call into an indirect one, set_load_weight() isn't called from a hot path and this change shouldn't cause any noticeable difference. This will be used to implement reweight_task for a new BPF extensible sched_class so that it can keep its cached task weight up-to-date. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork()Tejun Heo
A new BPF extensible sched_class will need more control over the forking process. It wants to be able to fail from sched_cgroup_fork() after the new task's sched_task_group is initialized so that the loaded BPF program can prepare the task once its cgroup association is established and reject the fork if e.g. allocation fails. Allow sched_cgroup_fork() to fail by making it return int instead of void and adding sched_cancel_fork() to undo sched_fork() in the error path. sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any behavior changes. v2: Patch description updated to detail the expected use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18sched: Restructure sched_class order sanity checks in sched_init()Tejun Heo
Currently, sched_init() checks that the sched_class'es are in the expected order by testing each adjacency which is a bit brittle and makes it cumbersome to add optional sched_class'es. Instead, let's verify whether they're in the expected order using sched_class_above() which is what matters. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: David Vernet <dvernet@meta.com>
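Expressed with sched_class_above(), the sanity check reduces to something like the following sketch (the exact set of classes checked depends on the kernel config):

    /* Each class must rank strictly above the next lower one. */
    BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
    BUG_ON(!sched_class_above(&dl_sched_class,   &rt_sched_class));
    BUG_ON(!sched_class_above(&rt_sched_class,   &fair_sched_class));
    BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));

Adding an optional class then only requires inserting one more pairwise check instead of renumbering an adjacency chain.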
2024-06-06rcu/tasks: Fix stale task snapshot for Tasks TraceFrederic Weisbecker
When RCU-TASKS-TRACE pre-gp takes a snapshot of the current task running on all online CPUs, no explicit ordering synchronizes properly with a context switch. This lack of ordering can permit the new task to miss pre-grace-period update-side accesses. The following diagram, courtesy of Paul, shows the possible bad scenario:

  CPU 0                                     CPU 1
  -----                                     -----
  // Pre-GP update side access
  WRITE_ONCE(*X, 1);
  smp_mb();
  r0 = rq->curr;
                                            RCU_INIT_POINTER(rq->curr, TASK_B)
                                            spin_unlock(rq)
                                            rcu_read_lock_trace()
                                            r1 = X; /* ignore TASK_B */

Either r0==TASK_B or r1==1 is needed but neither is guaranteed.

One possible solution to solve this is to wait for an RCU grace period at the beginning of the RCU-tasks-trace grace period before taking the current tasks snapshot. However this would introduce large additional latencies to RCU-tasks-trace grace periods.

Another solution is to lock the target runqueue while taking the current task snapshot. This ensures that the update side sees the latest context switch and subsequent context switches will see the pre-grace-period update side accesses.

This commit therefore adds runqueue locking to cpu_curr_snapshot().

Fixes: e386b6725798 ("rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-05sched/balance: Skip unnecessary updates to idle load balancer's flagsTim Chen
We observed that the overhead of trigger_load_balance(), now renamed sched_balance_trigger(), has risen with a system's core count. For an OLTP workload running a 6.8 kernel on a 2 socket x86 system having 96 cores/socket, we saw that 0.7% of cpu cycles are spent in trigger_load_balance(). On older systems with fewer cores/socket, this function's overhead was less than 0.1%.

The cause of this overhead was that there are multiple cpus calling kick_ilb(flags), updating the balancing work needed to a common idle load balancer cpu. The ilb_cpu's flags field got updated unconditionally with atomic_fetch_or(). The atomic reads and writes to ilb_cpu's flags cause much cache bouncing and cpu cycles overhead. This is seen in the annotated profile below:

  kick_ilb():
          if (ilb_cpu < 0)
                  test   %r14d,%r14d
                ↑ js     6c
          flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
                  mov    $0x2d600,%rdi
                  movslq %r14d,%r8
                  mov    %rdi,%rdx
                  add    -0x7dd0c3e0(,%r8,8),%rdx
          arch_atomic_read():
   0.01           mov    0x64(%rdx),%esi
  35.58           add    $0x64,%rdx
          arch_atomic_fetch_or():
          static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
          {
                  int val = arch_atomic_read(v);
                  do {
                  } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
   0.03    157:   mov    %r12d,%ecx
          arch_atomic_try_cmpxchg():
                  return arch_try_cmpxchg(&v->counter, old, new);
   0.00           mov    %esi,%eax
          arch_atomic_fetch_or():
                  do {
                  } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
                  or     %esi,%ecx
          arch_atomic_try_cmpxchg():
                  return arch_try_cmpxchg(&v->counter, old, new);
   0.01           lock   cmpxchg %ecx,(%rdx)
  42.96         ↓ jne    2d2
          kick_ilb():

With instrumentation, we found that 81% of the updates do not result in any change in the ilb_cpu's flags. That is, multiple cpus are asking the ilb_cpu to do the same things over and over again, before the ilb_cpu has a chance to run NOHZ load balance.

Skip updates to ilb_cpu's flags if no new work needs to be done. Such updates do not change ilb_cpu's NOHZ flags. This requires an extra atomic read but it is less expensive than frequent unnecessary atomic updates that generate cache bounces.

We saw that on the OLTP workload, cpu cycles from trigger_load_balance() (or sched_balance_trigger()) got reduced from 0.7% to 0.2%.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240531205452.65781-1-tim.c.chen@linux.intel.com
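The skipped-update logic amounts to an early exit along these lines (an illustrative sketch of the pattern, not the literal hunk):

    /*
     * kick_ilb(): if every requested bit is already set on the idle load
     * balancer CPU, the atomic RMW (and the cacheline bounce it causes)
     * can be skipped entirely.
     */
    if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)
            return;

    flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));

The extra plain read is cheap because the line can stay shared across CPUs; only the rare genuinely-new request pays for the exclusive ownership transfer.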
2024-06-05idle: Remove stale RCU commentChristian Loehle
The call of rcu_idle_enter() from within cpuidle_idle_call() was removed in commit 1098582a0f6c ("sched,idle,rcu: Push rcu_idle deeper into the idle path") which makes the comment out of place. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/5b936388-47df-4050-9229-6617a6c2bba5@arm.com
2024-06-05sched/headers: Move struct pre-declarations to the beginning of the headerIngo Molnar
There's a random number of structure pre-declaration lines in kernel/sched/sched.h, some of which are unnecessary duplicates. Move them to the head & order them a bit for readability. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org
2024-06-05sched/core: Clean up kernel/sched/sched.h a bitIngo Molnar
- Fix whitespace noise
- Fix col80 linebreak damage where possible
- Apply CodingStyle consistently
- Use consistent #else and #endif comments
- Use consistent vertical alignment
- Use 'extern' consistently

Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org
2024-06-05sched/core: Simplify prefetch_curr_exec_start()Ingo Molnar
Remove unnecessary use of the address operator. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org
2024-05-27sched: Fix spelling in commentsIngo Molnar
Do a spell-checking pass. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-05-27sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.cIngo Molnar
core.c has become rather large, move most scheduler syscall related functionality into a separate file, syscalls.c. This is about ~15% of core.c's raw linecount.

Move the alloc_user_cpus_ptr(), __rt_effective_prio(), rt_effective_prio(), uclamp_none(), uclamp_se_set() and uclamp_bucket_id() inlines to kernel/sched/sched.h.

Internally export the __sched_setscheduler(), __sched_setaffinity(), __setscheduler_prio(), set_load_weight(), enqueue_task(), dequeue_task(), check_class_changed(), splice_balance_callbacks() and balance_callbacks() methods to better facilitate this.

Move the new file's build to build_policy.c, because it fits there semantically, but also because it's the smallest of the 4 build units under an allmodconfig build:

  -rw-rw-r-- 1 mingo mingo 7.3M May 27 12:35 kernel/sched/core.i
  -rw-rw-r-- 1 mingo mingo 6.4M May 27 12:36 kernel/sched/build_utility.i
  -rw-rw-r-- 1 mingo mingo 6.3M May 27 12:36 kernel/sched/fair.i
  -rw-rw-r-- 1 mingo mingo 5.8M May 27 12:36 kernel/sched/build_policy.i

This better balances build time for scheduler subsystem rebuilds. I build-tested this new file as a standalone syscalls.o file for a bit, to make sure all the encapsulations & abstractions are robust.

Also update/add my copyright notices to these files.

Build time measurements:

  # -Before/+After:

  kepler:~/tip> perf stat -e 'cycles,instructions,duration_time' --sync --repeat 5 --pre 'rm -f kernel/sched/*.o' m kernel/sched/built-in.a >/dev/null

  Performance counter stats for 'm kernel/sched/built-in.a' (5 runs):

  -   71,938,508,607      cycles                           ( +- 0.17% )
  +   71,992,916,493      cycles                           ( +- 0.22% )

  -  106,214,780,964      instructions  # 1.48 insn per cycle  ( +- 0.01% )
  +  105,450,231,154      instructions  # 1.46 insn per cycle  ( +- 0.01% )

  -    5,878,232,620 ns   duration_time                    ( +- 0.38% )
  +    5,290,085,069 ns   duration_time                    ( +- 0.21% )

  -           5.8782 +- 0.0221 seconds time elapsed        ( +- 0.38% )
  +           5.2901 +- 0.0111 seconds time elapsed        ( +- 0.21% )

Build time improvement of -11.1% (duration_time) is expected: the parallel build time of the scheduler subsystem is determined by the largest, slowest to build object file, which is kernel/sched/core.o. By moving ~15% of its complexity into another build unit, we reduced build time by -11%.

Measured cycles spent on building is within its ~0.2% stddev noise envelope. The -0.7% reduction in instructions spent on building the scheduler is statistically reliable and somewhat surprising - I can only speculate: maybe compilers aren't that efficient at building & optimizing 10+ KLOC files (core.c), and it's an overall win to balance the linecount a bit.

Anyway, this might be a data point that suggests that reducing the linecount of our largest files will improve not just code readability and maintainability, but might also improve build times a bit.

Code generation got a bit worse, by 0.5kb text on an x86 defconfig build:

  # -Before/+After:

  kepler:~/tip> size vmlinux
        text     data      bss       dec      hex  filename
  - 26475475 10439178  1740804  38655457  24dd5e1  vmlinux
  + 26476003 10439178  1740804  38655985  24dd7f1  vmlinux

  kepler:~/tip> size kernel/sched/built-in.a
      text   data   bss     dec    hex  filename
  -  76056  30025   489  106570  1a04a  kernel/sched/core.o (ex kernel/sched/built-in.a)
  +  63452  29453   489   93394  16cd2  kernel/sched/core.o (ex kernel/sched/built-in.a)
     44299   2181   104   46584   b5f8  kernel/sched/fair.o (ex kernel/sched/built-in.a)
  -  42764   3424   120   46308   b4e4  kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
  +  55651   4044   120   59815   e9a7  kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
     44866  12655  2192   59713   e941  kernel/sched/build_utility.o (ex kernel/sched/built-in.a)
     44866  12655  2192   59713   e941  kernel/sched/build_utility.o (ex kernel/sched/built-in.a)

This is primarily due to the extra functions exported, and the size gets exaggerated somewhat by __pfx CFI function padding:

  ffffffff810cc710 <__pfx_enqueue_task>:
  ffffffff810cc710: 90  nop
  ffffffff810cc711: 90  nop
  ffffffff810cc712: 90  nop
  ffffffff810cc713: 90  nop
  ffffffff810cc714: 90  nop
  ffffffff810cc715: 90  nop
  ffffffff810cc716: 90  nop
  ffffffff810cc717: 90  nop
  ffffffff810cc718: 90  nop
  ffffffff810cc719: 90  nop
  ffffffff810cc71a: 90  nop
  ffffffff810cc71b: 90  nop
  ffffffff810cc71c: 90  nop
  ffffffff810cc71d: 90  nop
  ffffffff810cc71e: 90  nop
  ffffffff810cc71f: 90  nop

AFAICS the cost is primarily not to core.o and fair.o though (which contain most performance sensitive scheduler functions), only to syscalls.o that get called with much lower frequency - so I think this is an acceptable trade-off for better code separation.

Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20240407084319.1462211-2-mingo@kernel.org
2024-05-21Merge tag 'bitmap-for-6.10v2' of https://github.com/norov/linuxLinus Torvalds
Pull bitmap updates from Yury Norov:

 - topology_span_sane() optimization from Kyle Meyer
 - fns() rework from Kuan-Wei Chiu (used in cpumask_local_spread() and other places)
 - headers cleanup from Andy
 - add a MAINTAINERS record for bitops API

* tag 'bitmap-for-6.10v2' of https://github.com/norov/linux:
  usercopy: Don't use "proxy" headers
  bitops: Move aligned_byte_mask() to wordpart.h
  MAINTAINERS: add BITOPS API record
  bitmap: relax find_nth_bit() limitation on return value
  lib: make test_bitops compilable into the kernel image
  bitops: Optimize fns() for improved performance
  lib/test_bitops: Add benchmark test for fns()
  Compiler Attributes: Add __always_used macro
  sched/topology: Optimize topology_span_sane()
  cpumask: Add for_each_cpu_from()
2024-05-19Merge tag 'sched-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull scheduler fixes from Ingo Molnar:

 - Fix a sched_balance_newidle setting bug
 - Fix bug in the setting of /sys/fs/cgroup/test/cpu.max.burst
 - Fix variable-shadowing build warning
 - Extend sched-domains debug output
 - Fix documentation
 - Fix comments

* tag 'sched-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write()
  sched/fair: Remove stale FREQUENCY_UTIL comment
  sched/fair: Fix initial util_avg calculation
  docs: cgroup-v1: Clarify that domain levels are system-specific
  sched/debug: Dump domains' level
  sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level
  arch/topology: Fix variable naming to avoid shadowing
2024-05-17Merge tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctlLinus Torvalds
Pull sysctl updates from Joel Granados:

 - Remove sentinel elements from ctl_table structs in kernel/*

   Removing sentinels in ctl_table arrays reduces the build time size and runtime memory consumed by ~64 bytes per array. Removals for net/, io_uring/, mm/, ipc/ and security/ are set to go into mainline through their respective subsystems making the next release the most likely place where the final series that removes the check for proc_name == NULL will land. This adds to removals already in arch/, drivers/ and fs/.

 - Adjust ctl_table definitions and references to allow constification

 - Remove unused ctl_table function arguments

 - Move non-const elements from ctl_table to ctl_table_header

 - Make ctl_table pointers const in ctl_table_root structure

   Making the static ctl_table structs const will increase safety by keeping the pointers to proc_handler functions in .rodata. Though no ctl_tables were made const in this PR, the ground work for making that possible has started with these changes sent by Thomas Weißschuh.

* tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: drop now unnecessary out-of-bounds check
  sysctl: move sysctl type to ctl_table_header
  sysctl: drop sysctl_is_perm_empty_ctl_table
  sysctl: treewide: constify argument ctl_table_root::permissions(table)
  sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
  bpf: Remove the now superfluous sentinel elements from ctl_table array
  delayacct: Remove the now superfluous sentinel elements from ctl_table array
  kprobes: Remove the now superfluous sentinel elements from ctl_table array
  printk: Remove the now superfluous sentinel elements from ctl_table array
  scheduler: Remove the now superfluous sentinel elements from ctl_table array
  seccomp: Remove the now superfluous sentinel elements from ctl_table array
  timekeeping: Remove the now superfluous sentinel elements from ctl_table array
  ftrace: Remove the now superfluous sentinel elements from ctl_table array
  umh: Remove the now superfluous sentinel elements from ctl_table array
  kernel misc: Remove the now superfluous sentinel elements from ctl_table array
2024-05-17sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write()Cheng Yu
In the cgroup v2 CPU subsystem, assuming we have a cgroup named 'test', and we set cpu.max and cpu.max.burst:

  # echo 1000000 > /sys/fs/cgroup/test/cpu.max
  # echo 1000000 > /sys/fs/cgroup/test/cpu.max.burst

then we check cpu.max and cpu.max.burst:

  # cat /sys/fs/cgroup/test/cpu.max
  1000000 100000
  # cat /sys/fs/cgroup/test/cpu.max.burst
  1000000

Next we set cpu.max again and check cpu.max and cpu.max.burst:

  # echo 2000000 > /sys/fs/cgroup/test/cpu.max
  # cat /sys/fs/cgroup/test/cpu.max
  2000000 100000
  # cat /sys/fs/cgroup/test/cpu.max.burst
  1000

... we find that the cpu.max.burst value changed unexpectedly.

The cause is a unit mismatch: the burst value returned by tg_get_cfs_burst() is in microseconds, while the burst value used for the calculation in cpu_max_write() should be in nanoseconds, which leads to the bug. To fix it, get the burst value directly from tg->cfs_bandwidth.burst.

Fixes: f4183717b370 ("sched/fair: Introduce the burstable CFS controller") Reported-by: Qixin Liao <liaoqixin@huawei.com> Signed-off-by: Cheng Yu <serein.chengyu@huawei.com> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240424132438.514720-1-serein.chengyu@huawei.com
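A sketch of the unit mismatch and its fix, with the surrounding cpu_max_write() code elided (names as described above):

    /* cpu_max_write(), sketch of the fix */
    struct task_group *tg = css_tg(of_css(of));
    u64 period = tg_get_cfs_period(tg);
    u64 burst  = tg->cfs_bandwidth.burst;   /* nanoseconds, what the math expects */
    /* previously: u64 burst = tg_get_cfs_burst(tg);  -- microseconds, wrong unit */

Because the stale code treated a microsecond value as nanoseconds, every write to cpu.max silently shrank the stored burst by a factor of 1000, which is exactly the 1000000 -> 1000 jump seen in the transcript above.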
2024-05-17sched/fair: Remove stale FREQUENCY_UTIL commentChristian Loehle
On 05/03/2024 15:05, Vincent Guittot wrote:

I'm fine with either and that was my first thought here, too, but it did seem like the comment was mostly placed there to justify the 'unexpected' high utilization when explicitly passing FREQUENCY_UTIL and the need to clamp it then. So removing did feel slightly more natural to me anyway.

So alternatively:

  From: Christian Loehle <christian.loehle@arm.com>
  Date: Tue, 5 Mar 2024 09:34:41 +0000
  Subject: [PATCH] sched/fair: Remove stale FREQUENCY_UTIL mention

  effective_cpu_util() flags were removed, so remove mentioning of the flag.

  commit 9c0b4bb7f6303 ("sched/cpufreq: Rework schedutil governor performance estimation") reworked effective_cpu_util() removing enum cpu_util_type. Modify the comment accordingly.

Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/0e2833ee-0939-44e0-82a2-520a585a0153@arm.com
2024-05-17sched/fair: Fix initial util_avg calculationDawei Li
Change se->load.weight to se_weight(se) in the calculation for the initial util_avg to avoid unnecessarily inflating the util_avg by 1024 times. The reason is that se->load.weight has the unit/scale as the scaled-up load, while cfs_rq->avg.load_avg has the unit/scale as the true task weight (as mapped directly from the task's nice/priority value). With CONFIG_32BIT, the scaled-up load is equal to the true task weight. With CONFIG_64BIT, the scaled-up load is 1024 times the true task weight. Thus, the current code may inflate the util_avg by 1024 times. The follow-up capping will not allow the util_avg value to go wild. But the calculation should have the correct logic. Signed-off-by: Dawei Li <daweilics@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: https://lore.kernel.org/r/20240315015916.21545-1-daweilics@gmail.com
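Illustrative sketch of the corrected scaling in post_init_entity_util_avg(), assuming the usual capping logic around it:

    struct sched_avg *sa = &se->avg;

    if (cap > 0) {
            if (cfs_rq->avg.util_avg != 0) {
                    /*
                     * se_weight(se) is the unscaled task weight;
                     * se->load.weight is 1024x larger on 64-bit and
                     * inflated util_avg accordingly.
                     */
                    sa->util_avg  = cfs_rq->avg.util_avg * se_weight(se);
                    sa->util_avg /= (cfs_rq->avg.load_avg + 1);

                    if (sa->util_avg > cap)
                            sa->util_avg = cap;
            } else {
                    sa->util_avg = cap;
            }
    }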
2024-05-17sched/debug: Dump domains' levelVitalii Bursov
Knowing a domain's level exactly can be useful when setting relax_domain_level or cpuset.sched_relax_domain_level.

Usage:

  cat /debug/sched/domains/cpu0/domain1/level

to dump cpu0 domain1's level.

SDM macro is not used because sd->level is 'int' and it would hide the type mismatch between 'int' and 'u32'.

Signed-off-by: Vitalii Bursov <vitaly@bursov.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/9489b6475f6dd6fbc67c617752d4216fa094da53.1714488502.git.vitaly@bursov.com
2024-05-17sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_levelVitalii Bursov
Change relax_domain_level checks so that it would be possible to include or exclude all domains from newidle balancing.

This matches the behavior described in the documentation:

  -1  no request. use system default or follow request of others.
   0  no search.
   1  search siblings (hyperthreads in a core).

"2" enables levels 0 and 1, level_max excludes the last (level_max) level, and level_max+1 includes all levels.

Fixes: 1d3504fcf560 ("sched, cpuset: customize sched domains, core") Signed-off-by: Vitalii Bursov <vitaly@bursov.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
2024-05-13Merge tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull scheduler updates from Ingo Molnar:

 - Add cpufreq pressure feedback for the scheduler
 - Rework misfit load-balancing wrt affinity restrictions
 - Clean up and simplify the code around ::overutilized and ::overload access.
 - Simplify sched_balance_newidle()
 - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES handling that changed the output.
 - Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch()
 - Reorganize, clean up and unify most of the higher level scheduler balancing function names around the sched_balance_*() prefix
 - Simplify the balancing flag code (sched_balance_running)
 - Miscellaneous cleanups & fixes

* tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  sched/pelt: Remove shift of thermal clock
  sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
  thermal/cpufreq: Remove arch_update_thermal_pressure()
  sched/cpufreq: Take cpufreq feedback into account
  cpufreq: Add a cpufreq pressure feedback for the scheduler
  sched/fair: Fix update of rd->sg_overutilized
  sched/vtime: Do not include <asm/vtime.h> header
  s390/irq,nmi: Include <asm/vtime.h> header directly
  s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  sched/vtime: Get rid of generic vtime_task_switch() implementation
  sched/vtime: Remove confusing arch_vtime_task_switch() declaration
  sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
  sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
  sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
  sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
  sched/fair: Rename root_domain::overload to ::overloaded
  sched/fair: Use helper functions to access root_domain::overload
  sched/fair: Check root_domain::overload value before update
  sched/fair: Combine EAS check with root_domain::overutilized access
  sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
  ...
2024-05-09sched/topology: Optimize topology_span_sane()Kyle Meyer
Optimize topology_span_sane() by removing duplicate comparisons. Since topology_span_sane() is called inside of for_each_cpu(), each previous CPU has already been compared against every other CPU. The current CPU only needs to be compared against higher-numbered CPUs. The total number of comparisons is reduced from N * (N - 1) to N * (N - 1) / 2 on each non-NUMA scheduling domain level. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Yury Norov <yury.norov@gmail.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Yury Norov <yury.norov@gmail.com>
2024-04-28sched/isolation: Fix boot crash when maxcpus < first housekeeping CPUOleg Nesterov
housekeeping_setup() checks cpumask_intersects(present, online) to ensure that the kernel will have at least one housekeeping CPU after smp_init(), but this doesn't work if the maxcpus= kernel parameter limits the number of processors available after bootup. For example, a kernel with "maxcpus=2 nohz_full=0-2" parameters crashes at boot time on a virtual machine with 4 CPUs. Change housekeeping_setup() to use cpumask_first_and() and check that the returned CPU number is valid and less than setup_max_cpus. Another corner case is "nohz_full=0" on a machine with a single CPU or with the maxcpus=1 kernel argument. In this case non_housekeeping_mask is empty and tick_nohz_full_setup() makes no sense. And indeed, the kernel hits the WARN_ON(tick_nohz_full_running) in tick_sched_do_timer(). And how should the kernel interpret the "nohz_full=" parameter? It should be silently ignored, but currently cpulist_parse() happily returns the empty cpumask and this leads to the same problem. Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask) and do nothing in this case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240413141746.GA10008@redhat.com
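A sketch of the strengthened validation in housekeeping_setup(); the variable names are approximations of the ones described above:

    /* housekeeping_setup(), sketch of the maxcpus=/empty-mask handling */
    int first_cpu;

    first_cpu = cpumask_first_and(housekeeping_staging, cpu_present_mask);
    if (first_cpu >= nr_cpu_ids || first_cpu >= setup_max_cpus) {
            pr_warn("Housekeeping: must include one present CPU, using boot CPU:%d\n",
                    smp_processor_id());
            __cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
            __cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
    }

    /* "nohz_full=" etc. with an empty cpumask: silently ignore the parameter */
    if (cpumask_empty(non_housekeeping_mask))
            goto out_free;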
2024-04-28sched/isolation: Prevent boot crash when the boot CPU is nohz_fullOleg Nesterov
Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not include the boot CPU, which is no longer true after: 08ae95f4fd3b ("nohz_full: Allow the boot CPU to be nohz_full"). However after: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work") the kernel will crash at boot time in this case; housekeeping_any_cpu() returns an invalid CPU number until smp_init() brings the first housekeeping CPU up. Change housekeeping_any_cpu() to check the result of cpumask_any_and() and return smp_processor_id() in this case. This is just the simple and backportable workaround which fixes the symptom, but smp_processor_id() at boot time should be safe at least for type == HK_TYPE_TIMER, this more or less matches the tick_do_timer_boot_cpu logic. There is no worry about cpu_down(); tick_nohz_cpu_down() will not allow to offline tick_do_timer_cpu (the 1st online housekeeping CPU). Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work") Reported-by: Chris von Recklinghausen <crecklin@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240411143905.GA19288@redhat.com Closes: https://lore.kernel.org/all/20240402105847.GA24832@redhat.com/
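A sketch of the housekeeping_any_cpu() fallback described above (simplified; the warning/assertion details are omitted):

    /* housekeeping_any_cpu(), sketch of the boot-time fallback */
    int cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);

    if (cpu < nr_cpu_ids)
            return cpu;
    /*
     * No online housekeeping CPU yet: this can only happen early in boot
     * with a nohz_full boot CPU, before smp_init() brings the first
     * housekeeping CPU up. The current CPU is a safe answer here.
     */
    return smp_processor_id();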
2024-04-24sched/pelt: Remove shift of thermal clockVincent Guittot
The optional shift of the clock used by thermal/hw load avg has been introduced to handle case where the signal was not always a high frequency hw signal. Now that cpufreq provides a signal for firmware and SW pressure, we can remove this exception and always keep this PELT signal aligned with other signals. Mark sysctl_sched_migration_cost boot parameter as deprecated Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-6-vincent.guittot@linaro.org
2024-04-24sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()Vincent Guittot
Now that cpufreq provides a pressure value to the scheduler, rename arch_update_thermal_pressure into HW pressure to reflect that it returns a pressure applied by HW (i.e. with a high frequency change) and not always related to thermal mitigation but also generated by max current limitation as an example. Such a high frequency signal needs filtering to be smoothed and provide a value that reflects the average available capacity into the scheduler time scale. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-5-vincent.guittot@linaro.org
2024-04-24sched/cpufreq: Take cpufreq feedback into accountVincent Guittot
Aggregate the different pressures applied on the capacity of CPUs and create a new function that returns the actual capacity of the CPU: get_actual_cpu_capacity(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/r/20240326091616.3696851-3-vincent.guittot@linaro.org
2024-04-24sched/fair: Fix update of rd->sg_overutilizedVincent Guittot
sg_overloaded is used instead of sg_overutilized to update rd->sg_overutilized. Fixes: 4475cd8bfd9b ("sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240404155738.2866102-1-vincent.guittot@linaro.org
2024-04-24scheduler: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) rm sentinel element from ctl_table arrays Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Joel Granados <j.granados@samsung.com>
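For illustration, removing a sentinel simply drops the trailing all-zero entry from the array; the field values below are placeholders, not the actual scheduler ctl_table contents:

    static struct ctl_table example_sched_sysctls[] = {
            {
                    .procname     = "sched_example_knob",     /* placeholder name */
                    .data         = &sysctl_sched_example,    /* placeholder variable */
                    .maxlen       = sizeof(unsigned int),
                    .mode         = 0644,
                    .proc_handler = proc_dointvec,
            },
            /* the all-zero terminator entry {} that used to sit here is gone */
    };

The registration code now derives the array length from ARRAY_SIZE()-style information instead of scanning for an empty procname, which is what makes the ~64-byte sentinel per array unnecessary.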
2024-04-22sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf()Xuewen Yan
It was possible to have pick_eevdf() return NULL, which then causes a NULL-deref. This turned out to be due to entity_eligible() returning falsely negative because of a s64 multiplication overflow. Specifically, reweight_eevdf() computes the vlag without considering the limit placed upon vlag as update_entity_lag() does, and then the scaling multiplication (remember that weight is 20bit fixed point) can overflow. This then leads to the new vruntime being weird which then causes the above entity_eligible() to go side-ways and claim nothing is eligible. Thus limit the range of vlag accordingly. All this was quite rare, but fatal when it does happen. Closes: https://lore.kernel.org/all/ZhuYyrh3mweP_Kd8@nz.home/ Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@mail.gmail.com/ Closes: https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/ Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight") Reported-by: Sergei Trofimovich <slyich@gmail.com> Reported-by: Igor Raits <igor@gooddata.com> Reported-by: Breno Leitao <leitao@debian.org> Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Yujie Liu <yujie.liu@intel.com> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-and-tested-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240422082238.5784-1-xuewen.yan@unisoc.com
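The clamping described above can be sketched as a small helper that mirrors the limit used by update_entity_lag() (illustrative; names follow the description in the message):

    /* Bound the virtual lag before it is scaled by the old/new weights. */
    static s64 entity_lag(u64 avruntime, struct sched_entity *se)
    {
            s64 vlag, limit;

            vlag  = avruntime - se->vruntime;
            limit = calc_delta_fair(max_t(u64, 2 * se->slice, TICK_NSEC), se);

            /* keep |vlag| bounded so the 20-bit fixed-point multiply can't overflow */
            return clamp(vlag, -limit, limit);
    }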
2024-04-22sched/eevdf: Fix miscalculation in reweight_entity() when se is not currTianchen Ding
reweight_eevdf() only keeps V unchanged inside itself. When se != cfs_rq->curr, it would be dequeued from rb tree first. So that V is changed and the result is wrong. Pass the original V to reweight_eevdf() to fix this issue. Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> [peterz: flip if() condition for clarity] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Abel Wu <wuyun.abel@bytedance.com> Link: https://lkml.kernel.org/r/20240306022133.81008-3-dtcccc@linux.alibaba.com
2024-04-22sched/eevdf: Always update V if se->on_rq when reweightingTianchen Ding
reweight_eevdf() needs the latest V to do accurate calculation for new ve and vd. So update V unconditionally when se is runnable. Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight") Suggested-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Abel Wu <wuyun.abel@bytedance.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Link: https://lore.kernel.org/r/20240306022133.81008-2-dtcccc@linux.alibaba.com
2024-04-17sched/vtime: Get rid of generic vtime_task_switch() implementationAlexander Gordeev
The generic vtime_task_switch() implementation gets built only if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an architecture to implement arch_vtime_task_switch() callback at the same time, which is confusing. Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC architecture only and vtime_task_switch() generic variant is rather superfluous. Simplify the whole vtime_task_switch() wiring by moving the existing generic implementation to PowerPC. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/2cb6e3caada93623f6d4f78ad938ac6cd0e2fda8.1712760275.git.agordeev@linux.ibm.com
2024-04-16sched: Add missing memory barrier in switch_mm_cidMathieu Desnoyers
Many architectures' switch_mm() (e.g. arm64) do not have an smp_mb() which the core scheduler code has depended upon since:

  commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")

If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can unset the actively used cid when it fails to observe active task after it sets lazy_put.

There *is* a memory barrier between storing to rq->curr and _return to userspace_ (as required by membarrier), but the rseq mm_cid has stricter requirements: the barrier needs to be issued between store to rq->curr and switch_mm_cid(), which happens earlier than:

 - spin_unlock(),
 - switch_to().

So it's fine when the architecture switch_mm() happens to have that barrier already, but less so when the architecture only provides the full barrier in switch_to() or spin_unlock(). It is a bug in the rseq switch_mm_cid() implementation. All architectures that don't have memory barriers in switch_mm(), but rather have the full barrier either in finish_lock_switch() or switch_to() have them too late for the needs of switch_mm_cid().

Introduce a new smp_mb__after_switch_mm(), defined as smp_mb() in the generic barrier.h header, and use it in switch_mm_cid() for scheduler transitions where switch_mm() is expected to provide a memory barrier. Architectures can override smp_mb__after_switch_mm() if their switch_mm() implementation provides an implicit memory barrier. Override it with a no-op on x86 which implicitly provides this memory barrier by writing to CR3.

Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid") Reported-by: levi.yun <yeoreum.yun@arm.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> # for arm64 Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for x86 Cc: <stable@vger.kernel.org> # 6.4.x Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240415152114.59122-2-mathieu.desnoyers@efficios.com
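A sketch of the new barrier and its x86 override, following the description above (treat the exact placement as an approximation):

    /* include/asm-generic/barrier.h -- generic fallback */
    #ifndef smp_mb__after_switch_mm
    # define smp_mb__after_switch_mm()      smp_mb()
    #endif

    /*
     * arch/x86/include/asm/barrier.h -- the CR3 write in switch_mm()
     * already serializes, so the extra barrier can be a no-op there.
     */
    #define smp_mb__after_switch_mm()       do { } while (0)

switch_mm_cid() then issues smp_mb__after_switch_mm() on the scheduler transitions where switch_mm() is expected to have provided the ordering, so only architectures that genuinely lack it pay for a full barrier.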
2024-03-29sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flagsIngo Molnar
SG_OVERLOADED and SG_OVERUTILIZED flags plus the sg_status bitmask are an unnecessary complication that only make the code harder to read and slower.

We only ever set them separately:

  thule:~/tip> git grep SG_OVER kernel/sched/
  kernel/sched/fair.c:            set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
  kernel/sched/fair.c:                    *sg_status |= SG_OVERLOADED;
  kernel/sched/fair.c:                    *sg_status |= SG_OVERUTILIZED;
  kernel/sched/fair.c:                            *sg_status |= SG_OVERLOADED;
  kernel/sched/fair.c:    set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
  kernel/sched/fair.c:                               sg_status & SG_OVERUTILIZED);
  kernel/sched/fair.c:    } else if (sg_status & SG_OVERUTILIZED) {
  kernel/sched/fair.c:            set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
  kernel/sched/sched.h:#define SG_OVERLOADED              0x1 /* More than one runnable task on a CPU. */
  kernel/sched/sched.h:#define SG_OVERUTILIZED            0x2 /* One or more CPUs are over-utilized. */
  kernel/sched/sched.h:            set_rd_overloaded(rq->rd, SG_OVERLOADED);

And use them separately, which results in suboptimal code:

  /* update overload indicator if we are at root domain */
  set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);

  /* Update over-utilization (tipping point, U >= 0) indicator */
  set_rd_overutilized_status(env->dst_rq->rd,

Introduce separate sg_overloaded and sg_overutilized flags in update_sd_lb_stats() and its lower level functions, and change all of them to 'bool'.

Remove the now unused SG_OVERLOADED and SG_OVERUTILIZED flags.

Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Qais Yousef <qyousef@layalina.io> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVPhODZ8/nbsqbP@gmail.com
2024-03-28sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()Ingo Molnar
The _status() postfix has no real meaning, simplify the naming and harmonize it with set_rd_overloaded(). Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28sched/fair: Rename SG_OVERLOAD to SG_OVERLOADEDIngo Molnar
Follow the rename of the root_domain::overloaded flag. Note that this also matches the SG_OVERUTILIZED flag better. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()Ingo Molnar
Follow the rename of the root_domain::overloaded flag. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28sched/fair: Rename root_domain::overload to ::overloadedIngo Molnar
It is silly to use an ambiguous noun instead of a clear adjective when naming such a flag ... Note how root_domain::overutilized already used a proper adjective. rd->overloaded is now set to 1 when the root domain is overloaded. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28sched/fair: Use helper functions to access root_domain::overloadShrikanth Hegde
Introduce two helper functions to access & set the root_domain::overload flag: get_rd_overload() set_rd_overload() To make sure code is always following READ_ONCE()/WRITE_ONCE() access methods. No change in functionality intended. [ mingo: Renamed the accessors to get_/set_rd_overload(), tidied up the changelog. ] Suggested-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240325054505.201995-3-sshegde@linux.ibm.com
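The accessors are small wrappers along these lines (a sketch; the flag and helpers were later renamed to the ::overloaded spelling):

    static inline bool get_rd_overload(struct root_domain *rd)
    {
            return READ_ONCE(rd->overload);
    }

    static inline void set_rd_overload(struct root_domain *rd, bool status)
    {
            WRITE_ONCE(rd->overload, status);
    }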
2024-03-28sched/fair: Check root_domain::overload value before updateShrikanth Hegde
The root_domain::overload flag is 1 when there's any rq in the root domain that has 2 or more running tasks. (Ie. it's overloaded.) The root_domain structure itself is a global structure per cpuset island.

The ::overload flag is maintained the following way:

 - Set when adding a second task to the runqueue.
 - It is cleared in update_sd_lb_stats() during load balance, if none of the rqs have 2 or more running tasks.

This flag is used during newidle balance to see if it's worth doing a full load balance pass, which can be an expensive operation. If it is set, then newidle balance will try to aggressively pull a task.

Since commit:

  630246a06ae2 ("sched/fair: Clean-up update_sg_lb_stats parameters")

::overload is being written unconditionally, even if it has the same value. The change in value of this depends on the workload, but on typical workloads, it doesn't change all that often: a system is either dominantly overloaded for substantial amounts of time, or not.

Extra writes to this semi-global structure cause unnecessary overhead, extra bus traffic, etc. - so avoid it as much as possible.

Perf probe stats show that it's worth making this change (numbers are with patch applied):

  1M     probe:sched_balance_newidle_L38
  139    probe:update_sd_lb_stats_L53    <======  1->0 writes
  129K   probe:add_nr_running_L12
  74     probe:add_nr_running_L13        <======  0->1 writes
  54K    probe:update_sd_lb_stats_L50    <======  reads

These numbers prove that actual change in the ::overload value is (much) less frequent: L50 is much larger at ~54,000 accesses vs L53+L13 of 139+74.

[ mingo: Rewrote the changelog. ]

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20240325054505.201995-2-sshegde@linux.ibm.com
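The check-before-update pattern folds into the setter roughly as follows (sketch):

    static inline void set_rd_overload(struct root_domain *rd, int status)
    {
            /* Only dirty the shared cacheline when the value actually changes. */
            if (get_rd_overload(rd) != status)
                    WRITE_ONCE(rd->overload, status);
    }

The extra READ_ONCE() keeps the line in shared state on the common path, so the frequent redundant "set it again" requests from load balancing no longer cause cross-CPU cache bounces.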
2024-03-28sched/fair: Combine EAS check with root_domain::overutilized accessShrikanth Hegde
Access to root_domain::overutilized is always paired with sched_energy_enabled in the pattern:

  if (sched_energy_enabled && !overutilized)
          do something

So modify the helper function to utilize this pattern. This is more readable code as it would say, do something when the root domain is not overutilized. This function always returns true when EAS is disabled. No change in functionality intended. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240326152616.380999-1-sshegde@linux.ibm.com
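With the EAS check folded in, the helper reads roughly as (sketch):

    /* Consider the root domain "not overutilized" whenever EAS is off. */
    static inline bool is_rd_overutilized(struct root_domain *rd)
    {
            return !sched_energy_enabled() || READ_ONCE(rd->overutilized);
    }

Callers can then simply test !is_rd_overutilized(rd) instead of repeating the sched_energy_enabled() && !overutilized combination at every site.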
2024-03-26sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()Shrikanth Hegde
newidle(CPU_NEWLY_IDLE) balancing doesn't stop the load-balancing if the continue_balancing flag is reset, but the other two balancing (IDLE, BUSY) cases do that. newidle balance stops the load balancing if rq has a task or there is wakeup pending. The same checks are present in should_we_balance for newidle. Hence use the return value and simplify continue_balancing mechanism for newidle. Update the comment surrounding it as well. No change in functionality intended. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20240325153926.274284-1-sshegde@linux.ibm.com
2024-03-26sched/fair: Introduce is_rd_overutilized() helper function to access root_domain::overutilizedShrikanth Hegde
The root_domain::overutilized field is READ_ONCE() accessed in multiple places, which could be simplified with a helper function. This might also make it more apparent that it needs to be used only in case of EAS. No change in functionality intended. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240307085725.444486-3-sshegde@linux.ibm.com
2024-03-26sched/fair: Add EAS checks before updating root_domain::overutilizedShrikanth Hegde
root_domain::overutilized is only used for EAS (energy aware scheduler) to decide whether to do load balance or not. It is not used if EAS is not possible.

Currently enqueue_task_fair and task_tick_fair access, and sometimes update, this field. In update_sd_lb_stats it is updated often. This causes cache contention due to true sharing and burns a lot of cycles. ::overload and ::overutilized are part of the same cacheline. Updating it often invalidates the cacheline. That causes access to ::overload to slow down due to false sharing. Hence add an EAS check before accessing/updating this field. The EAS check is optimized at compile time or it is a static branch. Hence it shouldn't cost much.

With the patch, both enqueue_task_fair and newidle_balance don't show up as hot routines in perf profile:

  6.8-rc4:
  7.18%  swapper  [kernel.vmlinux]  [k] enqueue_task_fair
  6.78%  s        [kernel.vmlinux]  [k] newidle_balance

  +patch:
  0.14%  swapper  [kernel.vmlinux]  [k] enqueue_task_fair
  0.00%  swapper  [kernel.vmlinux]  [k] newidle_balance

While at it: trace_sched_overutilized_tp expects that the second argument is bool. So do an int to bool conversion for that.

Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator") Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240307085725.444486-2-sshegde@linux.ibm.com