path: root/kernel/sched/core.c
Age | Commit message | Author
2024-09-03 | sched: Rework dl_server | Peter Zijlstra
When a task is selected through a dl_server, it will have p->dl_server set, such that it can account runtime to the dl_server, see update_curr_task().
Currently p->dl_server is set in pick*task() whenever it goes through the dl_server; clearing it is a bit of a mess, though. The trivial solution is clearing it on the final put (now that we have this location). However, this gives a problem when:
  p = pick_task(rq);
  if (p)
    put_prev_set_next_task(rq, prev, next);
picks the same task but through a different path, notably when it goes from picking through the dl_server to a direct pick or vice-versa. In that case we cannot readily determine whether we should clear or preserve p->dl_server.
An additional complication is pick_*task() setting p->dl_server for a remote pick; it might still need to update runtime before it schedules the core_pick.
Close all these holes and remove all the random clearing of p->dl_server by:
 - having pick_*task() manage rq->dl_server
 - having the final put_prev_task() clear p->dl_server
 - having the first set_next_task() set p->dl_server = rq->dl_server
 - complicate the core_sched code to save/restore rq->dl_server where appropriate.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.259853414@infradead.org
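The ownership rules listed in this commit message can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: the `struct task`, `struct rq` and `model_*` names are simplified stand-ins invented for illustration, not the kernel's definitions.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumptions); only
 * the fields needed to illustrate the hand-off are modelled. */
struct dl_server { int id; };

struct task {
    struct dl_server *dl_server; /* server the task was picked through, or NULL */
};

struct rq {
    struct dl_server *dl_server; /* written by the pick path, read by set_next */
};

/* pick_*task() only records on the runqueue how the task was found. */
static struct task *model_pick_task(struct rq *rq, struct task *candidate,
                                    struct dl_server *via)
{
    rq->dl_server = via; /* NULL means a direct pick */
    return candidate;
}

/* The final put_prev_task() clears the per-task pointer ... */
static void model_put_prev_task(struct task *prev)
{
    prev->dl_server = NULL;
}

/* ... and the first set_next_task() copies the runqueue's value. */
static void model_set_next_task(struct rq *rq, struct task *next)
{
    next->dl_server = rq->dl_server;
}

/* Clear-then-set also gives the right answer when prev == next. */
static void model_put_prev_set_next_task(struct rq *rq, struct task *prev,
                                         struct task *next)
{
    model_put_prev_task(prev);
    model_set_next_task(rq, next);
}
```

Because the clear always precedes the set, re-picking the same task through a different path ends with `p->dl_server` reflecting the most recent pick, which is the case the message describes as hard to get right.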
2024-09-03 | sched: Combine the last put_prev_task() and the first set_next_task() | Peter Zijlstra
Ensure the last put_prev_task() and the first set_next_task() always go together. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.158454756@infradead.org
2024-09-03 | sched: Rework pick_next_task() | Peter Zijlstra
The current rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) And many classes implement it directly as such. Change things around to make pick_next_task() optional while also changing the definition to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) The reason is that sched_ext would like to have a 'final' call that knows the next task. By placing put_prev_task() right next to set_next_task() (as it already is for sched_core) this becomes trivial. As a bonus, this is a nice cleanup on its own. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org
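The new definition of pick_next_task(prev) can be sketched as a generic fallback that runs when a class does not supply its own pick_next_task(). The block below is a user-space model, assuming a single class; `struct sched_class_model` and the `simple_*` callbacks are hypothetical names, not the kernel's.

```c
#include <stdbool.h>
#include <stddef.h>

struct task { int id; int put_calls; int set_first_calls; };
struct rq { struct task *queued; };

/* Hypothetical, trimmed-down class descriptor. */
struct sched_class_model {
    struct task *(*pick_task)(struct rq *rq);
    struct task *(*pick_next_task)(struct rq *rq, struct task *prev); /* optional */
    void (*put_prev_task)(struct rq *rq, struct task *prev);
    void (*set_next_task)(struct rq *rq, struct task *next, bool first);
};

static struct task *simple_pick_task(struct rq *rq)
{
    return rq->queued;
}

static void simple_put_prev_task(struct rq *rq, struct task *prev)
{
    (void)rq;
    prev->put_calls++;
}

static void simple_set_next_task(struct rq *rq, struct task *next, bool first)
{
    (void)rq;
    if (first)
        next->set_first_calls++;
}

/* pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) */
static struct task *model_pick_next_task(const struct sched_class_model *class,
                                         struct rq *rq, struct task *prev)
{
    struct task *p;

    if (class->pick_next_task)
        return class->pick_next_task(rq, prev);

    p = class->pick_task(rq);
    if (p) {
        /* put and set sit next to each other, so the put knows what runs next */
        class->put_prev_task(rq, prev);
        class->set_next_task(rq, p, true);
    }
    return p;
}

static const struct sched_class_model simple_class = {
    .pick_task      = simple_pick_task,
    .pick_next_task = NULL, /* rely on the generic composition */
    .put_prev_task  = simple_put_prev_task,
    .set_next_task  = simple_set_next_task,
};
```

Note that nothing is put or set when the pick returns NULL, which keeps the last put and the first set paired.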
2024-09-03 | sched: Split up put_prev_task_balance() | Peter Zijlstra
With the goal of pushing put_prev_task() after pick_task() / into pick_next_task(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.943143811@infradead.org
2024-09-03 | sched: Use set_next_task(.first) where required | Peter Zijlstra
Turns out the core_sched bits forgot to use the set_next_task(.first=true) variant. Notably: pick_next_task() := pick_task() + set_next_task(.first = true) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.614146342@infradead.org
2024-09-01 | task_stack: uninline stack_not_used | Pasha Tatashin
Given that stack_not_used() is not a performance-critical function, uninline it. Link: https://lkml.kernel.org/r/20240730150158.832783-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240724203322.2765486-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-20 | Merge branch 'tip/sched/core' into for-6.12 | Tejun Heo
To receive 863ccdbb918a ("sched: Allow sched_class::dequeue_task() to fail") which makes sched_class.dequeue_task() return bool instead of void. This leads to compile breakage and will be fixed by a follow-up patch. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-17 | sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion | Peter Zijlstra
Allow applications to directly set a suggested request/slice length using sched_attr::sched_runtime.
The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms], which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.
Applications should strive to use their periodic runtime at a high confidence interval (95%+) as the target slice. Using a smaller slice will introduce undue preemptions, while using a larger value will increase latency.
For all the following examples assume a scheduling quantum of 8, and for consistency all examples have W=4:
{A,B,C,D}(w=1,r=8): ABCD...
  [timeline diagrams: t=0, V=1.5; t=1, V=3.5; t=2, V=5.5; t=3, V=7.5]
  Note: 4 identical tasks in FIFO order
{A,B}(w=1,r=16) C(w=2,r=16): AACCBBCC...
  [timeline diagrams: t=0, V=1.25; t=2, V=5.25; t=4, V=8.25; t=6, V=12.25]
  Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2 task doesn't go below q.
  Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length.
  Note: the period of the heavy task is half the full period at: W*(r_i/w_i) = 4*(2q/2) = 4q
{A,C,D}(w=1,r=16) B(w=1,r=8): BAACCBDD...
  [timeline diagrams: t=0, V=1.5; t=1, V=3.5; t=3, V=7.5; t=5, V=11.5; t=6, V=13.5]
  Note: 1 short task -- again double r so that the deadline of the short task won't be below q. Made B short because it's not the leftmost task, but is eligible with the 0,1,2,3 spread.
  Note: like with the heavy task, the period of the short task observes: W*(r_i/w_i) = 4*(1q/1) = 4q
A(w=1,r=16) B(w=1,r=8) C(w=2,r=16): BCCAABCC...
  [timeline diagrams: t=0, V=1.25; t=1, V=3.25; t=3, V=7.25; t=5, V=11.25; t=6, V=13.25]
  Note: 1 heavy and 1 short task -- combine them all.
  Note: both the short and heavy task end up with a period of 4q
A(w=1,r=16) B(w=2,r=16) C(w=1,r=8): BBCAABBC...
  [timeline diagrams: t=0, V=1; t=2, V=5; t=3, V=7; t=5, V=11; t=7, V=15]
  Note: as before but permuted
From all this it can be deduced that, for the steady state:
 - the total period (P) of a schedule is: W*max(r_i/w_i)
 - the average period of a task is: W*(r_i/w_i)
 - each task obtains the fair share: w_i/W of each full period P
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.842834421@infradead.org
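The clamp and the steady-state formulas from this commit message can be checked with a small calculation. The following is a user-space sketch, not the kernel implementation: the helper names and `struct entity` are assumptions made for illustration, and integer arithmetic is used because every example in the message divides evenly.

```c
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Clamp a requested slice to 0.1 ms <= slice <= 100 ms (in nanoseconds). */
static uint64_t clamp_slice_ns(uint64_t requested_ns)
{
    const uint64_t lo = NSEC_PER_MSEC / 10;
    const uint64_t hi = NSEC_PER_MSEC * 100;

    if (requested_ns < lo)
        return lo;
    if (requested_ns > hi)
        return hi;
    return requested_ns;
}

/* Hypothetical description of one task: weight w_i and request r_i. */
struct entity {
    unsigned int weight;
    unsigned int request;
};

/* W: the sum of all weights. */
static unsigned int sum_weights(const struct entity *es, int n)
{
    unsigned int w = 0;

    for (int i = 0; i < n; i++)
        w += es[i].weight;
    return w;
}

/* Average period of one task: W * (r_i / w_i). */
static unsigned int task_period(const struct entity *e, unsigned int total_weight)
{
    return total_weight * e->request / e->weight;
}

/* Total period of the schedule: P = W * max(r_i / w_i). */
static unsigned int schedule_period(const struct entity *es, int n)
{
    unsigned int w = sum_weights(es, n);
    unsigned int max = 0;

    for (int i = 0; i < n; i++) {
        unsigned int p = task_period(&es[i], w);

        if (p > max)
            max = p;
    }
    return max;
}
```

For the `{A,B}(w=1,r=16) C(w=2,r=16)` example with q=8, this gives W=4, a total period of 64 (8q), and a period of 32 (4q) for the heavy task C, matching the notes in the message.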
2024-08-17 | sched: Teach dequeue_task() about special task states | Peter Zijlstra
Since special task states must not suffer spurious wakeups, and the proposed delayed dequeue can cause exactly these (under some boundary conditions), propagate this knowledge into dequeue_task() such that it can do the right thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.110439521@infradead.org
2024-08-17 | sched/uclamg: Handle delayed dequeue | Peter Zijlstra
Delayed dequeue has tasks sit around on the runqueue that are not actually runnable -- specifically, they will be dequeued the moment they get picked. One side-effect is that such a task can get migrated, which leads to a 'nested' dequeue_task() scenario that messes up uclamp if we don't take care. Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on the runqueue. This however will have removed the task from uclamp -- per uclamp_rq_dec() in dequeue_task(). So far so good. However, if at that point the task gets migrated -- or nice adjusted or any of a myriad of operations that does a dequeue-enqueue cycle -- we'll pass through dequeue_task()/enqueue_task() again. Without modification this will lead to a double decrement for uclamp, which is wrong. Reported-by: Luis Machado <luis.machado@arm.com> Reported-by: Hongyan Xia <hongyan.xia2@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.315205425@infradead.org
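The double decrement described here, and one way to avoid it, can be modelled in user space. This is a sketch under assumptions: the `delayed` flag, the `model_*` functions and the counter are hypothetical simplifications, and the guard shown (skip uclamp accounting while the task is in the delayed state) is one plausible shape for the fix rather than a copy of the kernel code.

```c
#include <stdbool.h>

struct rq_model {
    int uclamp_tasks;   /* tasks currently accounted in uclamp */
};

struct task_model {
    bool on_rq;
    bool delayed;       /* hypothetical flag: a sleep dequeue was deferred */
};

/* Accounting is skipped while the task sits on the runqueue in the
 * delayed state, because it has already been removed from uclamp. */
static void model_uclamp_inc(struct rq_model *rq, const struct task_model *p)
{
    if (p->delayed)
        return;
    rq->uclamp_tasks++;
}

static void model_uclamp_dec(struct rq_model *rq, const struct task_model *p)
{
    if (p->delayed)
        return;
    rq->uclamp_tasks--;
}

static void model_enqueue_task(struct rq_model *rq, struct task_model *p)
{
    p->on_rq = true;
    model_uclamp_inc(rq, p);
}

/* Returns false when the dequeue 'fails' and the task stays queued. */
static bool model_dequeue_task(struct rq_model *rq, struct task_model *p,
                               bool sleep, bool defer)
{
    model_uclamp_dec(rq, p);
    if (sleep && defer) {
        p->delayed = true;  /* task remains on the runqueue */
        return false;
    }
    p->on_rq = false;
    return true;
}
```

With the guard in place, a migration-style dequeue/enqueue cycle on a delayed task leaves the counter untouched instead of driving it negative.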
2024-08-17 | sched: Prepare generic code for delayed dequeue | Peter Zijlstra
While most of the delayed dequeue code can be done inside the sched_class itself, there is one location where we do not have an appropriate hook, namely ttwu_runnable(). Add an ENQUEUE_DELAYED call to the on_rq path to deal with waking delayed dequeue tasks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.200000445@infradead.org
2024-08-17 | sched: Split DEQUEUE_SLEEP from deactivate_task() | Peter Zijlstra
As a preparation for dequeue_task() failing, and a second code-path needing to take care of the 'success' path, split out the DEQUEUE_SLEEP path from deactivate_task(). Much thanks to Libo for spotting and fixing a TASK_ON_RQ_MIGRATING ordering fail. Fixed-by: Libo Chen <libo.chen@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.086192709@infradead.org
2024-08-17 | sched: Allow sched_class::dequeue_task() to fail | Peter Zijlstra
Change the function signature of sched_class::dequeue_task() to return a boolean, allowing future patches to 'fail' dequeue. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org
2024-08-15 | rcu: Let dump_cpu_task() be used without preemption disabled | Ryo Takakura
Commit 2d7f00b2f0130 ("rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()") disabled preemption around dump_cpu_task() to suppress a warning on its usage within preemptible context. Calling dump_cpu_task() does not require a non-preemptible context except for suppressing the smp_processor_id() warning. As smp_processor_id() is evaluated along with in_hardirq() to check if it's in interrupt context, this patch removes the need for disabling preemption by reordering the condition so that smp_processor_id() only gets evaluated when it's in interrupt context. Signed-off-by: Ryo Takakura <takakura@valinux.co.jp> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
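The reordering relies on C's short-circuit evaluation of `&&`. A minimal user-space illustration follows; the `model_*` functions and the `fake_*` variables are stand-ins invented for this sketch, with a call counter showing when the CPU id is actually queried.

```c
#include <stdbool.h>

static int cpu_id_queries;      /* how often the CPU id was read */
static bool fake_in_hardirq;    /* pretend interrupt-context state */
static int fake_cpu;            /* pretend current CPU */

static bool model_in_hardirq(void)
{
    return fake_in_hardirq;
}

static int model_smp_processor_id(void)
{
    cpu_id_queries++;
    return fake_cpu;
}

/* The interrupt-context test comes first, so the CPU id is only read
 * when the left-hand side is true. */
static bool is_local_irq_context(int cpu)
{
    return model_in_hardirq() && cpu == model_smp_processor_id();
}
```

Outside interrupt context the right-hand operand is never evaluated, so whatever check is attached to reading the CPU id cannot fire there.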
2024-08-07 | sched/rt: Rename realtime_{prio, task}() to rt_or_dl_{prio, task}() | Qais Yousef
Some find the name realtime overloaded. Use rt_or_dl() as an alternative, hopefully better, name. Suggested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240610192018.1567075-4-qyousef@layalina.io
2024-08-07 | sched/rt: Clean up usage of rt_task() | Qais Yousef
rt_task() checks if a task has RT priority. But depending on your dictionary, this could mean it belongs to the RT class, or is a 'realtime' task, which includes the RT and DL classes. Since this has caused some confusion already in discussion [1], it seemed a clean up is due. I define the usage of rt_task() to be tasks that belong to the RT class. Make sure that it returns true only for the RT class, audit the users, and replace the ones that required the old behavior with the new realtime_task(), which returns true for the RT and DL classes. Introduce a similar realtime_prio() to create a similar distinction to rt_prio() and update the users that required the old behavior to use the new function. Move MAX_DL_PRIO to prio.h so it can be used in the new definitions. Document the functions to make the difference between them more obvious. PI-boosted tasks are a factor that must be taken into account when choosing which function to use. Rename task_is_realtime() to realtime_task_policy() as the old name is confusing against the new realtime_task(). No functional changes were intended. [1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/ Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240610192018.1567075-2-qyousef@layalina.io
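The distinction between the two predicates can be pictured as range checks on the priority value. The sketch below is illustrative only: the constant values (`MAX_DL_PRIO` as 0, `MAX_RT_PRIO` as 100) and the exact comparisons are assumptions, and the `model_*` names are invented so as not to suggest these are the kernel's definitions.

```c
#include <stdbool.h>

/* Assumed values for illustration: deadline priorities lie below
 * MAX_DL_PRIO, RT priorities lie in [MAX_DL_PRIO, MAX_RT_PRIO). */
#define MAX_DL_PRIO 0
#define MAX_RT_PRIO 100

/* True only for priorities in the deadline range. */
static bool model_dl_prio(int prio)
{
    return prio < MAX_DL_PRIO;
}

/* Narrow meaning: the RT class only. */
static bool model_rt_prio(int prio)
{
    return prio >= MAX_DL_PRIO && prio < MAX_RT_PRIO;
}

/* Broad meaning: 'realtime' in the sense of RT or DL. */
static bool model_realtime_prio(int prio)
{
    return prio < MAX_RT_PRIO;
}
```

Under these assumptions a deadline priority of -1 satisfies the broad predicate but not the narrow one, which is the difference the audit of callers had to account for.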
2024-08-06 | sched_ext: Make task_can_run_on_remote_rq() use common task_allowed_on_cpu() | Tejun Heo
task_can_run_on_remote_rq() is similar to is_cpu_allowed() but there are subtle differences. It currently open codes all the tests. This is cumbersome to understand and error-prone in case the intersecting tests need to be updated. Factor out the common part - testing whether the task is allowed on the CPU at all regardless of the CPU state - into task_allowed_on_cpu() and make both is_cpu_allowed() and SCX's task_can_run_on_remote_rq() use it. As the code is now linked between the two and each contains only the extra tests that differ between them, it's less error-prone when the conditions need to be updated. Also, improve the comment to explain why they are different. v2: Replace accidental "extern inline" with "static inline" (Peter). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com>
2024-08-06 | sched_ext: Simplify UP support by enabling sched_class->balance() in UP | Tejun Heo
On SMP, SCX performs dispatch from sched_class->balance(). As balance() was not available in UP, it instead called the internal balance function from put_prev_task_scx() and pick_next_task_scx() to emulate the effect, which is rather nasty. Enabling sched_class->balance() on UP shouldn't cause any meaningful overhead. Enable balance() on UP and drop the ugly workaround. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com>
2024-08-06 | sched_ext: Add scx_enabled() test to @start_class promotion in put_prev_task_balance() | Tejun Heo
SCX needs its balance() invoked even when waking up from a lower priority sched class (idle) and put_prev_task_balance() thus has the logic to promote @start_class if it's lower than ext_sched_class. This is only needed when SCX is enabled. Add scx_enabled() test to avoid unnecessary overhead when SCX is disabled. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com>
2024-08-06 | sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick() | Tejun Heo
The way sched_can_stop_tick() used scx_can_stop_tick() was rather confusing and the behavior wasn't ideal when SCX is enabled in partial mode. Simplify it so that: - scx_can_stop_tick() can say no if scx_enabled(). - CFS tests rq->cfs.nr_running > 1 instead of rq->nr_running. This is easier to follow and leads to the correct answer whether SCX is disabled, enabled in partial mode or all tasks are switched to SCX. Peter, note that this is a bit different from your suggestion where sched_can_stop_tick() unconditionally returns scx_can_stop_tick() iff scx_switched_all(). The problem is that in partial mode, tick can be stopped when there is only one SCX task even if the BPF scheduler didn't ask and isn't ready for it. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com>
2024-08-04 | Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.12 | Tejun Heo
Pull tip/sched/core to resolve the following four conflicts. While 2-4 are simple context conflicts, 1 is a bit subtle and easy to resolve incorrectly.
1. 2c8d046d5d51 ("sched: Add normal_policy()") vs. faa42d29419d ("sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchy")
   The former converts direct test on p->policy to use the helper normal_policy(). The latter moves the p->policy test to a different location. Resolve by converting the test on p->policy in the new location to use normal_policy().
2. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class") vs. a110a81c52a9 ("sched/deadline: Deferrable dl server")
   Both add calls to put_prev_task_idle() and set_next_task_idle(). Simple context conflict. Resolve by taking changes from both.
3. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class") vs. c245910049d0 ("sched/core: Add clearing of ->dl_server in put_prev_task_balance()")
   The former changes for_each_class() iteration to use for_each_active_class(). The latter moves away the adjacent dl_server handling code. Simple context conflict. Resolve by taking changes from both.
4. 60c27fb59f6c ("sched_ext: Implement sched_ext_ops.cpu_online/offline()") vs. 31b164e2e4af ("sched/smt: Introduce sched_smt_present_inc/dec() helper") 2f027354122f ("sched/core: Introduce sched_set_rq_on/offline() helper")
   The former adds scx_rq_deactivate() call. The latter two change code around it. Simple context conflict. Resolve by taking changes from both.
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-30 | Merge tag 'v6.11-rc1' into for-6.12 | Tejun Heo
Linux 6.11-rc1
2024-07-29 | sched/rt: Remove default bandwidth control | Peter Zijlstra
Now that fair_server exists, we no longer need RT bandwidth control unless RT_GROUP_SCHED. Enable fair_server with parameters equivalent to RT throttling. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
2024-07-29 | sched/core: Fix priority checking for DL server picks | Joel Fernandes (Google)
In core scheduling, a DL server pick (which is a CFS task) should be given higher priority than tasks in other classes. Not doing so causes CFS starvation. A kselftest is added later to demonstrate this. Without this fix, a CFS task that is competing with RT tasks can be completely starved and the DL server's boosting completely ignored. Fix these problems. Reported-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/48b78521d86f3b33c24994d843c1aad6b987dda9.1716811044.git.bristot@kernel.org
2024-07-29 | sched/fair: Add trivial fair server | Peter Zijlstra
Use deadline servers to service fair tasks. This patch adds a fair_server deadline entity which acts as a container for fair entities and can be used to fix starvation when higher priority (wrt fair) tasks are monopolizing CPU(s). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/b6b0bcefaf25391bcf5b6ecdb9f1218de402d42e.1716811044.git.bristot@kernel.org
2024-07-29 | sched/core: Clear prev->dl_server in CFS pick fast path | Youssef Esmat
In case the previous pick was a DL server pick, ->dl_server might be set. Clear it in the fast path as well. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: Youssef Esmat <youssefesmat@google.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/7f7381ccba09efcb4a1c1ff808ed58385eccc222.1716811044.git.bristot@kernel.org
2024-07-29 | sched/core: Add clearing of ->dl_server in put_prev_task_balance() | Joel Fernandes (Google)
Paths using put_prev_task_balance() need to do a pick shortly after. Make sure they also clear the ->dl_server on prev as a part of that. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/d184d554434bedbad0581cb34656582d78655150.1716811044.git.bristot@kernel.org
2024-07-29 | sched: remove HZ_BW feature hedge | Phil Auld
As a hedge against unexpected user issues commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in use") included a scheduler feature to disable the new functionality. It's been a few releases (v6.6) and no screams, so remove it. Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20240515133705.3632915-1-pauld@redhat.com
2024-07-29 | sched/core: Add WARN_ON_ONCE() to check overflow for migrate_disable() | Peilin He
Background
==========
When repeated migrate_disable() calls are made without the corresponding migrate_enable() calls, there is a risk of 'migration_disabled' overflowing, because 'migration_disabled' is an unsigned short whose max value is 65535. In a PREEMPT_RT kernel, if 'migration_disabled' overflows, it may make migrate_disable() ineffective within local_lock_irqsave(). This is because, during the scheduling procedure, the value of 'migration_disabled' will be checked, which can trigger CPU migration. Consequently, the count of 'rcu_read_lock_nesting' may leak due to local_lock_irqsave() and local_unlock_irqrestore() occurring on different CPUs.
Usecase
========
For example, when I developed a driver, I encountered a warning like "WARNING: CPU: 4 PID: 260 at kernel/rcu/tree_plugin.h:315 rcu_note_context_switch+0xa8/0x4e8". It took me half a month to locate this issue. Ultimately, I discovered that the lack of an overflow detection mechanism in migrate_disable() was the root cause, leading to a significant amount of time spent on problem localization. If an overflow detection mechanism were added to migrate_disable(), the root cause could be identified quickly and easily.
Effect
======
Using WARN_ON_ONCE() to check whether 'migration_disabled' has overflowed can help developers identify the issue quickly.
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Peilin He <he.peilin@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn> Reviewed-by: Qiang Tu <tu.qiang35@zte.com.cn> Reviewed-by: Kun Jiang <jiang.kun2@zte.com.cn> Reviewed-by: Fan Yu <fan.yu9@zte.com.cn> Link: https://lkml.kernel.org/r/20240716104244764N2jD8gnBpnsLjCDnQGQ8c@zte.com.cn
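A warn-once check on an unsigned short nesting counter can be sketched in user space as follows. The threshold (warn once the count reaches half of the 16-bit range), the `overflow_warned` latch and the `model_*` names are assumptions for illustration; the message only states that WARN_ON_ONCE() is used to detect the overflow.

```c
#include <stdbool.h>
#include <stdio.h>

struct task_model {
    unsigned short migration_disabled; /* nesting depth, max 65535 */
};

static bool overflow_warned; /* emulates the 'once' part of WARN_ON_ONCE() */

static void model_migrate_disable(struct task_model *p)
{
    /* Assumed threshold: a depth of 0x8000 or more is treated as a
     * leak of migrate_enable() calls, long before wrapping to 0. */
    if (p->migration_disabled >= 0x8000 && !overflow_warned) {
        overflow_warned = true;
        fprintf(stderr, "migration_disabled is about to overflow\n");
    }
    p->migration_disabled++;
}

static void model_migrate_enable(struct task_model *p)
{
    if (p->migration_disabled > 0)
        p->migration_disabled--;
}
```

A balanced caller never gets near the threshold, so the warning points directly at the code path that forgot to re-enable migration.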
2024-07-29 | sched: Initialize the vruntime of a new task when it is first enqueued | Zhang Qiao
When creating a new task, we initialize the vruntime of the new task at sched_cgroup_fork(). However, this is too early and may not be accurate, because it uses the current CPU to init the vruntime, but the new task actually runs on the CPU that is assigned at wake_up_new_task(). To optimize this case, we pass the ENQUEUE_INITIAL flag to activate_task() in wake_up_new_task(); this way, when place_entity() is called in enqueue_entity(), the vruntime of the new task will be initialized.
In addition, place_entity() in task_fork_fair() was introduced for two reasons:
1. Previously, the __enqueue_entity() was in task_new_fair(); in order to provide a vruntime for enqueueing the new task, the vruntime assignment equation "se->vruntime = cfs_rq->min_vruntime" was introduced by commit e9acbff6484d ("sched: introduce se->vruntime"). This is the initial state of place_entity().
2. Commit 4d78e7b656aa ("sched: new task placement for vruntime") added the child_runs_first task placement feature, which is based on vruntime; this also requires the new task's vruntime value.
After removing child_runs_first and enqueue_entity() from task_fork_fair(), this place_entity() no longer makes sense, so remove it also.
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240627133359.1370598-1-zhangqiao22@huawei.com
2024-07-29 | sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate() | Yang Yingliang
If cpuset_cpu_inactive() fails, set_rq_online() needs to be called to roll back. Fixes: 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com
2024-07-29 | sched/core: Introduce sched_set_rq_on/offline() helper | Yang Yingliang
Introduce the sched_set_rq_on/offline() helpers, so they can be called simply in the normal or error path. No functional change. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com
2024-07-29 | sched/smt: Fix unbalance sched_smt_present dec/inc | Yang Yingliang
I got the following warning report while doing a stress test:
  jump label: negative count!
  WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
  Call Trace:
   <TASK>
   __static_key_slow_dec_cpuslocked+0x16/0x70
   sched_cpu_deactivate+0x26e/0x2a0
   cpuhp_invoke_callback+0x3ad/0x10d0
   cpuhp_thread_fun+0x3f5/0x680
   smpboot_thread_fn+0x56d/0x8d0
   kthread+0x309/0x400
   ret_from_fork+0x41/0x70
   ret_from_fork_asm+0x1b/0x30
   </TASK>
This happens because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the CPU offline fails, but sched_smt_present is decremented before calling sched_cpu_deactivate(), which leads to an unbalanced dec/inc. Fix it by incrementing sched_smt_present in the error path.
Fixes: c5511d03ec09 ("sched/smt: Make sched_smt_present track topology") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
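The pattern behind this fix, undoing a decrement when a later step fails, can be shown with a plain counter. This is a user-space sketch: the counter stands in for the static key's reference count, and the `model_*` functions and the error code passed in are hypothetical.

```c
/* Stand-in for the reference count behind sched_smt_present. */
static int smt_present_count;

static void model_smt_present_inc(void)
{
    smt_present_count++;
}

static void model_smt_present_dec(void)
{
    smt_present_count--;
}

/* Returns 0 on success, or the (negative) error from the step that
 * can fail. On failure the earlier decrement is rolled back so that
 * a later retry does not decrement a second time. */
static int model_cpu_deactivate(int cpuset_result)
{
    model_smt_present_dec();

    if (cpuset_result) {
        model_smt_present_inc(); /* error path: restore the count */
        return cpuset_result;
    }
    return 0;
}

static void model_cpu_activate(void)
{
    model_smt_present_inc();
}
```

Without the increment in the error path, a failed offline followed by a successful one would decrement twice for a single activation, which is the kind of imbalance behind a "negative count" warning.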
2024-07-29 | sched/smt: Introduce sched_smt_present_inc/dec() helper | Yang Yingliang
Introduce the sched_smt_present_inc/dec() helpers, so they can be called simply in the normal or error path. No functional change. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com
2024-07-29 | treewide: context_tracking: Rename CONTEXT_* into CT_STATE_* | Valentin Schneider
Context tracking state related symbols currently use a mix of the CONTEXT_ (e.g. CONTEXT_KERNEL) and CT_STATE_ (e.g. CT_STATE_MASK) prefixes. Clean up the naming and make the ctx_state enum use the CT_STATE_ prefix. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-24 | sysctl: treewide: constify the ctl_table argument of proc_handlers | Joel Granados
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch

@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int write, void *buffer, size_t *lenp, loff_t *ppos);

@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }

@r3@
identifier func;
@@

int func(
- struct ctl_table *
+ const struct ctl_table *
  ,int , void *, size_t *, loff_t *);

@r4@
identifier func, ctl;
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int , void *, size_t *, loff_t *);

@r5@
identifier func, write, buffer, lenp, ppos;
@@

int func(
- struct ctl_table *
+ const struct ctl_table *
  ,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax were adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Co-developed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>
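What a handler looks like after this kind of conversion can be sketched with a reduced table type. Everything below is a stand-in written for illustration: `struct ctl_table_model`, `model_dointvec()` and the use of `long long` in place of `loff_t` are assumptions, not the kernel's declarations.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down table entry. */
struct ctl_table_model {
    const char *procname;
    int *data;            /* the tunable this entry exposes */
};

/* The table pointer is const: the handler may read the entry and may
 * write through ctl->data, but cannot modify the entry itself. */
static int model_dointvec(const struct ctl_table_model *ctl, int write,
                          void *buffer, size_t *lenp, long long *ppos)
{
    int *value = (int *)buffer;

    (void)ppos;
    if (*lenp < sizeof(int))
        return -1;

    if (write)
        *ctl->data = *value;  /* const covers the table, not the int it points to */
    else
        *value = *ctl->data;

    *lenp = sizeof(int);
    return 0;
}
```

An assignment such as `ctl->data = NULL;` inside the handler would now be rejected by the compiler, which is the property needed before such tables can be placed in read-only memory.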
2024-07-16 | Merge tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull scheduler updates from Ingo Molnar:
 - Update Daniel Bristot de Oliveira's entry in MAINTAINERS, and credit him in CREDITS
 - Harmonize the lock-yielding behavior on dynamically selected preemption models with static ones
 - Reorganize the code a bit: split out sched/syscalls.c to reduce the size of sched/core.c
 - Micro-optimize psi_group_change()
 - Fix set_load_weight() for SCHED_IDLE tasks
 - Misc cleanups & fixes
* tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Update MAINTAINERS and CREDITS
  sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks
  sched/psi: Optimise psi_group_change a bit
  sched/core: Drop spinlocks on contention iff kernel is preemptible
  sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
  sched/balance: Skip unnecessary updates to idle load balancer's flags
  idle: Remove stale RCU comment
  sched/headers: Move struct pre-declarations to the beginning of the header
  sched/core: Clean up kernel/sched/sched.h a bit
  sched/core: Simplify prefetch_curr_exec_start()
  sched: Fix spelling in comments
  sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c
2024-07-15Merge tag 'rcu.2024.07.12a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Update Tasks RCU and Tasks Rude RCU description in Requirements.rst and clarify rcu_assign_pointer() and rcu_dereference() ordering properties - Add lockdep assertions for RCU readers, limit inline wakeups for callback-bypass synchronize_rcu(), add an rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter, add Uladzislau Rezki as RCU maintainer, and fix a subtle callback-migration memory-ordering issue - Remove a number of redundant memory barriers - Remove unnecessary bypass-list lock-contention mitigation, use parking API instead of open-coded ad-hoc equivalent, and upgrade obsolete comments - Revert avoidance of a deadlock that can no longer occur and properly synchronize Tasks Trace RCU checking of runqueues - Add tests for handling of double-call_rcu() bug, add missing MODULE_DESCRIPTION, and add a script that histograms the number of calls to RCU updaters - Fill out SRCU polled-grace-period API * tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (29 commits) rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation rcu: Eliminate lockless accesses to rcu_sync->gp_count MAINTAINERS: Add Uladzislau Rezki as RCU maintainer rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter rcu/exp: Remove redundant full memory barrier at the end of GP rcu: Remove full memory barrier on RCU stall printout rcu: Remove full memory barrier on boot time eqs sanity check rcu/exp: Remove superfluous full memory barrier upon first EQS snapshot rcu: Remove superfluous full memory barrier upon first EQS snapshot rcu: Remove full ordering on second EQS snapshot srcu: Fill out polled grace-period APIs srcu: Update cleanup_srcu_struct() comment srcu: Add NUM_ACTIVE_SRCU_POLL_OLDSTATE srcu: Disable interrupts directly in srcu_gp_end() rcu: Disable interrupts directly in rcu_gp_init() rcu/tree: Reduce wake up for synchronize_rcu() common case rcu/tasks: Fix stale task snaphot for Tasks Trace tools/rcu: Add rcu-updaters.sh script rcutorture: Add missing MODULE_DESCRIPTION() macros rcutorture: Fix rcu_torture_fwd_cb_cr() data race ...
2024-07-11Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh ↵Ingo Molnar
the branch Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-07-08sched, sched_ext: Open code for_balance_class_range()Tejun Heo
For flexibility, sched_ext allows the BPF scheduler to select the CPU to execute a task on at dispatch time so that e.g. a queue can be shared across multiple CPUs. To enable this, the dispatch path is executed from balance() so that a dispatched task can be hot-migrated to its target CPU. This means that sched_ext needs its balance() method invoked before every pick_next_task() even when the CPU is waking up from SCHED_IDLE. for_balance_class_range() defined in kernel/sched/ext.h implements this selective iteration promotion. However, the indirection obfuscates more than helps. Open code the iteration promotion in put_prev_task_balance() and remove for_balance_class_range(). No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2024-07-08sched, sched_ext: Simplify dl_prio() case handling in sched_fork()Tejun Heo
sched_fork() returns with -EAGAIN if dl_prio(@p). a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class") added scx_pre_fork() call before it and then scx_cancel_fork() on the exit path. This is silly as the dl_prio() block can just be moved above the scx_pre_fork() call. Move the dl_prio() block above the scx_pre_fork() call and remove the now unnecessary scx_cancel_fork() invocation. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Vernet <void@manifault.com>
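The reordering described above can be sketched as follows. This is a userspace illustration under stated assumptions: `pre_fork_sketch()` and `sched_fork_sketch()` are hypothetical stand-ins, and the counter merely models "state that would need to be undone".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical counter modelling state set up by a pre-fork hook. */
static int pre_fork_depth;

static void pre_fork_sketch(void)
{
	pre_fork_depth++;
}

/*
 * Reject deadline-priority tasks before any setup runs. Since nothing
 * has been set up at that point, the early return needs no cancel step.
 */
static int sched_fork_sketch(bool is_deadline_prio)
{
	if (is_deadline_prio)
		return -EAGAIN;

	pre_fork_sketch();
	/* ... the remaining fork-time setup would follow here ... */
	return 0;
}
```

Checking the cheap rejection condition first is what makes the separate cancel call on the error path unnecessary.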
2024-07-08Merge branch 'sched/core' of ↵Tejun Heo
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.11 d32960528702 ("sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks") applied to sched/core changes how reweight_task() is called causing conflicts with e83edbf88f18 ("sched: Add sched_class->reweight_task()"). Resolve the conflicts by taking set_load_weight() changes from d32960528702 and updating sched_class->reweight_task() to take pointer to struct load_weight instead of int prio. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-04sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE ↵Tejun Heo
tasks When a task's weight is being changed, set_load_weight() is called with @update_load set. As weight changes aren't trivial for the fair class, set_load_weight() calls fair.c::reweight_task() for fair class tasks. However, set_load_weight() first tests task_has_idle_policy() on entry and skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as SCHED_IDLE tasks are just fair tasks with a very low weight and they would incorrectly skip load, vlag and position updates. Fix it by updating reweight_task() to take struct load_weight as idle weight can't be expressed with prio and making set_load_weight() call reweight_task() for SCHED_IDLE tasks too when @update_load is set. Fixes: 9059393e4ec1 ("sched/fair: Use reweight_entity() for set_user_nice()") Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org # v4.15+ Link: http://lkml.kernel.org/r/20240624102331.GI31592@noisy.programming.kicks-ass.net
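A simplified sketch of the fixed control flow, not the kernel code: all names ending in `_sketch` are hypothetical, and the two weight constants are illustrative values chosen for the example. The point is that the weight is computed first, for idle-policy tasks as well, and only then does the function decide whether to go through the reweight path.

```c
#include <stdbool.h>

struct load_weight_sketch {
	unsigned long weight;
};

struct task_sketch {
	bool idle_policy;	/* models task_has_idle_policy() */
	bool fair_class;	/* models "task belongs to the fair class" */
	struct load_weight_sketch load;
	int reweight_calls;	/* illustration only: counts reweight calls */
};

/* Illustrative weights, assumed for this sketch. */
#define WEIGHT_IDLE_SKETCH	3UL
#define WEIGHT_NICE0_SKETCH	1024UL

/* Takes a load weight rather than a priority, so an idle weight can be expressed. */
static void reweight_task_sketch(struct task_sketch *p,
				 const struct load_weight_sketch *lw)
{
	p->load = *lw;
	p->reweight_calls++;
}

static void set_load_weight_sketch(struct task_sketch *p, bool update_load)
{
	struct load_weight_sketch lw;

	/* No early return for idle-policy tasks: they only get a different weight. */
	lw.weight = p->idle_policy ? WEIGHT_IDLE_SKETCH : WEIGHT_NICE0_SKETCH;

	if (update_load && p->fair_class)
		reweight_task_sketch(p, &lw);
	else
		p->load = lw;
}
```

In the buggy shape described in the message, the idle-policy test came first and returned early, so the reweight path was never reached for those tasks.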
2024-07-01sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpathJohn Stultz
It was reported that in moving to 6.1, a larger than 10% regression was seen in the performance of clock_gettime(CLOCK_THREAD_CPUTIME_ID,...). Using a simple reproducer, I found: 5.10: 100000000 calls in 24345994193 ns => 243.460 ns per call 100000000 calls in 24288172050 ns => 242.882 ns per call 100000000 calls in 24289135225 ns => 242.891 ns per call 6.1: 100000000 calls in 28248646742 ns => 282.486 ns per call 100000000 calls in 28227055067 ns => 282.271 ns per call 100000000 calls in 28177471287 ns => 281.775 ns per call The cause of this was finally narrowed down to the addition of psi_account_irqtime() in update_rq_clock_task(), in commit 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure"). In my initial attempt to resolve this, I leaned towards moving all accounting work out of the clock_gettime() call path, but it wasn't very pretty, so it will have to wait for a later deeper rework. Instead, Peter shared this approach: Rework psi_account_irqtime() to use its own psi_irq_time base for accounting, and move it out of the hotpath, calling it instead from sched_tick() and __schedule(). In testing this, we found the importance of ensuring psi_account_irqtime() is run under the rq_lock, which Johannes Weiner helpfully explained, so also add some lockdep annotations to make that requirement clear. With this change the performance is back in-line with 5.10: 6.1+fix: 100000000 calls in 24297324597 ns => 242.973 ns per call 100000000 calls in 24318869234 ns => 243.189 ns per call 100000000 calls in 24291564588 ns => 242.916 ns per call Reported-by: Jimmy Shiu <jimmyshiu@google.com> Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com
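The "own time base" idea can be sketched in userspace as below. The struct and function names are hypothetical and locking is left out; the sketch only shows that the hot path accumulates IRQ time cheaply, while the accounting step consumes the difference against its private base when called from a slower path such as the tick.

```c
#include <stdint.h>

/* Hypothetical, simplified per-runqueue state. */
struct rq_sketch {
	uint64_t irq_time_total;	/* ever-growing IRQ time, in ns */
	uint64_t psi_irq_time;		/* private base: last total the accounting consumed */
	uint64_t psi_irq_accounted;	/* what the pressure accounting has seen so far */
};

/* Hot path: only accumulates time, performs no pressure accounting. */
static void update_clock_sketch(struct rq_sketch *rq, uint64_t irq_ns)
{
	rq->irq_time_total += irq_ns;
}

/* Accounting step: charges the delta since its own base, then advances the base. */
static void psi_account_irqtime_sketch(struct rq_sketch *rq)
{
	uint64_t delta = rq->irq_time_total - rq->psi_irq_time;

	if (!delta)
		return;

	rq->psi_irq_time = rq->irq_time_total;
	rq->psi_irq_accounted += delta;
}

/* Slow path (models the tick): this is where accounting now happens. */
static void sched_tick_sketch(struct rq_sketch *rq)
{
	psi_account_irqtime_sketch(rq);
}
```

Because the accounting keeps its own base, no IRQ time is lost between calls: whatever accumulated since the previous tick is charged in one step.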
2024-06-21sched, sched_ext: Replace scx_next_task_picked() with ↵Tejun Heo
sched_class->switch_class() scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when a CPU is taken away by a task dispatched from a higher priority sched_class so that the BPF scheduler can, e.g., punt the task[s] which was running or were waiting for the CPU to other CPUs. Replace the sched_ext specific hook scx_next_task_picked() with a new sched_class operation switch_class(). The changes are straightforward and the code looks better afterwards. However, when !CONFIG_SCHED_CLASS_EXT, this ends up adding an unused hook which is unlikely to be useful to other sched_classes. For further discussion on this subject, please refer to the following: http://lkml.kernel.org/r/CAHk-=wjFPLqo7AXu8maAGEGnOy6reUg-F4zzFhVB0Kyu22h7pw@mail.gmail.com Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2024-06-18sched_ext: Implement core-sched supportTejun Heo
The core-sched support is composed of the following parts: - task_struct->scx.core_sched_at is added. This is a timestamp which can be used to order tasks. Depending on whether the BPF scheduler implements custom ordering, it tracks either global FIFO ordering of all tasks or local-DSQ ordering within the dispatched tasks on a CPU. - prio_less() is updated to call scx_prio_less() when comparing SCX tasks. scx_prio_less() calls ops.core_sched_before() if available or uses the core_sched_at timestamp. For global FIFO ordering, the BPF scheduler doesn't need to do anything. Otherwise, it should implement ops.core_sched_before() which reflects the ordering. - When core-sched is enabled, balance_scx() balances all SMT siblings so that they all have tasks dispatched if necessary before pick_task_scx() is called. pick_task_scx() picks between the current task and the first dispatched task on the local DSQ based on availability and the core_sched_at timestamps. Note that FIFO ordering is expected among the already dispatched tasks whether running or on the local DSQ, so this path always compares core_sched_at instead of calling into ops.core_sched_before(). qmap_core_sched_before() is added to scx_qmap. It scales the distances from the heads of the queues to compare the tasks across different priority queues and seems to behave as expected. v3: Fixed build error when !CONFIG_SCHED_SMT reported by Andrea Righi. v2: Sched core added the const qualifiers to prio_less task arguments. Explicitly drop them for ops.core_sched_before() task arguments. BPF enforces access control through the verifier, so the qualifier isn't actually operative and only gets in the way when interacting with various helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Reviewed-by: Josh Don <joshdon@google.com> Cc: Andrea Righi <andrea.righi@canonical.com>
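A rough sketch of the comparison described above, under stated assumptions: `struct task_sketch`, `runs_before_fn` and `prio_less_sketch()` are hypothetical, and the callback convention ("returns true if the first task should run before the second") is chosen for this example rather than taken from the source. Without a callback, the sketch falls back to FIFO order on the timestamp, using a wraparound-safe comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct task_sketch {
	uint64_t core_sched_at;	/* ordering timestamp */
	int rank;		/* illustration only: used by the sample callback */
};

/* Hypothetical custom ordering hook: true if @a should run before @b. */
typedef bool (*runs_before_fn)(const struct task_sketch *a,
			       const struct task_sketch *b);

/* Sample custom ordering: a lower rank runs first. */
static bool rank_runs_before(const struct task_sketch *a,
			     const struct task_sketch *b)
{
	return a->rank < b->rank;
}

/* Returns true if @a has lower priority than @b. */
static bool prio_less_sketch(const struct task_sketch *a,
			     const struct task_sketch *b,
			     runs_before_fn before)
{
	if (before)
		return before(b, a);

	/* FIFO fallback: the task stamped later has lower priority. */
	return (int64_t)(a->core_sched_at - b->core_sched_at) > 0;
}
```

Comparing the signed difference instead of the raw values keeps the FIFO ordering correct even if the timestamp counter wraps.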
2024-06-18sched_ext: Implement sched_ext_ops.cpu_online/offline()Tejun Heo
Add ops.cpu_online/offline() which are invoked when CPUs come online and offline respectively. As the enqueue path already automatically bypasses tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed to see tasks only on CPUs which are between online() and offline(). If the BPF scheduler doesn't implement ops.cpu_online/offline(), the scheduler is automatically exited with SCX_ECODE_RESTART | SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotplug support trivially by simply reinitializing and reloading the scheduler. scx_qmap is updated to print out online CPUs on hotplug events. Other schedulers are updated to restart based on ecode. v3: - The previous implementation added @reason to sched_class.rq_on/offline() to distinguish between CPU hotplug events and topology updates. This was buggy and fragile as the methods are skipped if the current state equals the target state. Instead, add scx_rq_[de]activate() which are directly called from sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to sleep which can be useful. - ops.dispatch() could be called on a CPU that the BPF scheduler was told to be offline. The dispatch path is updated to bypass in such cases. v2: - To accommodate lock ordering change between scx_cgroup_rwsem and cpus_read_lock(), CPU hotplug operations are put into its own SCX_OPI block and enabled earlier during scx_ops_enable() so that cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem. - Auto exit with ECODE added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18sched_ext: Implement SCX_KICK_WAITDavid Vernet
If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for the kicked cpu to enter the scheduler. See the following for example usage: https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation. - Include SCX_KICK_WAIT related information in debug dump. Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18sched_ext: Implement tickless supportTejun Heo
Allow BPF schedulers to indicate tickless operation by setting p->scx.slice to SCX_SLICE_INF. A CPU whose current task has an infinite slice goes into tickless operation. scx_central is updated to use tickless operations for all tasks and instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT and task state tracking added by the previous patches. Currently, there is no way to pin the timer on the central CPU, so it may end up on one of the worker CPUs; however, outside of that, the worker CPUs can go tickless both while running sched_ext tasks and idling. With schbench running, scx_central shows: root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts LOC: 142024 656 664 449 Local timer interrupts LOC: 161663 663 665 449 Local timer interrupts Without it: root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts LOC: 188778 3142 3793 3993 Local timer interrupts LOC: 198993 5314 6323 6438 Local timer interrupts While scx_central itself is too barebone to be useful as a production scheduler, a more featureful central scheduler can be built using the same approach. Google's experience shows that such an approach can have significant benefits for certain applications such as VM hosting. v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available. v3: Pin the central scheduler's timer on the central_cpu using BPF_F_TIMER_CPU_PIN. v2: Convert to BPF inline iterators. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
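The sentinel-slice idea can be sketched as below. Everything here is an assumption made for illustration: `SLICE_INF_SKETCH` uses the maximum 64-bit value as the sentinel, and the two helpers are hypothetical, not the sched_ext implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sentinel for "infinite slice" in this sketch. */
#define SLICE_INF_SKETCH UINT64_MAX

struct task_sketch {
	uint64_t slice_ns;	/* remaining time slice */
};

/* The tick may be stopped only while the current task has an infinite slice. */
static bool can_stop_tick_sketch(const struct task_sketch *curr)
{
	return curr->slice_ns == SLICE_INF_SKETCH;
}

/* Charge runtime against the slice; an infinite slice is never consumed. */
static void charge_slice_sketch(struct task_sketch *curr, uint64_t ran_ns)
{
	if (curr->slice_ns == SLICE_INF_SKETCH)
		return;

	/* Clamp at zero instead of wrapping around. */
	curr->slice_ns -= ran_ns < curr->slice_ns ? ran_ns : curr->slice_ns;
}
```

With an infinite slice nothing expires on its own, which is why the message describes scx_central expiring slices from a timer instead.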
2024-06-18sched_ext: Print sched_ext info when dumping stackDavid Vernet
It would be useful to see what the sched_ext scheduler state is, and what scheduler is running, when we're dumping a task's stack. This patch therefore adds a new print_scx_info() function that's called in the same context as print_worker_info() and print_stop_info(). An example dump follows. BUG: kernel NULL pointer dereference, address: 0000000000000999 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP CPU: 13 PID: 2047 Comm: insmod Tainted: G O 6.6.0-work-10323-gb58d4cae8e99-dirty #34 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022 Sched_ext: qmap (enabled+all), task: runnable_at=-17ms RIP: 0010:init_module+0x9/0x1000 [test_module] ... v3: - scx_ops_enable_state_str[] definition moved to an earlier patch as it's now used by core implementation. - Convert jiffy delta to msecs using jiffies_to_msecs() instead of multiplying by (HZ / MSEC_PER_SEC). The conversion is implemented in jiffies_delta_msecs(). v2: - We are now using scx_ops_enable_state_str[] outside CONFIG_SCHED_DEBUG. Move it outside of CONFIG_SCHED_DEBUG and to the top. This was reported by Changwoo and Andrea. Signed-off-by: David Vernet <void@manifault.com> Reported-by: Changwoo Min <changwoo@igalia.com> Reported-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org>
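The v3 note about the jiffy conversion can be illustrated with plain integer arithmetic. This is a sketch with hypothetical helper names and the tick rate passed as a parameter; it does not reproduce the kernel's jiffies_to_msecs(). One plausible problem with multiplying by (HZ / MSEC_PER_SEC) is that the factor truncates to zero in integer arithmetic whenever the tick rate is below 1000.

```c
/* Milliseconds per second, named to avoid implying a kernel header. */
#define MSEC_PER_SEC_SKETCH 1000L

/* The replaced form: the factor is 0 for any hz below 1000. */
static long naive_delta_msecs(long delta_jiffies, long hz)
{
	return delta_jiffies * (hz / MSEC_PER_SEC_SKETCH);
}

/* Scale first, then divide by the tick rate; the sign of the delta is kept. */
static long delta_msecs_sketch(long delta_jiffies, long hz)
{
	return delta_jiffies * MSEC_PER_SEC_SKETCH / hz;
}
```

For a delta of -5 ticks at a tick rate of 250, the naive form yields 0 while the scaled form yields -20, the kind of negative value shown as runnable_at in the example dump.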