summaryrefslogtreecommitdiff
path: root/kernel/sched/ext.c
AgeCommit message (Collapse)Author
2025-03-26sched_ext: initialize built-in idle state before ops.init()Andrea Righi
A BPF scheduler may want to use the built-in idle cpumasks in ops.init() before the scheduler is fully initialized, either directly or through a BPF timer for example. However, this would result in an error, since the idle state has not been properly initialized yet. This can be easily verified by modifying scx_simple to call scx_bpf_get_idle_cpumask() in ops.init(): $ sudo scx_simple DEBUG DUMP =========================================================================== scx_simple[121] triggered exit kind 1024: runtime error (built-in idle tracking is disabled) ... Fix this by properly initializing the idle state before ops.init() is called. With this change applied: $ sudo scx_simple local=2 global=0 local=19 global=11 local=23 global=11 ... Fixes: d73249f88743d ("sched_ext: idle: Make idle static keys private") Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-26sched_ext: create_dsq: Return -EEXIST on duplicate requestJake Hillion
create_dsq and therefore the scx_bpf_create_dsq kfunc currently silently ignore duplicate entries. As a sched_ext scheduler is creating each DSQ for a different purpose this is surprising behaviour. Replace rhashtable_insert_fast which ignores duplicates with rhashtable_lookup_insert_fast that reports duplicates (though doesn't return their value). The rest of the code is structured correctly and this now returns -EEXIST. Tested by adding an extra scx_bpf_create_dsq to scx_simple. Previously this was ignored, now init fails with a -17 code. Also ran scx_lavd which continued to work well. Signed-off-by: Jake Hillion <jake@hillion.co.uk> Acked-by: Andrea Righi <arighi@nvidia.com> Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-24Merge tag 'sched-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core & fair scheduler changes: - Cancel the slice protection of the idle entity (Zihan Zhou) - Reduce the default slice to avoid tasks getting an extra tick (Zihan Zhou) - Force propagating min_slice of cfs_rq when {en,de}queue tasks (Tianchen Ding) - Refactor can_migrate_task() to elimate looping (I Hsin Cheng) - Add unlikey branch hints to several system calls (Colin Ian King) - Optimize current_clr_polling() on certain architectures (Yujun Dong) Deadline scheduler: (Juri Lelli) - Remove redundant dl_clear_root_domain call - Move dl_rebuild_rd_accounting to cpuset.h Uclamp: - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan) - Optimize sched_uclamp_used static key enabling (Xuewen Yan) Scheduler topology support: (Juri Lelli) - Ignore special tasks when rebuilding domains - Add wrappers for sched_domains_mutex - Generalize unique visiting of root domains - Rebuild root domain accounting after every update - Remove partition_and_rebuild_sched_domains - Stop exposing partition_sched_domains_locked RSEQ: (Michael Jeanson) - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y - Fix segfault on registration when rseq_cs is non-zero - selftests: Add rseq syscall errors test - selftests: Ensure the rseq ABI TLS is actually 1024 bytes Membarriers: - Fix redundant load of membarrier_state (Nysal Jan K.A.) Scheduler debugging: - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior) - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar) Fixes and cleanups: - Always save/restore x86 TSC sched_clock() on suspend/resume (Guilherme G. Piccoli) - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian Andrzej Siewior)" * tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling() sched/debug: Remove CONFIG_SCHED_DEBUG sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional sched/debug: Make 'const_debug' tunables unconditional __read_mostly sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE() rseq/selftests: Fix namespace collision with rseq UAPI header include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h sched/topology: Stop exposing partition_sched_domains_locked cgroup/cpuset: Remove partition_and_rebuild_sched_domains sched/topology: Remove redundant dl_clear_root_domain call sched/deadline: Rebuild root domain accounting after every update sched/deadline: Generalize unique visiting of root domains sched/topology: Wrappers for sched_domains_mutex sched/deadline: Ignore special tasks when rebuilding domains tracing: Use preempt_model_str() xtensa: Rely on generic printing of preemption model x86: Rely on generic printing of preemption model s390: Rely on generic printing of preemption model ...
2025-03-19sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()Ingo Molnar
The scheduler has this special SCHED_WARN() facility that depends on CONFIG_SCHED_DEBUG. Since CONFIG_SCHED_DEBUG is getting removed, convert SCHED_WARN() to WARN_ON_ONCE(). Note that the warning output isn't 100% equivalent: #define SCHED_WARN_ON(x) WARN_ONCE(x, #x) Because SCHED_WARN_ON() would output the 'x' condition as well, while WARN_ONCE() will only show a backtrace. Hopefully these are rare enough to not really matter. If it does, we should probably introduce a new WARN_ON() variant that outputs the condition in stringified form, or improve WARN_ON() itself. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-2-mingo@kernel.org
2025-03-14sched_ext: idle: Refactor scx_select_cpu_dfl()Andrea Righi
Make scx_select_cpu_dfl() more consistent with the other idle-related APIs by returning a negative value when an idle CPU isn't found. No functional changes, this is purely a refactoring. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14sched_ext: idle: Honor idle flags in the built-in idle selection policyAndrea Righi
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(), to enforce strict selection criteria, such as selecting an idle CPU strictly within @prev_cpu's node or choosing only a fully idle SMT core. This functionality will be exposed through a dedicated kfunc in a separate patch. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-13sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()Andrea Righi
scx_bpf_reenqueue_local() can be invoked from ops.cpu_release() to give tasks that are queued to the local DSQ a chance to migrate to other CPUs, when a CPU is taken by a higher scheduling class. However, there is no point re-enqueuing tasks that can only run on that particular CPU, as they would simply be re-added to the same local DSQ without any benefit. Therefore, skip per-CPU tasks in scx_bpf_reenqueue_local(). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04sched_ext: Add trace point to track sched_ext core eventsChangwoo Min
Add tracing support to track sched_ext core events (/sched_ext/sched_ext_event). This may be useful for debugging sched_ext schedulers that trigger a particular event. The trace point can be used as other trace points, so it can be used in, for example, `perf trace` and BPF programs, as follows: ====== $> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"' ====== ====== struct tp_sched_ext_event { struct trace_entry ent; u32 __data_loc_name; s64 delta; }; SEC("tracepoint/sched_ext/sched_ext_event") int rtp_add_event(struct tp_sched_ext_event *ctx) { char event_name[128]; unsigned short offset = ctx->__data_loc_name & 0xFFFF; bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset); bpf_printk("name %s delta %lld", event_name, ctx->delta); return 0; } ====== Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04sched_ext: Change the event type from u64 to s64Changwoo Min
The event count could be negative in the future, so change the event type from u64 to s64. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03sched_ext: Merge branch 'for-6.14-fixes' into for-6.15Tejun Heo
Pull for-6.14-fixes to receive: 9360dfe4cbd6 ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()") which conflicts with: 337d1b354a29 ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()Andrea Righi
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel crash. To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and trigger an scx error if an invalid CPU is specified. Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called ↵Tejun Heo
without balance() a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") added a workaround to handle the cases where pick_task_scx() is called without prececing balance_scx() which is due to a fair class bug where pick_taks_fair() may return NULL after a true return from balance_fair(). The workaround detects when pick_task_scx() is called without preceding balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid stalling. Unfortunately, the workaround code was testing whether @prev was on SCX to decide whether to keep the task running. This is incorrect as the task may be on SCX but no longer runnable. This could lead to a non-runnable task to be returned from pick_task_scx() which cause interesting confusions and failures. e.g. A common failure mode is the task ending up with (!on_rq && on_cpu) state which can cause potential wakers to busy loop, which can easily lead to deadlocks. Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes @prev_on_scx only used in one place. Open code the usage and improve the comment while at it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Pat Cody <patcody@meta.com> Fixes: a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-24sched_ext: idle: Introduce scx_bpf_nr_node_ids()Andrea Righi
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system. BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16sched_ext: idle: Per-node idle cpumasksAndrea Righi
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask. Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems. The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node. The split of multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag. By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior. NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks. = Test = Hardware: - System: DGX B200 - CPUs: 224 SMT threads (112 physical cores) - Processor: INTEL(R) XEON(R) PLATINUM 8570 - 2 NUMA nodes Scheduler: - scx_simple [1] (so that we can focus at the built-in idle selection policy and not at the scheduling policy itself) Test: - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs: avg time | stdev ---------+------ before: 52.431s | 2.895 after: 50.342s | 2.895 = Conclusion = Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case. The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%. On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple). Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test. [1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODEAndrea Righi
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks. This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior. Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16sched_ext: idle: Make idle static keys privateAndrea Righi
Make all the static keys used by the idle CPU selection policy private to ext_idle.c. This avoids unnecessary exposure in headers and improves code encapsulation. Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14sched_ext: Provides a sysfs 'events' to expose core event countersChangwoo Min
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core event counters through the files system interface. Each line of the file shows the event name and its counter value. In addition, the format of scx_dump_event() is adjusted as the event name gets longer. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUPTejun Heo
A task wakeup can be either processed on the waker's CPU or bounced to the wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU avoids the waker's CPU locking and accessing the wakee's rq which can be expensive across cache and node boundaries. When ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu() may be skipped in some cases (racing against the wakee switching out). As this confused some BPF schedulers, there wasn't a good way for a BPF scheduler to tell whether idle CPU selection has been skipped, ops.enqueue() couldn't insert tasks into foreign local DSQs, and the performance difference on machines with simple toplogies were minimal, sched_ext disabled ttwu_queue. However, this optimization makes noticeable difference on more complex topologies and a BPF scheduler now has an easy way tell whether ops.select_cpu() was skipped since 9b671793c7d9 ("sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs since 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches"). Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose to enable ttwu_queue optimization. v2: Update the patch description and comment re. ops.select_cpu() being skipped in some cases as opposed to always as per Neel. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Neel Natu <neelnatu@google.com> Reported-by: Barret Rhoden <brho@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-13sched_ext: Use SCX_CALL_OP_TASK in task_tick_scxChuyi Zhou
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of the current task, the following error will occur: scx_foo[3795244] triggered exit kind 1024: runtime error (called on a task not being operated on) The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK() when calling ops.tick(), which triggers the error during the subsequent scx_kf_allowed_on_arg_tasks() check. SCX_CALL_OP_TASK() was first introduced in commit 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation") to ensure task's rq lock is held when accessing task's sched_group. Since ops.tick() is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly, the same changes should be made for ops.disable() and ops.exit_task(), as they are also protected by task_rq_lock() and it's safe to access the task's task_group. Fixes: 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10Merge branch 'for-6.14-fixes' into for-6.15Tejun Heo
Pull to receive f3f08c3acfb8 ("sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()") which conflicts with 26176116d931 ("sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spot") in for-6.15.
2025-02-10sched_ext: Fix incorrect assumption about migration disabled tasks in ↵Tejun Heo
task_can_run_on_remote_rq() While fixing migration disabled task handling, 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches") assumed that a migration disabled task's ->cpus_ptr would only have the pinned CPU. While this is eventually true for migration disabled tasks that are switched out, ->cpus_ptr update is performed by migrate_disable_switch() which is called right before context_switch() in __scheduler(). However, the task is enqueued earlier during pick_next_task() via put_prev_task_scx(), so there is a race window where another CPU can see the task on a DSQ. If the CPU tries to dispatch the migration disabled task while in that window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq() will subsequently trigger SCHED_WARN(is_migration_disabled()). WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140 Sched_ext: layered (enabled+all), task: runnable_at=-10ms RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140 ... <TASK> consume_dispatch_q+0xab/0x220 scx_bpf_dsq_move_to_local+0x58/0xd0 bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa bpf__sched_ext_ops_dispatch+0x4b/0xab balance_one+0x1fe/0x3b0 balance_scx+0x61/0x1d0 prev_balance+0x46/0xc0 __pick_next_task+0x73/0x1c0 __schedule+0x206/0x1730 schedule+0x3a/0x160 __do_sys_sched_yield+0xe/0x20 do_syscall_64+0xbb/0x1e0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fix it by converting the SCHED_WARN() back to a regular failure path. Also, perform the migration disabled test before task_allowed_on_cpu() test so that BPF schedulers which fail to handle migration disabled tasks can be noticed easily. While at it, adjust scx_ops_error() message for !task_allowed_on_cpu() case for brevity and consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches") Acked-by: Andrea Righi <arighi@nvidia.com> Reported-by: Jake Hillion <jakehillion@meta.com>
2025-02-10sched_ext: Take NUMA node into account when allocating per-CPU cpumasksLi RongQing
per-CPU cpumasks are dominantly accessed from their own local CPUs, so allocate them node-local to improve performance. Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-08sched_ext: Add SCX_EV_ENQ_SKIP_MIGRATION_DISABLEDTejun Heo
Count the number of times a migration disabled task is automatically dispatched to its local DSQ. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spotTejun Heo
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE wasn't quite right in two aspects: - It counted both migration disabled and offline events. - It didn't count events from scx_bpf_dsq_move() path. Fix it by moving the counting into task_can_run_on_remote_rq() which is shared by both paths and can distinguish the different rejection conditions. The argument @trigger_error is renamed to @enforce as it now does more than just triggering error. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08Merge branch 'for-6.14-fixes' into for-6.15Tejun Heo
Pull to receive: - 2fa0fbeb69ed ("sched_ext: Implement auto local dispatching of migration disabled tasks") - 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches") as planned for-6.15 changes depend on them (e.g. adding event counter for implicit migration disabled task handling).
2025-02-08sched_ext: Fix migration disabled handling in targeted dispatchesTejun Heo
A dispatch operation that can target a specific local DSQ - scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task can be migrated to the target CPU using task_can_run_on_remote_rq(). If the task can't be migrated to the targeted CPU, it is bounced through a global DSQ. task_can_run_on_remote_rq() assumes that the task is on a CPU that's different from the targeted CPU but the callers doesn't uphold the assumption and may call the function when the task is already on the target CPU. When such task has migration disabled, task_can_run_on_remote_rq() ends up returning %false incorrectly unnecessarily bouncing the task to a global DSQ. Fix it by updating the callers to only call task_can_run_on_remote_rq() when the task is on a different CPU than the target CPU. As this is a bit subtle, for clarity and documentation: - Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on the same CPU as the target CPU. - is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger if the task is on a different CPU than the target CPU as the preceding task_allowed_on_cpu() test should fail beforehand. Convert the test into SCHED_WARN_ON(). Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Fixes: 0366017e0973 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()") Cc: stable@vger.kernel.org # v6.12+
2025-02-08sched_ext: Implement auto local dispatching of migration disabled tasksTejun Heo
Migration disabled tasks are special and pinned to their previous CPUs. They tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers by automatically dispatching them to the pinned local DSQs by default. If a BPF scheduler wants to handle migration disabled tasks explicitly, it can set SCX_OPS_ENQ_MIGRATION_DISABLED. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-07sched_ext: Add an event, SCX_EV_ENQ_SLICE_DFLChangwoo Min
Add a core event, SCX_EV_ENQ_SLICE_DFL, which represents how many tasks have been enqueued (or pick_task-ed or select_cpu-ed) with a default time slice (SCX_SLICE_DFL). Scheduling a task with SCX_SLICE_DFL unintentionally would be a source of latency spikes because SCX_SLICE_DFL is relatively long (20 msec). Thus, soaring the SCX_EV_ENQ_SLICE_DFL value would be a sign of BPF scheduler bugs, causing latency spikes, especially when ops.select_cpu() is provided. __scx_add_event() is used since the caller holds an rq lock or p->pi_lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04sched_ext: Add an event, SCX_EV_BYPASS_DURATIONChangwoo Min
Add a core event, SCX_EV_BYPASS_DURATION, which represents the total duration of bypass modes in nanoseconds. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04sched_ext: Add an event, SCX_EV_BYPASS_DISPATCHChangwoo Min
Add a core event, SCX_EV_BYPASS_DISPATCH, which represents how many tasks have been dispatched in the bypass mode. __scx_add_event() is used since the caller holds an rq lock or p->pi_lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04sched_ext: Add an event, SCX_EV_BYPASS_ACTIVATEChangwoo Min
Add a core event, SCX_EV_BYPASS_ACTIVATE, which represents how many times the bypass mode has been triggered. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04sched_ext: Add an event, SCX_EV_ENQ_SKIP_EXITINGChangwoo Min
Add a core event, SCX_EV_ENQ_SKIP_EXITING, which represents how many times a task is enqueued to a local DSQ when exiting if SCX_OPS_ENQ_EXITING is not set. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02sched_ext: Add an event, SCX_EV_DISPATCH_KEEP_LASTChangwoo Min
Add a core event, SCX_EV_DISPATCH_KEEP_LAST, which represents how many times a task is continued to run without ops.enqueue() when SCX_OPS_ENQ_LAST is not set. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02sched_ext: Add an event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINEChangwoo Min
Add a core event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which represents how many times a BPF scheduler tries to dispatch to an offlined local DSQ. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02sched_ext: Add an event, SCX_EV_SELECT_CPU_FALLBACKChangwoo Min
Add a core event, SCX_EV_SELECT_CPU_FALLBACK, which represents how many times ops.select_cpu() returns a CPU that the task can't use. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02sched_ext: Implement event counter infrastructureChangwoo Min
Collect the statistics of specific types of behavior in the sched_ext core, which are not easily visible but still interesting to an scx scheduler. An event type is defined in 'struct scx_event_stats.' When an event occurs, its counter is accumulated using 'scx_add_event()' and '__scx_add_event()' to per-CPU 'struct scx_event_stats' for efficiency. 'scx_bpf_events()' aggregates all the per-CPU counters and exposes a system-wide counters. For convenience and readability of the code, 'scx_agg_event()' and 'scx_dump_event()' are provided. The collected events can be observed after a BPF scheduler is unloaded beforea new BPF scheduler is loaded so the per-CPU 'struct scx_event_stats' are reset. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27sched_ext: Move built-in idle CPU selection policy to a separate fileAndrea Righi
As ext.c is becoming quite large, move the idle CPU selection policy to separate files (ext_idle.c / ext_idle.h) for better code readability. Moreover, group together all the idle CPU selection kfunc's to the same btf_kfunc_id_set block. No functional changes, this is purely code reorganization. Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27sched_ext: Fix lock imbalance in dispatch_to_local_dsq()Andrea Righi
While performing the rq locking dance in dispatch_to_local_dsq(), we may trigger the following lock imbalance condition, in particular when multiple tasks are rapidly changing CPU affinity (i.e., running a `stress-ng --race-sched 0`): [ 13.413579] ===================================== [ 13.413660] WARNING: bad unlock balance detected! [ 13.413729] 6.13.0-virtme #15 Not tainted [ 13.413792] ------------------------------------- [ 13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at: [ 13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0 [ 13.414111] but there are no more locks to release! [ 13.414176] [ 13.414176] other info that might help us debug this: [ 13.414258] 1 lock held by kworker/1:1/80: [ 13.414318] #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90 [ 13.414612] [ 13.414612] stack backtrace: [ 13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15 [ 13.415505] Workqueue: 0x0 (events) [ 13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms [ 13.415570] Call Trace: [ 13.415700] <TASK> [ 13.415744] dump_stack_lvl+0x78/0xe0 [ 13.415806] ? dispatch_to_local_dsq+0x108/0x1a0 [ 13.415884] print_unlock_imbalance_bug+0x11b/0x130 [ 13.415965] ? dispatch_to_local_dsq+0x108/0x1a0 [ 13.416226] lock_release+0x231/0x2c0 [ 13.416326] _raw_spin_unlock+0x1b/0x40 [ 13.416422] dispatch_to_local_dsq+0x108/0x1a0 [ 13.416554] flush_dispatch_buf+0x199/0x1d0 [ 13.416652] balance_one+0x194/0x370 [ 13.416751] balance_scx+0x61/0x1e0 [ 13.416848] prev_balance+0x43/0xb0 [ 13.416947] __pick_next_task+0x6b/0x1b0 [ 13.417052] __schedule+0x20d/0x1740 This happens because dispatch_to_local_dsq() is racing with dispatch_dequeue() and, when the latter wins, we incorrectly assume that the task has been moved to dst_rq. Fix by properly tracking the currently locked rq. Fixes: 4d3ca89bdd31 ("sched_ext: Refactor consume_remote_task()") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27sched_ext: Fix incorrect autogroup migration detectionTejun Heo
scx_move_task() is called from sched_move_task() and tells the BPF scheduler that cgroup migration is being committed. sched_move_task() is used by both cgroup and autogroup migrations and scx_move_task() tried to filter out autogroup migrations by testing the destination cgroup and PF_EXITING but this is not enough. In fact, without explicitly tagging the thread which is doing the cgroup migration, there is no good way to tell apart scx_move_task() invocations for racing migration to the root cgroup and an autogroup migration. This led to scx_move_task() incorrectly ignoring a migration from non-root cgroup to an autogroup of the root cgroup triggering the following warning: WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340 ... Call Trace: <TASK> cgroup_migrate_execute+0x5b1/0x700 cgroup_attach_task+0x296/0x400 __cgroup_procs_write+0x128/0x140 cgroup_procs_write+0x17/0x30 kernfs_fop_write_iter+0x141/0x1f0 vfs_write+0x31d/0x4a0 __x64_sys_write+0x72/0xf0 do_syscall_64+0x82/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fix it by adding an argument to sched_move_task() that indicates whether the moving is for a cgroup or autogroup migration. After the change, scx_move_task() is called only for cgroup migrations and renamed to scx_cgroup_move_task(). Link: https://github.com/sched-ext/scx/issues/370 Fixes: 819513666966 ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24sched_ext: Include task weight in the error state dumpAndrea Righi
Report the task weight when dumping the task state during an error exit. Moreover, adjust the output format to display dsq_vtime, slice, and weight on the same line. This can help identify whether certain tasks were excessively prioritized or de-prioritized due to large niceness gaps. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24sched_ext: Fixes typos in commentsAtul Kumar Pant
Fixes some spelling errors in the comments. Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-23Merge tag 'sched_ext-for-6.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - scx_bpf_now() added so that BPF scheduler can access the cached timestamp in struct rq to avoid reading TSC multiple times within a locked scheduling operation. - Minor updates to the built-in idle CPU selection logic. - tool/sched_ext updates and other misc changes. * tag 'sched_ext-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: fix kernel-doc warnings sched_ext: Use time helpers in BPF schedulers sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now() sched_ext: Add time helpers for BPF schedulers sched_ext: Add scx_bpf_now() for BPF scheduler sched_ext: Implement scx_bpf_now() sched_ext: Relocate scx_enabled() related code sched_ext: Add option -l in selftest runner to list all available tests sched_ext: Include remaining task time slice in error state dump sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ON sched_ext: idle: small CPU iteration refactoring sched_ext: idle: introduce check_builtin_idle_enabled() helper sched_ext: idle: clarify comments sched_ext: idle: use assign_cpu() to update the idle cpumask sched_ext: Use str_enabled_disabled() helper in update_selcpu_topology() sched_ext: Use sizeof_field for key_len in dsq_hash_params tools/sched_ext: Receive updates from SCX repo sched_ext: Use the NUMA scheduling domain for NUMA optimizations
2025-01-21Merge tag 'kthread-for-6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "Kthreads affinity follow either of 4 existing different patterns: 1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: - kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: - Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() - Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware - Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. - Default affine kthread to its preferred node. - Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation - Implement kthreads preferred affinity - Unify kthread worker and kthread API's style - Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation" * tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: kthread: modify kernel-doc function name to match code rcu: Use kthread preferred affinity for RCU exp kworkers treewide: Introduce kthread_run_worker[_on_cpu]() kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format rcu: Use kthread preferred affinity for RCU boost kthread: Implement preferred affinity mm: Create/affine kswapd to its preferred node mm: Create/affine kcompactd to its preferred node kthread: Default affine kthread to its preferred NUMA node kthread: Make sure kthread hasn't started while binding it sched,arm64: Handle CPU isolation on last resort fallback rq selection arm64: Exclude nohz_full CPUs from 32bits el0 support lib: test_objpool: Use kthread_run_on_cpu() kallsyms: Use kthread_run_on_cpu() soc/qman: test: Use kthread_run_on_cpu() arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-13sched_ext: fix kernel-doc warningsRandy Dunlap
Use the correct function parameter names and function names. Use the correct kernel-doc comment format for struct sched_ext_ops to eliminate a bunch of warnings. ext.c:1418: warning: Excess function parameter 'include_dead' description in 'scx_task_iter_next_locked' ext.c:7261: warning: expecting prototype for scx_bpf_dump(). Prototype was for scx_bpf_dump_bstr() instead ext.c:7352: warning: Excess function parameter 'flags' description in 'scx_bpf_cpuperf_set' ext.c:3150: warning: Function parameter or struct member 'in_fi' not described in 'scx_prio_less' ext.c:4711: warning: Function parameter or struct member 'dur_s' not described in 'scx_softlockup' ext.c:4775: warning: Function parameter or struct member 'bypass' not described in 'scx_ops_bypass' ext.c:7453: warning: Function parameter or struct member 'idle_mask' not described in 'scx_bpf_put_idle_cpumask' ext.c:209: warning: Incorrect use of kernel-doc format: * select_cpu - Pick the target CPU for a task which is being woken up ext.c:236: warning: Incorrect use of kernel-doc format: * enqueue - Enqueue a task on the BPF scheduler ext.c:251: warning: Incorrect use of kernel-doc format: * dequeue - Remove a task from the BPF scheduler ext.c:267: warning: Incorrect use of kernel-doc format: * dispatch - Dispatch tasks from the BPF scheduler and/or user DSQs ext.c:290: warning: Incorrect use of kernel-doc format: * tick - Periodic tick ext.c:300: warning: Incorrect use of kernel-doc format: * runnable - A task is becoming runnable on its associated CPU ext.c:327: warning: Incorrect use of kernel-doc format: * running - A task is starting to run on its associated CPU ext.c:335: warning: Incorrect use of kernel-doc format: * stopping - A task is stopping execution ext.c:346: warning: Incorrect use of kernel-doc format: * quiescent - A task is becoming not runnable on its associated CPU ext.c:366: warning: Incorrect use of kernel-doc format: * yield - Yield CPU ext.c:381: warning: Incorrect use of kernel-doc format: * core_sched_before - Task ordering for core-sched ext.c:399: warning: Incorrect use of kernel-doc format: * set_weight - Set task weight ext.c:408: warning: Incorrect use of kernel-doc format: * set_cpumask - Set CPU affinity ext.c:418: warning: Incorrect use of kernel-doc format: * update_idle - Update the idle state of a CPU ext.c:439: warning: Incorrect use of kernel-doc format: * cpu_acquire - A CPU is becoming available to the BPF scheduler ext.c:449: warning: Incorrect use of kernel-doc format: * cpu_release - A CPU is taken away from the BPF scheduler ext.c:461: warning: Incorrect use of kernel-doc format: * init_task - Initialize a task to run in a BPF scheduler ext.c:476: warning: Incorrect use of kernel-doc format: * exit_task - Exit a previously-running task from the system ext.c:485: warning: Incorrect use of kernel-doc format: * enable - Enable BPF scheduling for a task ext.c:494: warning: Incorrect use of kernel-doc format: * disable - Disable BPF scheduling for a task ext.c:504: warning: Incorrect use of kernel-doc format: * dump - Dump BPF scheduler state on error ext.c:512: warning: Incorrect use of kernel-doc format: * dump_cpu - Dump BPF scheduler state for a CPU on error ext.c:524: warning: Incorrect use of kernel-doc format: * dump_task - Dump BPF scheduler state for a runnable task on error ext.c:535: warning: Incorrect use of kernel-doc format: * cgroup_init - Initialize a cgroup ext.c:550: warning: Incorrect use of kernel-doc format: * cgroup_exit - Exit a cgroup ext.c:559: warning: Incorrect use of kernel-doc format: * cgroup_prep_move - Prepare a task to be moved to a different cgroup ext.c:574: warning: Incorrect use of kernel-doc format: * cgroup_move - Commit cgroup move ext.c:585: warning: Incorrect use of kernel-doc format: * cgroup_cancel_move - Cancel cgroup move ext.c:597: warning: Incorrect use of kernel-doc format: * cgroup_set_weight - A cgroup's weight is being changed ext.c:611: warning: Incorrect use of kernel-doc format: * cpu_online - A CPU became online ext.c:620: warning: Incorrect use of kernel-doc format: * cpu_offline - A CPU is going offline ext.c:633: warning: Incorrect use of kernel-doc format: * init - Initialize the BPF scheduler ext.c:638: warning: Incorrect use of kernel-doc format: * exit - Clean up after the BPF scheduler ext.c:648: warning: Incorrect use of kernel-doc format: * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch ext.c:653: warning: Incorrect use of kernel-doc format: * flags - %SCX_OPS_* flags ext.c:658: warning: Incorrect use of kernel-doc format: * timeout_ms - The maximum amount of time, in milliseconds, that a ext.c:667: warning: Incorrect use of kernel-doc format: * exit_dump_len - scx_exit_info.dump buffer length. If 0, the default ext.c:673: warning: Incorrect use of kernel-doc format: * hotplug_seq - A sequence number that may be set by the scheduler to ext.c:682: warning: Incorrect use of kernel-doc format: * name - BPF scheduler's name ext.c:689: warning: Function parameter or struct member 'select_cpu' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'enqueue' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dequeue' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dispatch' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'tick' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'runnable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'running' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'stopping' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'quiescent' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'yield' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'core_sched_before' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'set_weight' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'set_cpumask' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'update_idle' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_acquire' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_release' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'init_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'enable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'disable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump_cpu' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_init' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_exit' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_prep_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_cancel_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_set_weight' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_online' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_offline' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'init' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dispatch_max_batch' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'flags' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'timeout_ms' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit_dump_len' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'hotplug_seq' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'name' not described in 'sched_ext_ops' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10sched_ext: idle: Refresh idle masks during idle-to-idle transitionsAndrea Righi
With the consolidation of put_prev_task/set_next_task(), see commit 436f3eed5c69 ("sched: Combine the last put_prev_task() and the first set_next_task()"), we are now skipping the transition between these two functions when the previous and the next tasks are the same. As a result, the scx idle state of a CPU is updated only when transitioning to or from the idle thread. While this is generally correct, it can lead to uneven and inefficient core utilization in certain scenarios [1]. A typical scenario involves proactive wake-ups: scx_bpf_pick_idle_cpu() selects and marks an idle CPU as busy, followed by a wake-up via scx_bpf_kick_cpu(), without dispatching any tasks. In this case, the CPU continues running the idle thread, returns to idle, but remains marked as busy, preventing it from being selected again as an idle CPU (until a task eventually runs on it and releases the CPU). For example, running a workload that uses 20% of each CPU, combined with an scx scheduler using proactive wake-ups, results in the following core utilization: CPU 0: 25.7% CPU 1: 29.3% CPU 2: 26.5% CPU 3: 25.5% CPU 4: 0.0% CPU 5: 25.5% CPU 6: 0.0% CPU 7: 10.5% To address this, refresh the idle state also in pick_task_idle(), during idle-to-idle transitions, but only trigger ops.update_idle() on actual state changes to prevent unnecessary updates to the scx scheduler and maintain balanced state transitions. With this change in place, the core utilization in the previous example becomes the following: CPU 0: 18.8% CPU 1: 19.4% CPU 2: 18.0% CPU 3: 18.7% CPU 4: 19.3% CPU 5: 18.9% CPU 6: 18.7% CPU 7: 19.3% [1] https://github.com/sched-ext/scx/pull/1139 Fixes: 7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10sched_ext: Implement scx_bpf_now()Changwoo Min
Returns a high-performance monotonically non-decreasing clock for the current CPU. The clock returned is in nanoseconds. It provides the following properties: 1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently to account for execution time and track tasks' runtime properties. Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which eventually reads a hardware timestamp counter -- is neither performant nor scalable. scx_bpf_now() aims to provide a high-performance clock by using the rq clock in the scheduler core whenever possible. 2) High enough resolution for the BPF scheduler use cases: In most BPF scheduler use cases, the required clock resolution is lower than the most accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now() basically uses the rq clock in the scheduler core whenever it is valid. It considers that the rq clock is valid from the time the rq clock is updated (update_rq_clock) until the rq is unlocked (rq_unpin_lock). 3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now() guarantees the clock never goes backward when comparing them in the same CPU. On the other hand, when comparing clocks in different CPUs, there is no such guarantee -- the clock can go backward. It provides a monotonically *non-decreasing* clock so that it would provide the same clock values in two different scx_bpf_now() calls in the same CPU during the same period of when the rq clock is valid. An rq clock becomes valid when it is updated using update_rq_clock() and invalidated when the rq is unlocked using rq_unpin_lock(). Let's suppose the following timeline in the scheduler core: T1. rq_lock(rq) T2. update_rq_clock(rq) T3. a sched_ext BPF operation T4. rq_unlock(rq) T5. a sched_ext BPF operation T6. rq_lock(rq) T7. update_rq_clock(rq) For [T2, T4), we consider that rq clock is valid (SCX_RQ_CLK_VALID is set), so scx_bpf_now() calls during [T2, T4) (including T3) will return the rq clock updated at T2. For duration [T4, T7), when a BPF scheduler can still call scx_bpf_now() (T5), we consider the rq clock is invalid (SCX_RQ_CLK_VALID is unset at T4). So when calling scx_bpf_now() at T5, we will return a fresh clock value by calling sched_clock_cpu() internally. Also, to prevent getting outdated rq clocks from a previous scx scheduler, invalidate all the rq clocks when unloading a BPF scheduler. One example of calling scx_bpf_now(), when the rq clock is invalid (like T5), is in scx_central [1]. The scx_central scheduler uses a BPF timer for preemptive scheduling. In every msec, the timer callback checks if the currently running tasks exceed their timeslice. At the beginning of the BPF timer callback (central_timerfn in scx_central.bpf.c), scx_central gets the current time. When the BPF timer callback runs, the rq clock could be invalid, the same as T5. In this case, scx_bpf_now() returns a fresh clock value rather than returning the old one (T2). [1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_central.bpf.c Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08treewide: Introduce kthread_run_worker[_on_cpu]()Frederic Weisbecker
kthread_create() creates a kthread without running it yet. kthread_run() creates a kthread and runs it. On the other hand, kthread_create_worker() creates a kthread worker and runs it. This difference in behaviours is confusing. Also there is no way to create a kthread worker and affine it using kthread_bind_mask() or kthread_affine_preferred() before starting it. Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]() that behaves just like kthread_run(). kthread_create_worker[_on_cpu]() will now only create a kthread worker without starting it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-01-08sched_ext: switch class when preempted by higher priority schedulerHonglei Wang
ops.cpu_release() function, if defined, must be invoked when preempted by a higher priority scheduler class task. This scenario was skipped in commit f422316d7466 ("sched_ext: Remove switch_class_scx()"). Let's fix it. Fixes: f422316d7466 ("sched_ext: Remove switch_class_scx()") Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()Changwoo Min
scx_ops_bypass() iterates all CPUs to re-enqueue all the scx tasks. For each CPU, it acquires a lock using rq_lock() regardless of whether a CPU is offline or the CPU is currently running a task in a higher scheduler class (e.g., deadline). The rq_lock() is supposed to be used for online CPUs, and the use of rq_lock() may trigger an unnecessary warning in rq_pin_lock(). Therefore, replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass(). Without this change, we observe the following warning: ===== START ===== [ 6.615205] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 6.615208] WARNING: CPU: 2 PID: 0 at kernel/sched/sched.h:1730 __schedule+0x1130/0x1c90 ===== END ===== Fixes: 0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()") Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08sched_ext: keep running prev when prev->scx.slice != 0Henry Huang
When %SCX_OPS_ENQ_LAST is set and prev->scx.slice != 0, @prev will be dispacthed into the local DSQ in put_prev_task_scx(). However, pick_task_scx() is executed before put_prev_task_scx(), so it will not pick @prev. Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx() can pick @prev. Signed-off-by: Henry Huang <henry.hj@antgroup.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>