summaryrefslogtreecommitdiff
path: root/kernel/sched/ext.c
AgeCommit message (Collapse)Author
2024-10-25scx: Fix raciness in scx_ops_bypass()David Vernet
scx_ops_bypass() can currently race on the ops enable / disable path as follows: 1. scx_ops_bypass(true) called on enable path, bypass depth is set to 1 2. An op on the init path exits, which schedules scx_ops_disable_workfn() 3. scx_ops_bypass(false) is called on the disable path, and bypass depth is decremented to 0 4. kthread is scheduled to execute scx_ops_disable_workfn() 5. scx_ops_bypass(true) called, bypass depth set to 1 6. scx_ops_bypass() races when iterating over CPUs While it's not safe to take any blocking locks on the bypass path, it is safe to take a raw spinlock which cannot be preempted. This patch therefore updates scx_ops_bypass() to use a raw spinlock to synchronize, and changes scx_ops_bypass_depth to be a regular int. Without this change, we observe the following warnings when running the 'exit' sched_ext selftest (sometimes requires a couple of runs): .[root@virtme-ng sched_ext]# ./runner -t exit ===== START ===== TEST: exit ... [ 14.935078] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4332 scx_ops_bypass+0x1ca/0x280 [ 14.935126] Modules linked in: [ 14.935150] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Not tainted 6.11.0-virtme #24 [ 14.935192] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 14.935242] Sched_ext: exit (enabling+all) [ 14.935244] RIP: 0010:scx_ops_bypass+0x1ca/0x280 [ 14.935300] Code: ff ff ff e8 48 96 10 00 fb e9 08 ff ff ff c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 0f 0b 90 41 8b 84 24 24 [ 14.935394] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010002 [ 14.935424] RAX: 0000000000000009 RBX: 0000000000000001 RCX: 00000000e3fb8b2a [ 14.935465] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff88a4c080 [ 14.935512] RBP: 0000000000009b56 R08: 0000000000000004 R09: 00000003f12e520a [ 14.935555] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300 [ 14.935598] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018 [ 14.935642] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000 [ 14.935684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 14.935721] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0 [ 14.935765] PKRU: 55555554 [ 14.935782] Call Trace: [ 14.935802] <TASK> [ 14.935823] ? __warn+0xce/0x220 [ 14.935850] ? scx_ops_bypass+0x1ca/0x280 [ 14.935881] ? report_bug+0xc1/0x160 [ 14.935909] ? handle_bug+0x61/0x90 [ 14.935934] ? exc_invalid_op+0x1a/0x50 [ 14.935959] ? asm_exc_invalid_op+0x1a/0x20 [ 14.935984] ? raw_spin_rq_lock_nested+0x15/0x30 [ 14.936019] ? scx_ops_bypass+0x1ca/0x280 [ 14.936046] ? srso_alias_return_thunk+0x5/0xfbef5 [ 14.936081] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.936111] scx_ops_disable_workfn+0x146/0xac0 [ 14.936142] ? finish_task_switch+0xa9/0x2c0 [ 14.936172] ? srso_alias_return_thunk+0x5/0xfbef5 [ 14.936211] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.936244] kthread_worker_fn+0x101/0x2c0 [ 14.936268] ? __pfx_kthread_worker_fn+0x10/0x10 [ 14.936299] kthread+0xec/0x110 [ 14.936327] ? __pfx_kthread+0x10/0x10 [ 14.936351] ret_from_fork+0x37/0x50 [ 14.936374] ? __pfx_kthread+0x10/0x10 [ 14.936400] ret_from_fork_asm+0x1a/0x30 [ 14.936427] </TASK> [ 14.936443] irq event stamp: 21002 [ 14.936467] hardirqs last enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0 [ 14.936521] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280 [ 14.936571] softirqs last enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0 [ 14.936622] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0 [ 14.936672] ---[ end trace 0000000000000000 ]--- [ 14.953282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 14.953352] ------------[ cut here ]------------ [ 14.953383] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4335 scx_ops_bypass+0x1d8/0x280 [ 14.953428] Modules linked in: [ 14.953453] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Tainted: G W 6.11.0-virtme #24 [ 14.953505] Tainted: [W]=WARN [ 14.953527] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 14.953574] RIP: 0010:scx_ops_bypass+0x1d8/0x280 [ 14.953603] Code: c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 0f 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 92 f3 0f 1e fa 49 8d 84 24 f0 [ 14.953693] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010046 [ 14.953722] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001 [ 14.953763] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fc5fec31318 [ 14.953804] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 [ 14.953845] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300 [ 14.953888] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018 [ 14.953934] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000 [ 14.953974] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 14.954009] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0 [ 14.954052] PKRU: 55555554 [ 14.954068] Call Trace: [ 14.954085] <TASK> [ 14.954102] ? __warn+0xce/0x220 [ 14.954126] ? scx_ops_bypass+0x1d8/0x280 [ 14.954150] ? report_bug+0xc1/0x160 [ 14.954178] ? handle_bug+0x61/0x90 [ 14.954203] ? exc_invalid_op+0x1a/0x50 [ 14.954226] ? asm_exc_invalid_op+0x1a/0x20 [ 14.954250] ? raw_spin_rq_lock_nested+0x15/0x30 [ 14.954285] ? scx_ops_bypass+0x1d8/0x280 [ 14.954311] ? __mutex_unlock_slowpath+0x3a/0x260 [ 14.954343] scx_ops_disable_workfn+0xa3e/0xac0 [ 14.954381] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.954413] kthread_worker_fn+0x101/0x2c0 [ 14.954442] ? __pfx_kthread_worker_fn+0x10/0x10 [ 14.954479] kthread+0xec/0x110 [ 14.954507] ? __pfx_kthread+0x10/0x10 [ 14.954530] ret_from_fork+0x37/0x50 [ 14.954553] ? __pfx_kthread+0x10/0x10 [ 14.954576] ret_from_fork_asm+0x1a/0x30 [ 14.954603] </TASK> [ 14.954621] irq event stamp: 21002 [ 14.954644] hardirqs last enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0 [ 14.954686] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280 [ 14.954735] softirqs last enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0 [ 14.954782] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0 [ 14.954829] ---[ end trace 0000000000000000 ]--- [ 15.022283] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 15.092282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 15.149282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) ok 1 exit # ===== END ===== And with it, the test passes without issue after 1000s of runs: .[root@virtme-ng sched_ext]# ./runner -t exit ===== START ===== TEST: exit DESCRIPTION: Verify we can cleanly exit a scheduler in multiple places OUTPUT: [ 7.412856] sched_ext: BPF scheduler "exit" enabled [ 7.427924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.466677] sched_ext: BPF scheduler "exit" enabled [ 7.475923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.512803] sched_ext: BPF scheduler "exit" enabled [ 7.532924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.586809] sched_ext: BPF scheduler "exit" enabled [ 7.595926] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.661923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.723923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) ok 1 exit # ===== END ===== ============================= RESULTS: PASSED: 1 SKIPPED: 0 FAILED: 0 Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-24sched_ext: Replace set_arg_maybe_null() with __nullable CFI stub tagsTejun Heo
ops.dispatch() and ops.yield() may be fed a NULL task_struct pointer. set_arg_maybe_null() is used to tell the verifier that they should be NULL checked before being dereferenced. BPF now has an a lot prettier way to express this - tagging arguments in CFI stubs with __nullable. Replace set_arg_maybe_null() with __nullable CFI stub tags. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24sched_ext: Rename CFI stubs to names that are recognized by BPFTejun Heo
CFI stubs can be used to tag arguments with __nullable (and possibly other tags in the future) but for that to work the CFI stubs must have names that are recognized by BPF. Rename them. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org>
2024-10-23sched_ext: Introduce LLC awareness to the default idle selection policyAndrea Righi
Rely on the scheduler topology information to implement basic LLC awareness in the sched_ext build-in idle selection policy. This allows schedulers using the built-in policy to make more informed decisions when selecting an idle CPU in systems with multiple LLCs, such as NUMA systems or chiplet-based architectures, and it helps keep tasks within the same LLC domain, thereby improving cache locality. For efficiency, LLC awareness is applied only to tasks that can run on all the CPUs in the system for now. If a task's affinity is modified from user space, it's the responsibility of user space to choose the appropriate optimized scheduling domain. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-23sched_ext: Clarify ops.select_cpu() for single-CPU tasksAndrea Righi
Update ops.select_cpu() documentation to clarify that this method is not called for tasks that are restricted to run on a single CPU, as these tasks do not have the option to select a different CPU. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-18sched_ext: improve WAKE_SYNC behavior for default idle CPU selectionAndrea Righi
In the sched_ext built-in idle CPU selection logic, when handling a WF_SYNC wakeup, we always attempt to migrate the task to the waker's CPU, as the waker is expected to yield the CPU after waking the task. However, it may be preferable to keep the task on its previous CPU if the waker's CPU is cache-affine. The same approach is also used by the fair class and in other scx schedulers, like scx_rusty and scx_bpfland. Therefore, apply the same logic to the built-in idle CPU selection policy as well. Signed-off-by: Andrea Righi <andrea.righi@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-17sched_ext: Use btf_ids to resolve task_structTianchen Ding
Save the searching time during bpf_scx_init. Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-17Merge branch 'linus' into sched/urgent, to resolve conflictIngo Molnar
Conflicts: kernel/sched/ext.c There's a context conflict between this upstream commit: 3fdb9ebcec10 sched_ext: Start schedulers with consistent p->scx.slice values ... and this fix in sched/urgent: 98442f0ccd82 sched: Fix delayed_dequeue vs switched_from_fair() Resolve it. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-10-14sched_ext: Remove unnecessary cpu_relax()David Vernet
As described in commit b07996c7abac ("sched_ext: Don't hold scx_tasks_lock for too long"), we're doing a cond_resched() every 32 calls to scx_task_iter_next() to avoid RCU and other stalls. That commit also added a cpu_relax() to the codepath where we drop and reacquire the lock, but as Waiman described in [0], cpu_relax() should only be necessary in busy loops to avoid pounding on a cacheline (or to allow a hypertwin to more fully utilize a core). Let's remove the unnecessary cpu_relax(). [0]: https://lore.kernel.org/all/35b3889b-904a-4d26-981f-c8aa1557a7c7@redhat.com/ Cc: Waiman Long <llong@redhat.com> Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-11sched: Fix delayed_dequeue vs switched_from_fair()Peter Zijlstra
Commit 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") and its follow up fixes try to deal with a rather unfortunate situation where is task is enqueued in a new class, even though it shouldn't have been. Mostly because the existing ->switched_to/from() hooks are in the wrong place for this case. This all led to Paul being able to trigger failures at something like once per 10k CPU hours of RCU torture. For now, do the ugly thing and move the code to the right place by ignoring the switch hooks. Note: Clean up the whole sched_class::switch*_{to,from}() thing. Fixes: 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241003185037.GA5594@noisy.programming.kicks-ass.net
2024-10-10sched_ext: Don't hold scx_tasks_lock for too longTejun Heo
While enabling and disabling a BPF scheduler, every task is iterated a couple times by walking scx_tasks. Except for one, all iterations keep holding scx_tasks_lock. On multi-socket systems under heavy rq lock contention and high number of threads, this can can lead to RCU and other stalls. The following is triggered on a 2 x AMD EPYC 7642 system (192 logical CPUs) running `stress-ng --workload 150 --workload-threads 10` with >400k idle threads and RCU stall period reduced to 5s: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: 91-...!: (10 ticks this GP) idle=0754/1/0x4000000000000000 softirq=18204/18206 fqs=17 rcu: 186-...!: (17 ticks this GP) idle=ec54/1/0x4000000000000000 softirq=25863/25866 fqs=17 rcu: (detected by 80, t=10042 jiffies, g=89305, q=33 ncpus=192) Sending NMI from CPU 80 to CPUs 91: NMI backtrace for cpu 91 CPU: 91 UID: 0 PID: 284038 Comm: sched_ext_ops_h Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471 Hardware name: Supermicro Super Server/H11DSi, BIOS 2.8 12/14/2023 Sched_ext: simple (disabling+all) RIP: 0010:queued_spin_lock_slowpath+0x17b/0x2f0 Code: 02 c0 10 03 00 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8 48 8b 11 48 85 d2 74 09 0f 0d 0a eb 0a 31 d2 eb 06 31 d2 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 39 d8 75 0d be 01 00 00 00 89 d8 f0 0f b1 37 RSP: 0018:ffffc9000fadfcb8 EFLAGS: 00000002 RAX: 0000000001700001 RBX: 0000000001700000 RCX: ffff88bfcaaf10c0 RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff88bfca8f0080 RBP: 0000000001700000 R08: 0000000000000090 R09: ffffffffffffffff R10: ffff88a74761b268 R11: 0000000000000000 R12: ffff88a6b6765460 R13: ffffc9000fadfd60 R14: ffff88bfca8f0080 R15: ffff88bfcaac0000 FS: 0000000000000000(0000) GS:ffff88bfcaac0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f5c55f526a0 CR3: 0000000afd474000 CR4: 0000000000350eb0 Call Trace: <NMI> </NMI> <TASK> do_raw_spin_lock+0x9c/0xb0 task_rq_lock+0x50/0x190 scx_task_iter_next_locked+0x157/0x170 scx_ops_disable_workfn+0x2c2/0xbf0 kthread_worker_fn+0x108/0x2a0 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 </TASK> Sending NMI from CPU 80 to CPUs 186: NMI backtrace for cpu 186 CPU: 186 UID: 0 PID: 51248 Comm: fish Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471 scx_task_iter can safely drop locks while iterating. Make scx_task_iter_next() drop scx_tasks_lock every 32 iterations to avoid stalls. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-10-10sched_ext: Move scx_tasks_lock handling into scx_task_iter helpersTejun Heo
Iterating with scx_task_iter involves scx_tasks_lock and optionally the rq lock of the task being iterated. Both locks can be released during iteration and the iteration can be continued after re-grabbing scx_tasks_lock. Currently, all lock handling is pushed to the caller which is a bit cumbersome and makes it difficult to add lock-aware behaviors. Make the scx_task_iter helpers handle scx_tasks_lock. - scx_task_iter_init/scx_taks_iter_exit() now grabs and releases scx_task_lock, respectively. Renamed to scx_task_iter_start/scx_task_iter_stop() to more clearly indicate that there are non-trivial side-effects. - Add __ prefix to scx_task_iter_rq_unlock() to indicate that the function is internal. - Add scx_task_iter_unlock/relock(). The former drops both rq lock (if held) and scx_tasks_lock and the latter re-locks only scx_tasks_lock. This doesn't cause behavior changes and will be used to implement stall avoidance. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-10-10sched_ext: bypass mode shouldn't depend on ops.select_cpu()Tejun Heo
Bypass mode was depending on ops.select_cpu() which can't be trusted as with the rest of the BPF scheduler. Always enable and use scx_select_cpu_dfl() in bypass mode. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-10-10sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl()Tejun Heo
Move the sanity check from the inner function scx_select_cpu_dfl() to the exported kfunc scx_bpf_select_cpu_dfl(). This doesn't cause behavior differences and will allow using scx_select_cpu_dfl() in bypass mode regardless of scx_builtin_idle_enabled. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-10sched_ext: Start schedulers with consistent p->scx.slice valuesTejun Heo
The disable path caps p->scx.slice to SCX_SLICE_DFL. As the field is already being ignored at this stage during disable, the only effect this has is that when the next BPF scheduler is loaded, it won't see unreasonable left-over slices. Ultimately, this shouldn't matter but it's better to start in a known state. Drop p->scx.slice capping from the disable path and instead reset it to SCX_SLICE_DFL in the enable path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-10-10Revert "sched_ext: Use shorter slice while bypassing"Tejun Heo
This reverts commit 6f34d8d382d64e7d8e77f5a9ddfd06f4c04937b0. Slice length is ignored while bypassing and tasks are switched on every tick and thus the patch does not make any difference. The perceived difference was from test noise. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-10-10sched_ext: use correct function name in pick_task_scx() warning messageHonglei Wang
pick_next_task_scx() was turned into pick_task_scx() since commit 753e2836d139 ("sched_ext: Unify regular and core-sched pick task paths"). Update the outdated message. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-07sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTEDTejun Heo
scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to tell whether ops.select_cpu() was called. This is incorrect as ops.select_cpu() can be skipped in the wakeup path and leads to e.g. incorrectly skipping direct dispatch for tasks that are bound to a single CPU. sched core has been updated to specify ENQUEUE_RQ_SELECTED if ->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update scx_qmap to test it instead of SCX_ENQ_WAKEUP. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-04sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init()Tejun Heo
568894edbe48 ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()") assumed that scx_cgroup_exit() is only called after scx_cgroup_init() finished successfully. This isn't true. scx_cgroup_exit() can be called without scx_cgroup_init() being called at all or after scx_cgroup_init() failed in the middle. As init state is tracked per cgroup, scx_cgroup_exit() can be used safely to clean up in all cases. Remove the incorrect WARN_ON_ONCE(). Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 568894edbe48 ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()")
2024-10-04sched_ext: Improve error reporting during loadingTejun Heo
When the BPF scheduler fails, ops.exit() allows rich error reporting through scx_exit_info. Use scx.exit() path consistently for all failures which can be caused by the BPF scheduler: - scx_ops_error() is called after ops.init() and ops.cgroup_init() failure to record error information. - ops.init_task() failure now uses scx_ops_error() instead of pr_err(). - The err_disable path updated to automatically trigger scx_ops_error() to cover cases that the error message hasn't already been generated and always return 0 indicating init success so that the error is reported through ops.exit(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-09-27sched/ext: Use tg_cgroup() to elieminate duplicate codeZhang Qiao
Use tg_cgroup() to eliminate duplicate code patterns in scx_bpf_task_cgroup(). Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27sched/ext: Fix unmatch trailing comment of CONFIG_EXT_GROUP_SCHEDZhang Qiao
The #endif trailing comment of CONFIG_EXT_GROUP_SCHED is unmatched, so fix it. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27sched_ext: Factor out move_task_between_dsqs() from scx_dispatch_from_dsq()Tejun Heo
Pure reorganization. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27sched_ext: Remove redundant p->nr_cpus_allowed checkerZhang Qiao
select_rq_task() already checked that 'p->nr_cpus_allowed > 1', 'p->nr_cpus_allowed == 1' checker in scx_select_cpu_dfl() is redundant. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27sched_ext: Decouple locks in scx_ops_enable()Tejun Heo
The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and cpus_read_lock. Currently, the locks are grabbed together which is prone to locking order problems. For example, currently, there is a possible deadlock involving scx_fork_rwsem and cpus_read_lock. cpus_read_lock has to nest inside scx_fork_rwsem due to locking order existing in other subsystems. However, there exists a dependency in the other direction during hotplug if hotplug needs to fork a new task, which happens in some cases. This leads to the following deadlock: scx_ops_enable() hotplug percpu_down_write(&cpu_hotplug_lock) percpu_down_write(&scx_fork_rwsem) block on cpu_hotplug_lock kthread_create() waits for kthreadd kthreadd blocks on scx_fork_rwsem Note that this doesn't trigger lockdep because the hotplug side dependency bounces through kthreadd. With the preceding scx_cgroup_enabled change, this can be solved by decoupling cpus_read_lock, which is needed for static_key manipulations, from the other two locks. - Move the first block of static_key manipulations outside of scx_fork_rwsem and scx_cgroup_rwsem. This is now safe with the preceding scx_cgroup_enabled change. - Drop scx_cgroup_rwsem and scx_fork_rwsem between the two task iteration blocks so that __scx_ops_enabled static_key enabling is outside the two rwsems. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
2024-09-27sched_ext: Decouple locks in scx_ops_disable_workfn()Tejun Heo
The disable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and cpus_read_lock. Currently, the locks are grabbed together which is prone to locking order problems. With the preceding scx_cgroup_enabled change, we can decouple them: - As cgroup disabling no longer requires modifying a static_key which requires cpus_read_lock(), no need to grab cpus_read_lock() before grabbing scx_cgroup_rwsem. - cgroup can now be independently disabled before tasks are moved back to the fair class. Relocate scx_cgroup_exit() invocation before scx_fork_rwsem is grabbed, drop now unnecessary cpus_read_lock() and move static_key operations out of scx_fork_rwsem. This decouples all three locks in the disable path. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
2024-09-27sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix ↵Tejun Heo
scx_tg_online() If the BPF scheduler does not implement ops.cgroup_init(), scx_tg_online() didn't set SCX_TG_INITED which meant that ops.cgroup_exit(), even if implemented, won't be called from scx_tg_offline(). This is because SCX_HAS_OP(cgroupt_init) is used to test both whether SCX cgroup operations are enabled and ops.cgroup_init() exists. Fix it by introducing a separate bool scx_cgroup_enabled to gate cgroup operations and use SCX_HAS_OP(cgroup_init) only to test whether ops.cgroup_init() exists. Make all cgroup operations consistently use scx_cgroup_enabled to test whether cgroup operations are enabled. scx_cgroup_enabled is added instead of using scx_enabled() to ease planned locking updates. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27sched_ext: Enable scx_ops_init_task() separatelyTejun Heo
scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path were gated by scx_enabled() test and thus __scx_ops_enabled had to be turned on before the first scx_ops_init_task() loop in scx_ops_enable(). However, if an external entity causes sched_class switch before the loop is complete, tasks which are not initialized could be switched to SCX. The following can be reproduced by running a program which keeps toggling a process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2). sched_ext: Invalid task state transition 0 -> 3 for fish[1623] WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200 ... Sched_ext: simple (enabling) RIP: 0010:scx_ops_enable_task+0x1a1/0x200 ... switching_to_scx+0x13/0xa0 __sched_setscheduler+0x850/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fix it by gating scx_ops_init_task() separately using scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are finished with scx_ops_init_task(). Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()Tejun Heo
scx_ops_enable() has two task iteration loops. The first one calls scx_ops_init_task() on every task and the latter switches the eligible ones into SCX. The first loop left the tasks in SCX_TASK_INIT state and then the second loop switched it into READY before switching the task into SCX. The distinction between INIT and READY is only meaningful in the fork path where it's used to tell whether the task finished forking so that we can tell ops.exit_task() accordingly. Leaving task in INIT state between the two loops is incosistent with the fork path and incorrect. The following can be triggered by running a program which keeps toggling a task between SCHED_OTHER and SCHED_SCX while enabling a task: sched_ext: Invalid task state transition 1 -> 3 for fish[1526] WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200 ... Sched_ext: qmap (enabling+all) RIP: 0010:scx_ops_enable_task+0x1a1/0x200 ... switching_to_scx+0x13/0xa0 __sched_setscheduler+0x850/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fix it by transitioning to READY in the first loop right after scx_ops_init_task() succeeds. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com>
2024-09-27sched_ext: Initialize in bypass modeTejun Heo
scx_ops_enable() used preempt_disable() around the task iteration loop to switch tasks into SCX to guarantee forward progress of the task which is running scx_ops_enable(). However, in the gap between setting __scx_ops_enabled and preeempt_disable(), an external entity can put tasks including the enabling one into SCX prematurely, which can lead to malfunctions including stalls. The bypass mode can wrap the entire enabling operation and guarantee forward progress no matter what the BPF scheduler does. Use the bypass mode instead to guarantee forward progress while enabling. While at it, release and regrab scx_tasks_lock between the two task iteration locks in scx_ops_enable() for clarity as there is no reason to keep holding the lock between them. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27sched_ext: Remove SCX_OPS_PREPPINGTejun Heo
The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used anywhere and only adds confusion. Drop SCX_OPS_PREPPING. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-27sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()Tejun Heo
check_hotplug_seq() is used to detect CPU hotplug event which occurred while the BPF scheduler is being loaded so that initialization can be retried if CPU hotplug events take place before the CPU hotplug callbacks are online. As such, the best place to call it is in the same cpu_read_lock() section that enables the CPU hotplug ops. Currently, it is called in the next cpus_read_lock() block in scx_ops_enable(). The side effect of this placement is a small window in which hotplug sequence detection can trigger unnecessarily, which isn't critical. Move check_hotplug_seq() invocation to the same cpus_read_lock() block as the hotplug operation enablement to close the window and get the invocation out of the way for planned locking updates. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com>
2024-09-26sched_ext: Use shorter slice while bypassingTejun Heo
While bypassing, tasks are scheduled in FIFO order which favors tasks that hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While bypassing, guaranteeing timely forward progress is the main goal. There's no point in giving long slices. Shorten the time slice used while bypassing from 20ms to 5ms. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-26sched_ext: Split the global DSQ per NUMA nodeTejun Heo
In the bypass mode, the global DSQ is used to schedule all tasks in simple FIFO order. All tasks are queued into the global DSQ and all CPUs try to execute tasks from it. This creates a lot of cross-node cacheline accesses and scheduling across the node boundaries, and can lead to live-lock conditions where the system takes tens of minutes to disable the BPF scheduler while executing in the bypass mode. Split the global DSQ per NUMA node. Each node has its own global DSQ. When a task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ local to the task's CPU and all CPUs in a node only consume its node-local global DSQ. This resolves a livelock condition which could be reliably triggered on an 2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with `stress-ng --workload 80 --workload-threads 10` while repeatedly enabling and disabling a SCX scheduler. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-26sched_ext: Relocate find_user_dsq()Tejun Heo
To prepare for the addition of find_global_dsq(). No functional changes. Signed-off-by: tejun heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-26sched_ext: Allow only user DSQs for scx_bpf_consume(), ↵Tejun Heo
scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new() SCX_DSQ_GLOBAL is special in that it can't be used as a priority queue and is consumed implicitly, but all BPF DSQ related kfuncs could be used on it. SCX_DSQ_GLOBAL will be split per-node for scalability and those operations won't make sense anymore. Disallow SCX_DSQ_GLOBAL on scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new(). This means that SCX_DSQ_GLOBAL can only be used as a dispatch target from BPF schedulers. With scx_flatcg, which was using SCX_DSQ_GLOBAL as the fallback DSQ, updated, this shouldn't affect any schedulers. This leaves find_dsq_for_dispatch() the only user of find_non_local_dsq(). Open code and remove find_non_local_dsq(). Signed-off-by: tejun heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-24sched_ext: Build fix for !CONFIG_SMPTejun Heo
move_remote_task_to_local_dsq() is only defined on SMP configs but scx_disaptch_from_dsq() was calling move_remote_task_to_local_dsq() on UP configs too causing build failures. Add a dummy move_remote_task_to_local_dsq() which triggers a warning. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409241108.jaocHiDJ-lkp@intel.com/
2024-09-23sched_ext: Provide a sysfs enable_seq counterAndrea Righi
As discussed during the distro-centric session within the sched_ext Microconference at LPC 2024, introduce a sequence counter that is incremented every time a BPF scheduler is loaded. This feature can help distributions in diagnosing potential performance regressions by identifying systems where users are running (or have ran) custom BPF schedulers. Example: arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq 0 arighi@virtme-ng~> sudo scx_simple local=1 global=0 ^CEXIT: unregistered from user space arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq 1 In this way user-space tools (such as Ubuntu's apport and similar) are able to gather and include this information in bug reports. Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com> Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Cc: Phil Auld <pauld@redhat.com> Signed-off-by: Andrea Righi <andrea.righi@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-23sched_ext: Fix build when !CONFIG_STACKTRACETejun Heo
a2f4b16e736d ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried fixing build when !CONFIG_STACKTRACE but didn't so fully. Also put stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix build when !CONFIG_STACKTRACE. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/
2024-09-10sched_ext: Don't trigger ops.quiescent/runnable() on migrationsTejun Heo
A task moving across CPUs should not trigger quiescent/runnable task state events as the task is staying runnable the whole time and just stopping and then starting on different CPUs. Suppress quiescent/runnable task state events if task_on_rq_migrating(). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: David Vernet <void@manifault.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10sched_ext: Synchronize bypass state changes with rq lockTejun Heo
While the BPF scheduler is being unloaded, the following warning messages trigger sometimes: NOHZ tick-stop error: local softirq work is pending, handler #80!!! This is caused by the CPU entering idle while there are pending softirqs. The main culprit is the bypassing state assertion not being synchronized with rq operations. As the BPF scheduler cannot be trusted in the disable path, the first step is entering the bypass mode where the BPF scheduler is ignored and scheduling becomes global FIFO. This is implemented by turning scx_ops_bypassing() true. However, the transition isn't synchronized against anything and it's possible for enqueue and dispatch paths to have different ideas on whether bypass mode is on. Make each rq track its own bypass state with SCX_RQ_BYPASSING which is modified while rq is locked. This removes most of the NOHZ tick-stop messages but not completely. I believe the stragglers are from the sched core bug where pick_task_scx() can be called without preceding balance_scx(). Once that bug is fixed, we should verify that all occurrences of this error message are gone too. v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that the per-cpu states are always synchronized with the global state. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()Tejun Heo
Once a task is put into a DSQ, the allowed operations are fairly limited. Tasks in the built-in local and global DSQs are executed automatically and, ignoring dequeue, there is only one way a task in a user DSQ can be manipulated - scx_bpf_consume() moves the first task to the dispatching local DSQ. This inflexibility sometimes gets in the way and is an area where multiple feature requests have been made. Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context which dosen't hold a rq lock including BPF timers and SYSCALL programs. This is an expansion of an earlier patch which only allowed moving into the dispatching local DSQ: http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument count limit and often won't be needed anyway. Instead provide scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called only when needed and override the specified parameter for the subsequent dispatch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: David Vernet <void@manifault.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-09-09sched_ext: Compact struct bpf_iter_scx_dsq_kernTejun Heo
struct scx_iter_scx_dsq is defined as 6 u64's and scx_dsq_iter_kern was using 5 of them. We want to add two more u64 fields but it's better if we do so while staying within scx_iter_scx_dsq to maintain binary compatibility. The way scx_iter_scx_dsq_kern is laid out is rather inefficient - the node field takes up three u64's but only one bit of the last u64 is used. Turn the bool into u32 flags and only use the lower 16 bits freeing up 48 bits - 16 bits for flags, 32 bits for a u32 - for use by struct bpf_iter_scx_dsq_kern. This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern into the cursor field reducing the struct size by a full u64. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()Tejun Heo
- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq(). - Rename consume_local_task() to move_local_task_to_local_dsq() and remove task_unlink_from_dsq() and source DSQ unlocking from it. This is to make the migration code easier to reuse. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Move consume_local_task() upwardTejun Heo
So that the local case comes first and two CONFIG_SMP blocks can be merged. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()Tejun Heo
All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into task_unlink_from_dsq(). Also move sanity check into it. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Reorder args for consume_local/remote_task()Tejun Heo
Reorder args for consistency in the order of: current_rq, p, src_[rq|dsq], dst_[rq|dsq]. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09sched_ext: Restructure dispatch_to_local_dsq()Tejun Heo
Now that there's nothing left after the big if block, flip the if condition and unindent the body. No functional changes intended. v2: Add BUG() to clarify control can't reach the end of dispatch_to_local_dsq() in UP kernels per David. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handlingTejun Heo
With the preceding update, the only return value which makes meaningful difference is DTL_INVALID, for which one caller, finish_dispatch(), falls back to the global DSQ and the other, process_ddsp_deferred_locals(), doesn't do anything. It should always fallback to the global DSQ. Move the global DSQ fallback into dispatch_to_local_dsq() and remove the return value. v2: Patch title and description updated to reflect the behavior fix for process_ddsp_deferred_locals(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ONTejun Heo
find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON. Instead, each caller is hanlding SCX_DSQ_LOCAL_ON before calling it. Move SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate code in direct_dispatch() and dispatch_to_local_dsq(). No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>