summaryrefslogtreecommitdiff
path: root/kernel/sched/ext.c
AgeCommit message (Collapse)Author
2025-05-28Merge tag 'bpf-next-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Fix and improve BTF deduplication of identical BTF types (Alan Maguire and Andrii Nakryiko) - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and Alexis Lothoré) - Support load-acquire and store-release instructions in BPF JIT on riscv64 (Andrea Parri) - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton Protopopov) - Streamline allowed helpers across program types (Feng Yang) - Support atomic update for hashtab of BPF maps (Hou Tao) - Implement json output for BPF helpers (Ihor Solodrai) - Several s390 JIT fixes (Ilya Leoshkevich) - Various sockmap fixes (Jiayuan Chen) - Support mmap of vmlinux BTF data (Lorenz Bauer) - Support BPF rbtree traversal and list peeking (Martin KaFai Lau) - Tests for sockmap/sockhash redirection (Michal Luczaj) - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko) - Add support for dma-buf iterators in BPF (T.J. Mercier) - The verifier support for __bpf_trap() (Yonghong Song) * tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits) bpf, arm64: Remove unused-but-set function and variable. selftests/bpf: Add tests with stack ptr register in conditional jmp bpf: Do not include stack ptr register in precision backtracking bookkeeping selftests/bpf: enable many-args tests for arm64 bpf, arm64: Support up to 12 function arguments bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem() bpf: Avoid __bpf_prog_ret0_warn when jit fails bpftool: Add support for custom BTF path in prog load/loadall selftests/bpf: Add unit tests with __bpf_trap() kfunc bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable bpf: Remove special_kfunc_set from verifier selftests/bpf: Add test for open coded dmabuf_iter selftests/bpf: Add test for dmabuf_iter bpf: Add open coded dmabuf iterator bpf: Add dmabuf iterator dma-buf: Rename debugfs symbols bpf: Fix error return value in bpf_copy_from_user_dynptr libbpf: Use mmap to parse vmlinux BTF from sysfs selftests: bpf: Add a test for mmapable vmlinux BTF btf: Allow mmap of vmlinux btf ...
2025-05-20sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.cAndrea Righi
Relocate the scx_kf_allowed_if_unlocked(), so it can be used from other source files (e.g., ext_idle.c). No functional change. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-14sched_ext: Explain the temporary situation around scx_root dereferencesTejun Heo
Naked scx_root dereferences are being used as temporary markers to indicate that they need to be updated to point to the right scheduler instance. Explain the situation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Add @sch to SCX_CALL_OP*()Tejun Heo
In preparation of hierarchical scheduling support, add @sch to scx_exit() and friends: - scx_exit/error() updated to take explicit @sch instead of assuming scx_root. - scx_kf_exit/error() added. These are to be used from kfuncs, don't take @sch and internally determine the scx_sched instance to abort. Currently, it's always scx_root but once multiple scheduler support is in place, it will be the scx_sched instance that invoked the kfunc. This simplifies many callsites and defers scx_sched lookup until error is triggered. - @sch is propagated to ops_cpu_valid() and ops_sanitize_err(). The CPU validity conditions in ops_cpu_valid() are factored into __cpu_valid() to implement kf_cpu_valid() which is the counterpart to scx_kf_exit/error(). - All users are converted. Most conversions are straightforward. check_rq_for_timeouts() and scx_softlockup() are updated to use explicit rcu_dereference*(scx_root) for safety as they may execute asynchronous to the exit path. scx_tick() is also updated to use rcu_dereference(). While not strictly necessary due to the preceding scx_enabled() test and IRQ disabled context, this removes the subtlety at no noticeable cost. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Cleanup [__]scx_exit/error*()Tejun Heo
__scx_exit() is the base exit implementation and there are three wrappers on top of it - scx_exit(), __scx_error() and scx_error(). This is more confusing than helpful especially given that there are only a couple users of scx_exit() and __scx_error(). To simplify the situation: - Make __scx_exit() take va_list and rename it to scx_vexit(). This is to ease implementing more complex extensions on top. - Make scx_exit() a varargs wrapper around __scx_exit(). scx_exit() now takes both @kind and @exit_code. - Convert existing scx_exit() and __scx_error() users to use the new scx_exit(). - scx_error() remains unchanged. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Add @sch to SCX_CALL_OP*()Tejun Heo
In preparation of hierarchical scheduling support, make SCX_CALL_OP*() take explicit @sch instead of assuming scx_root. As scx_root is still the only scheduler instance, this patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Clean up scx_root usagesTejun Heo
- Always cache scx_root into local variable sch before using. - Don't use scx_root if cached sch is available. - Wrap !sch test with unlikely(). - Pass @scx into scx_cgroup_init/exit(). No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-09sched_ext: Remove bpf_scx_get_func_protoFeng Yang
task_storage_{get,delete} has been moved to bpf_base_func_proto. Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20250506061434.94277-3-yangfeng59949@163.com
2025-05-07Merge branch 'for-6.15-fixes' into for-6.16Tejun Heo
To receive 428dc9fc0873 ("sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator") which conflicts with cdf5a6faa8cf ("sched_ext: Move dsq_hash into scx_sched"). The conflict is a simple context conflict which can be resolved by taking changes from both changes in the right order.
2025-05-07sched_ext: bpf_iter_scx_dsq_new() should always initialize iteratorTejun Heo
BPF programs may call next() and destroy() on BPF iterators even after new() returns an error value (e.g. bpf_for_each() macro ignores error returns from new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized state after an error return causing bpf_iter_scx_dsq_next() to dereference garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that next() and destroy() become noops. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 650ba21b131e ("sched_ext: Implement DSQ iterator") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-30sched_ext: Avoid NULL scx_root deref in __scx_exit()Andrea Righi
A sched_ext scheduler may trigger __scx_exit() from a BPF timer callback, where scx_root may not be safely dereferenced. This can lead to a NULL pointer dereference as shown below (triggered by scx_tickless): BUG: kernel NULL pointer dereference, address: 0000000000000330 ... CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full) RIP: 0010:__scx_exit+0x2b/0x190 ... Call Trace: <IRQ> scx_bpf_get_idle_smtmask+0x59/0x80 bpf_prog_8320d4217989178c_dispatch_all_cpus+0x35/0x1b6 ... bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264 bpf_timer_cb+0x7a/0x140 __hrtimer_run_queues+0x1f9/0x3a0 hrtimer_run_softirq+0x8c/0xd0 handle_softirqs+0xd3/0x3d0 __irq_exit_rcu+0x9a/0xc0 irq_exit_rcu+0xe/0x20 Fix this by checking for a valid scx_root and adding proper RCU protection. Fixes: 48e1267773866 ("sched_ext: Introduce scx_sched") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-30sched_ext: Add RCU protection to scx_root in DSQ iteratorAndrea Righi
Using a DSQ iterators from a timer callback can trigger the following lockdep splat when accessing scx_root: ============================= WARNING: suspicious RCU usage 6.14.0-virtme #1 Not tainted ----------------------------- kernel/sched/ext.c:6907 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by swapper/0/0. stack backtrace: CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full) Sched_ext: tickless (enabled+all) Call Trace: <IRQ> dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x4e/0xa3 bpf_iter_scx_dsq_new+0xb1/0xd0 bpf_prog_63f4fd1bccc101e7_dispatch_cpu+0x3e/0x156 bpf_prog_8320d4217989178c_dispatch_all_cpus+0x153/0x1b6 bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264 ? hrtimer_run_softirq+0x4f/0xd0 bpf_timer_cb+0x7a/0x140 __hrtimer_run_queues+0x1f9/0x3a0 hrtimer_run_softirq+0x8c/0xd0 handle_softirqs+0xd3/0x3d0 __irq_exit_rcu+0x9a/0xc0 irq_exit_rcu+0xe/0x20 sysvec_apic_timer_interrupt+0x73/0x80 Add a proper dereference check to explicitly validate RCU-safe access to scx_root from rcu_read_lock() contexts and also from contexts that hold rcu_read_lock_bh(), such as timer callbacks. Fixes: cdf5a6faa8cf0 ("sched_ext: Move dsq_hash into scx_sched") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-29sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn()Tejun Heo
With the global states and disable machinery moved into scx_sched, scx_disable_workfn() can only be scheduled and run for the specific scheduler instance. This makes it impossible for scx_disable_workfn() to see SCX_EXIT_NONE. Turn that condition into WARN_ON_ONCE(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Move disable machinery into scx_schedTejun Heo
Because disable can be triggered from any place and the scheduler cannot be trusted, disable path uses an irq_work to bounce and a kthread_work which is executed on an RT helper kthread to perform disable. These must be per scheduler instance to guarantee forward progress. Move them into scx_sched. - If an scx_sched is accessible, its helper kthread is always valid making the `helper` check in schedule_scx_disable_work() unnecessary. As the function becomes trivial after the removal of the test, inline it. - scx_create_rt_helper() has only one user - creation of the disable helper kthread. Inline it into scx_alloc_and_add_sched(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Move event_stats_cpu into scx_schedTejun Heo
The event counters are going to become per scheduler instance. Move event_stats_cpu into scx_sched. - [__]scx_add_event() are updated to take @sch. While at it, add missing parentheses around @cnt expansions. - scx_read_events() is updated to take @sch. - scx_bpf_events() accesses scx_root under RCU read lock. v2: - Replace stale scx_bpf_get_event_stat() reference in a comment with scx_bpf_events(). - Trivial goto label rename. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Factor out scx_read_events()Tejun Heo
In prepration of moving event_stats_cpu into scx_sched, factor out scx_read_events() out of scx_bpf_events() and update the in-kernel users - scx_attr_events_show() and scx_dump_state() - to use scx_read_events() instead of scx_bpf_events(). No observable behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Relocate scx_event_stats definitionTejun Heo
In prepration of moving event_stats_cpu into scx_sched, move scx_event_stats definitions above scx_sched definition. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Move global_dsqs into scx_schedTejun Heo
Global DSQs are going to become per scheduler instance. Move global_dsqs into scx_sched. find_global_dsq() already takes a task_struct pointer as an argument and should later be able to determine the scx_sched to use from that. For now, assume scx_root. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Move dsq_hash into scx_schedTejun Heo
User DSQs are going to become per scheduler instance. Move dsq_hash into scx_sched. This shifts the code that assumes scx_root to be the only scx_sched instance up the call stack but doesn't remove them yet. v2: Add missing rcu_read_lock() in scx_bpf_destroy_dsq() as per Andrea. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Factor out scx_alloc_and_add_sched()Tejun Heo
More will be moved into scx_sched. Factor out the allocation and kobject addition path into scx_alloc_and_add_sched(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Inline create_dsq() into scx_bpf_create_dsq()Tejun Heo
create_dsq() is only used by scx_bpf_create_dsq() and the separation gets in the way of making dsq_hash per scx_sched. Inline it into scx_bpf_create_dsq(). While at it, add unlikely() around SCX_DSQ_FLAG_BUILTIN test. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Use dynamic allocation for scx_schedTejun Heo
To prepare for supporting multiple schedulers, make scx_sched allocated dynamically. scx_sched->kobj is now an embedded field and the kobj's lifetime determines the lifetime of the containing scx_sched. - Enable path is updated so that kobj init and addition are performed later. - scx_sched freeing is initiated in scx_kobj_release() and also goes through an rcu_work so that scx_root can be accessed from an unsynchronized path - scx_disable(). - sched_ext_ops->priv is added and used to point to scx_sched instance created for the ops instance. This is used by bpf_scx_unreg() to determine the scx_sched instance to disable and put. No behavior changes intended. v2: Andrea reported kernel oops due to scx_bpf_unreg() trying to deref NULL scx_root after scheduler init failure. sched_ext_ops->priv added so that scx_bpf_unreg() can always find the scx_sched instance to unregister even if it failed early during init. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Avoid NULL scx_root deref through SCX_HAS_OP()Tejun Heo
SCX_HAS_OP() tests scx_root->has_op bitmap. The bitmap is currently in a statically allocated struct scx_sched and initialized while loading the BPF scheduler and cleared while unloading, and thus can be tested anytime. However, scx_root will be switched to dynamic allocation and thus won't always be deferenceable. Most usages of SCX_HAS_OP() are already protected by scx_enabled() either directly or indirectly (e.g. through a task which is on SCX). However, there are a couple places that could try to dereference NULL scx_root. Update them so that scx_root is guaranteed to be valid before SCX_HAS_OP() is called. - In handle_hotplug(), test whether scx_root is NULL before doing anything else. This is safe because scx_root updates will be protected by cpus_read_lock(). - In scx_tg_offline(), test scx_cgroup_enabled before invoking SCX_HAS_OP(), which should guarnatee that scx_root won't turn NULL. This is also in line with other cgroup operations. As the code path is synchronized against scx_cgroup_init/exit() through scx_cgroup_rwsem, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Introduce scx_schedTejun Heo
To support multiple scheduler instances, collect some of the global variables that should be specific to a scheduler instance into the new struct scx_sched. scx_root is the root scheduler instance and points to a static instance of struct scx_sched. Except for an extra dereference through the scx_root pointer, this patch makes no functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29Merge branch 'for-6.15-fixes' into for-6.16Tejun Heo
To receive e38be1c7647c ("sched_ext: Fix rq lock state in hotplug ops") to avoid conflicts with scx_sched related patches pending for for-6.16.
2025-04-29sched_ext: Fix rq lock state in hotplug opsAndrea Righi
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume that the rq involved in the operation is locked, which is not the case during hotplug, triggering the following warning: WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340 Fix by not tracking the target rq as locked in the context of ops.cpu_online() and ops.cpu_offline(). Fixes: 18853ba782bef ("sched_ext: Track currently locked rq") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Tested-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-25sched_ext: Remove duplicate BTF_ID_FLAGS definitionsAndrea Righi
Some kfuncs specific to the idle CPU selection policy are registered in both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though they should only be defined in the latter. Remove the duplicates from scx_kfunc_ids_any. Fixes: 337d1b354a297 ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-23sched_ext: Clarify CPU context for running/stopping callbacksAndrea Righi
The ops.running() and ops.stopping() callbacks can be invoked from a CPU other than the one the task is assigned to, particularly when a task property is changed, as both scx_next_task_scx() and dequeue_task_scx() may run on CPUs different from the task's target CPU. This behavior can lead to confusion or incorrect assumptions if not properly clarified, potentially resulting in bugs (see [1]). Therefore, update the documentation to clarify this aspect and advise users to use scx_bpf_task_cpu() to determine the actual CPU the task will run on or was running on. [1] https://github.com/sched-ext/scx/pull/1728 Cc: Jake Hillion <jake@hillion.co.uk> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22Merge branch 'for-6.15-fixes' into for-6.16Tejun Heo
a11d6784d731 ("sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()") added a call to scx_ops_error() which was renamed to scx_error() in for-6.16. Fix it up.
2025-04-22sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()Andrea Righi
scx_bpf_cpuperf_set() can be used to set a performance target level on any CPU. However, it doesn't correctly acquire the corresponding rq lock, which may lead to unsafe behavior and trigger the following warning, due to the lockdep_assert_rq_held() check: [ 51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0 ... [ 51.713836] Call trace: [ 51.713837] scx_bpf_cpuperf_set+0x1a0/0x1e0 (P) [ 51.713839] bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440 [ 51.713841] bpf__sched_ext_ops_init+0x54/0x8c [ 51.713843] scx_ops_enable.constprop.0+0x2c0/0x10f0 [ 51.713845] bpf_scx_reg+0x18/0x30 [ 51.713847] bpf_struct_ops_link_create+0x154/0x1b0 [ 51.713849] __sys_bpf+0x1934/0x22a0 Fix by properly acquiring the rq lock when possible or raising an error if we try to operate on a CPU that is not the one currently locked. Fixes: d86adb4fc0655 ("sched_ext: Add cpuperf support") Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22sched_ext: Track currently locked rqAndrea Righi
Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks. While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors, see for example [1]. To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*(). This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed. [1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-18sched_ext: add helper for refill task with default sliceHonglei Wang
Add helper for refilling task with default slice and event statistics accordingly. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-18sched_ext: change the variable name for slice refill eventHonglei Wang
SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs when the tasks were enqueued, which seems not accurate. What it actually means is the refilling with defalt slice, and this can occur either when enqueue or pick_task. Let's change the variable to SCX_EV_REFILL_SLICE_DFL. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-09sched_ext: Make scx_has_op a bitmapTejun Heo
scx_has_op is used to encode which ops are implemented by the BPF scheduler into an array of static_keys. While this saves a bit of branching overhead, that is unlikely to be noticeable compared to the overall cost. As the global static_keys can't work with the planned hierarchical multiple scheduler support, replace the static_key array with a bitmap. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09sched_ext: Remove scx_ops_allow_queued_wakeup static_keyTejun Heo
scx_ops_allow_queued_wakeup is used to encode SCX_OPS_ALLOW_QUEUED_WAKEUP into a static_key. The test is gated behind scx_enabled(), and, even when sched_ext is enabled, is unlikely for the static_key usage to make any meaningful difference. It is made to use a static_key mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_key and instead test SCX_OPS_ALLOW_QUEUED_WAKEUP directly. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09sched_ext: Remove scx_ops_cpu_preempt static_keyTejun Heo
scx_ops_cpu_preempt is used to encode whether ops.cpu_acquire/release() are implemented into a static_key. These tests aren't hot enough for static_key usage to make any meaningful difference and are made to use a static_key mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_key and instead use an internal ops flag SCX_OPS_HAS_CPU_PREEMPT to record and test whether ops.cpu_acquire/release() are implemented. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09sched_ext: Remove scx_ops_enq_* static_keysTejun Heo
scx_ops_enq_last/exiting/migration_disabled are used to encode the corresponding SCX_OPS_ flags into static_keys. These flags aren't hot enough for static_key usage to make any meaningful difference and are made static_keys mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_keys and test the ops flags directly. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09sched_ext: Indentation updatesTejun Heo
Purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-08sched_ext: Merge branch 'for-6.15-fixes' into for-6.16Tejun Heo
Pull for-6.15-fixes to receive: e776b26e3701 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings") which conflicts with: 1a7ff7216c8b ("sched_ext: Drop "ops" from scx_ops_enable_state and friends") The former removes code updated by the latter. Resolved by removing the updated section. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecationTejun Heo
SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup weight support warnings. Now that the warnings are removed, the flag doesn't do anything. Mark it for deprecation and remove its usage from scx_flatcg. v2: Actually include the scx_flatcg update. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-04-08sched_ext: Remove cpu.weight / cpu.idle unimplemented warningsTejun Heo
sched_ext generates warnings when cpu.weight / cpu.idle are set to non-default values if the BPF scheduler doesn't implement weight support. These warnings don't provide much value while adding constant annoyance. A BPF scheduler may not implement any particular behavior and there's nothing particularly special about missing cgroup weight support. Drop the warnings. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08sched_ext: Use kvzalloc for large exit_dump allocationBreno Leitao
Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which can require large contiguous memory depending on the implementation. This change prevents allocation failures by allowing the system to fall back to vmalloc when contiguous memory allocation fails. Since this buffer is only used for debugging purposes, physical memory contiguity is not required, making vmalloc a suitable alternative. Cc: stable@vger.kernel.org Fixes: 07814a9439a3b0 ("sched_ext: Print debug dump after an error exit") Suggested-by: Rik van Riel <riel@surriel.com> Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07sched_ext: idle: Introduce scx_bpf_select_cpu_and()Andrea Righi
Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply the built-in idle CPU selection policy to a subset of allowed CPU. This new helper is basically an extension of scx_bpf_select_cpu_dfl(). However, when an idle CPU can't be found, it returns a negative value instead of @prev_cpu, aligning its behavior more closely with scx_bpf_pick_idle_cpu(). It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying the built-in selection logic. With this helper, BPF schedulers can apply the built-in idle CPU selection policy restricted to any arbitrary subset of CPUs. Example usage ============= Possible usage in ops.select_cpu(): s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) { const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr; s32 cpu; cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0); if (cpu >= 0) { scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); return cpu; } return prev_cpu; } Results ======= Load distribution on a 4 sockets, 4 cores per socket system, simulated using virtme-ng, running a modified version of scx_bpfland that uses scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs: $ vng --cpu 16,sockets=4,cores=4,threads=1 ... $ stress-ng -c 16 ... $ htop ... 0[ 0.0%] 8[||||||||||||||||||||||||100.0%] 1[ 0.0%] 9[||||||||||||||||||||||||100.0%] 2[ 0.0%] 10[||||||||||||||||||||||||100.0%] 3[ 0.0%] 11[||||||||||||||||||||||||100.0%] 4[ 0.0%] 12[||||||||||||||||||||||||100.0%] 5[ 0.0%] 13[||||||||||||||||||||||||100.0%] 6[ 0.0%] 14[||||||||||||||||||||||||100.0%] 7[ 0.0%] 15[||||||||||||||||||||||||100.0%] With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all the available CPUs. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()Andrea Righi
Modify scx_select_cpu_dfl() to take the allowed cpumask as an explicit argument, instead of implicitly using @p->cpus_ptr. This prepares for future changes where arbitrary cpumasks may be passed to the built-in idle CPU selection policy. This is a pure refactoring with no functional changes. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-04sched_ext: Drop "ops" from SCX_OPS_TASK_ITER_BATCHTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from SCX_OPS_TASK_ITER_BATCH. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-and-acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and ↵Tejun Heo
friends The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_exit(), scx_ops_error() and friendsTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_exit(), scx_ops_error() and friends. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friendsTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and ↵Tejun Heo
__scx_ops_enabled The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_enable_state and friendsTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_enable_state and friends. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>