summaryrefslogtreecommitdiff
path: root/kernel/sched/ext_idle.c
AgeCommit message (Collapse)Author
2025-06-03sched_ext: idle: Skip cross-node search with !CONFIG_NUMAAndrea Righi
In the idle CPU selection logic, attempting cross-node searches adds unnecessary complexity when CONFIG_NUMA is disabled. Since there's no meaningful concept of nodes in this case, simplify the logic by restricting the idle CPU search to the current node only. Fixes: 48849271e6611 ("sched_ext: idle: Per-node idle cpumasks") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-30sched_ext: idle: Properly handle invalid prev_cpu during idle selectionAndrea Righi
The default idle selection policy doesn't properly handle the case where @prev_cpu is not part of the task's allowed CPUs. In this situation, it may return an idle CPU that is not usable by the task, breaking the assumption that the returned CPU must always be within the allowed cpumask, causing inefficiencies or even stalls in certain cases. This issue can arise in the following cases: - The task's affinity may have changed by the time the function is invoked, especially now that the idle selection logic can be used from multiple contexts (i.e., BPF test_run call). - The BPF scheduler may provide a @prev_cpu that is not part of the allowed mask, either unintentionally or as a placement hint. In fact @prev_cpu may not necessarily refer to the CPU the task last ran on, but it can also be considered as a target CPU that the scheduler wishes to use for the task. Therefore, enforce the right behavior by always checking whether @prev_cpu is in the allowed mask, when using scx_bpf_select_cpu_and(), and it's also usable by the task (@p->cpus_ptr). If it is not, try to find a valid CPU nearby @prev_cpu, following the usual locality-aware fallback path (SMT, LLC, node, allowed CPUs). This ensures the returned CPU is always allowed, improving robustness to affinity changes and invalid scheduler hints, while preserving locality as much as possible. Fixes: a730e3f7a48bc ("sched_ext: idle: Consolidate default idle CPU selection kfuncs") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-22sched_ext: Call ops.update_idle() after updating builtin idle bitsTejun Heo
BPF schedulers that use both builtin CPU idle mechanism and ops.update_idle() may want to use the latter to create interlocking between ops.enqueue() and CPU idle transitions so that either ops.enqueue() sees the idle bit or ops.update_idle() sees the task queued somewhere. This can prevent race conditions where CPUs go idle while tasks are waiting in DSQs. For such interlocking to work, ops.update_idle() must be called after builtin CPU masks are updated. Relocate the invocation. Currently, there are no ordering requirements on transitions from idle and this relocation isn't expected to make meaningful differences in that direction. This also makes the ops.update_idle() behavior semantically consistent: any action performed in this callback should be able to override the builtin idle state, not the other way around. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-and-tested-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-05-21sched_ext: idle: Consolidate default idle CPU selection kfuncsAndrea Righi
There is no reason to restrict scx_bpf_select_cpu_dfl() invocations to ops.select_cpu() while allowing scx_bpf_select_cpu_and() to be used from multiple contexts, as both provide equivalent functionality, with the latter simply accepting an additional "allowed" cpumask. Therefore, unify the two APIs, enabling both kfuncs to be used from ops.select_cpu(), ops.enqueue(), and unlocked contexts (e.g., via BPF test_run). This allows schedulers to implement a consistent idle CPU selection policy and helps reduce code duplication. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-20sched_ext: idle: Allow scx_bpf_select_cpu_and() from unlocked contextAndrea Righi
Allow scx_bpf_select_cpu_and() to be used from an unlocked context, in addition to ops.enqueue() or ops.select_cpu(). This enables schedulers, including user-space ones, to implement a consistent idle CPU selection policy and helps reduce code duplication. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-20sched_ext: idle: Validate locking correctness in scx_bpf_select_cpu_and()Andrea Righi
Validate locking correctness when accessing p->nr_cpus_allowed and p->cpus_ptr inside scx_bpf_select_cpu_and(): if the rq lock is held, access is safe; otherwise, require that p->pi_lock is held. This allows to catch potential unsafe calls to scx_bpf_select_cpu_and(). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-14sched_ext: Add @sch to SCX_CALL_OP*()Tejun Heo
In preparation of hierarchical scheduling support, add @sch to scx_exit() and friends: - scx_exit/error() updated to take explicit @sch instead of assuming scx_root. - scx_kf_exit/error() added. These are to be used from kfuncs, don't take @sch and internally determine the scx_sched instance to abort. Currently, it's always scx_root but once multiple scheduler support is in place, it will be the scx_sched instance that invoked the kfunc. This simplifies many callsites and defers scx_sched lookup until error is triggered. - @sch is propagated to ops_cpu_valid() and ops_sanitize_err(). The CPU validity conditions in ops_cpu_valid() are factored into __cpu_valid() to implement kf_cpu_valid() which is the counterpart to scx_kf_exit/error(). - All users are converted. Most conversions are straightforward. check_rq_for_timeouts() and scx_softlockup() are updated to use explicit rcu_dereference*(scx_root) for safety as they may execute asynchronous to the exit path. scx_tick() is also updated to use rcu_dereference(). While not strictly necessary due to the preceding scx_enabled() test and IRQ disabled context, this removes the subtlety at no noticeable cost. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Add @sch to SCX_CALL_OP*()Tejun Heo
In preparation of hierarchical scheduling support, make SCX_CALL_OP*() take explicit @sch instead of assuming scx_root. As scx_root is still the only scheduler instance, this patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Clean up scx_root usagesTejun Heo
- Always cache scx_root into local variable sch before using. - Don't use scx_root if cached sch is available. - Wrap !sch test with unlikely(). - Pass @scx into scx_cgroup_init/exit(). No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-04-29sched_ext: Introduce scx_schedTejun Heo
To support multiple scheduler instances, collect some of the global variables that should be specific to a scheduler instance into the new struct scx_sched. scx_root is the root scheduler instance and points to a static instance of struct scx_sched. Except for an extra dereference through the scx_root pointer, this patch makes no functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-22Merge branch 'for-6.15-fixes' into for-6.16Tejun Heo
a11d6784d731 ("sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()") added a call to scx_ops_error() which was renamed to scx_error() in for-6.16. Fix it up.
2025-04-22sched_ext: Track currently locked rqAndrea Righi
Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks. While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors, see for example [1]. To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*(). This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed. [1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07sched_ext: idle: Introduce scx_bpf_select_cpu_and()Andrea Righi
Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply the built-in idle CPU selection policy to a subset of allowed CPU. This new helper is basically an extension of scx_bpf_select_cpu_dfl(). However, when an idle CPU can't be found, it returns a negative value instead of @prev_cpu, aligning its behavior more closely with scx_bpf_pick_idle_cpu(). It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying the built-in selection logic. With this helper, BPF schedulers can apply the built-in idle CPU selection policy restricted to any arbitrary subset of CPUs. Example usage ============= Possible usage in ops.select_cpu(): s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) { const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr; s32 cpu; cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0); if (cpu >= 0) { scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); return cpu; } return prev_cpu; } Results ======= Load distribution on a 4 sockets, 4 cores per socket system, simulated using virtme-ng, running a modified version of scx_bpfland that uses scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs: $ vng --cpu 16,sockets=4,cores=4,threads=1 ... $ stress-ng -c 16 ... $ htop ... 0[ 0.0%] 8[||||||||||||||||||||||||100.0%] 1[ 0.0%] 9[||||||||||||||||||||||||100.0%] 2[ 0.0%] 10[||||||||||||||||||||||||100.0%] 3[ 0.0%] 11[||||||||||||||||||||||||100.0%] 4[ 0.0%] 12[||||||||||||||||||||||||100.0%] 5[ 0.0%] 13[||||||||||||||||||||||||100.0%] 6[ 0.0%] 14[||||||||||||||||||||||||100.0%] 7[ 0.0%] 15[||||||||||||||||||||||||100.0%] With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all the available CPUs. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07sched_ext: idle: Accept an arbitrary cpumask in scx_select_cpu_dfl()Andrea Righi
Many scx schedulers implement their own hard or soft-affinity rules to support topology characteristics, such as heterogeneous architectures (e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on specific properties (e.g., running certain tasks only in a subset of CPUs). Currently, there is no mechanism that allows to use the built-in idle CPU selection policy to an arbitrary subset of CPUs. As a result, schedulers often implement their own idle CPU selection policies, which are typically similar to one another, leading to a lot of code duplication. To address this, modify scx_select_cpu_dfl() to accept an arbitrary cpumask, that can be used by the BPF schedulers to apply the existent built-in idle CPU selection policy to a subset of allowed CPUs. With this concept the idle CPU selection policy becomes the following: - always prioritize CPUs from fully idle SMT cores (if SMT is enabled), - select the same CPU if it's idle and in the allowed CPUs, - select an idle CPU within the same LLC, if the LLC cpumask is a subset of the allowed CPUs, - select an idle CPU within the same node, if the node cpumask is a subset of the allowed CPUs, - select an idle CPU within the allowed CPUs. This functionality will be exposed through a dedicated kfunc in a separate patch. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()Andrea Righi
Modify scx_select_cpu_dfl() to take the allowed cpumask as an explicit argument, instead of implicitly using @p->cpus_ptr. This prepares for future changes where arbitrary cpumasks may be passed to the built-in idle CPU selection policy. This is a pure refactoring with no functional changes. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07sched_ext: idle: Extend topology optimizations to all tasksAndrea Righi
The built-in idle selection policy, scx_select_cpu_dfl(), always prioritizes picking idle CPUs within the same LLC or NUMA node, but these optimizations are currently applied only when a task has no CPU affinity constraints. This is done primarily for efficiency, as it avoids the overhead of updating a cpumask every time we need to select an idle CPU (which can be costly in large SMP systems). However, this approach limits the effectiveness of the built-in idle policy and results in inconsistent behavior, as affinity-restricted tasks don't benefit from topology-aware optimizations. To address this, modify the policy to apply LLC and NUMA-aware optimizations even when a task is constrained to a subset of CPUs. We can still avoid updating the cpumasks by checking if the subset of LLC and node CPUs are contained in the subset of allowed CPUs usable by the task (which is true in most of the cases - for tasks that don't have affinity constratints). Moreover, use temporary local per-CPU cpumasks to determine the LLC and node subsets, minimizing potential overhead even on large SMP systems. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-04sched_ext: Drop "ops" from scx_ops_exit(), scx_ops_error() and friendsTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_exit(), scx_ops_error() and friends. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-03-26sched_ext: initialize built-in idle state before ops.init()Andrea Righi
A BPF scheduler may want to use the built-in idle cpumasks in ops.init() before the scheduler is fully initialized, either directly or through a BPF timer for example. However, this would result in an error, since the idle state has not been properly initialized yet. This can be easily verified by modifying scx_simple to call scx_bpf_get_idle_cpumask() in ops.init(): $ sudo scx_simple DEBUG DUMP =========================================================================== scx_simple[121] triggered exit kind 1024: runtime error (built-in idle tracking is disabled) ... Fix this by properly initializing the idle state before ops.init() is called. With this change applied: $ sudo scx_simple local=2 global=0 local=19 global=11 local=23 global=11 ... Fixes: d73249f88743d ("sched_ext: idle: Make idle static keys private") Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-26sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl()Tejun Heo
scx_select_cpu_dfl() has a meaningless conditional goto at the end. Remove it. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrea Righi <arighi@nvidia.com>
2025-03-26sched_ext: idle: Fix return code of scx_select_cpu_dfl()Andrea Righi
Return -EBUSY when using %SCX_PICK_IDLE_CORE with scx_select_cpu_dfl() if a fully idle SMT core cannot be found, instead of falling back to @prev_cpu, which is not a fully idle SMT core in this case. Fixes: c414c2171cd9e ("sched_ext: idle: Honor idle flags in the built-in idle selection policy") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14sched_ext: idle: Refactor scx_select_cpu_dfl()Andrea Righi
Make scx_select_cpu_dfl() more consistent with the other idle-related APIs by returning a negative value when an idle CPU isn't found. No functional changes, this is purely a refactoring. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14sched_ext: idle: Honor idle flags in the built-in idle selection policyAndrea Righi
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(), to enforce strict selection criteria, such as selecting an idle CPU strictly within @prev_cpu's node or choosing only a fully idle SMT core. This functionality will be exposed through a dedicated kfunc in a separate patch. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03sched_ext: Merge branch 'for-6.14-fixes' into for-6.15Tejun Heo
Pull for-6.14-fixes to receive: 9360dfe4cbd6 ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()") which conflicts with: 337d1b354a29 ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behaviorAndrea Righi
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node() should always return a CPU from the specified node, regardless of its idle state. Also clarify this logic in the function documentation. Fixes: 01059219b0cfd ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-18sched_ext: idle: Introduce node-aware idle cpu kfunc helpersAndrea Righi
Introduce a new kfunc to retrieve the node associated to a CPU: int scx_bpf_cpu_node(s32 cpu) Add the following kfuncs to provide BPF schedulers direct access to per-node idle cpumasks information: const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) Moreover, trigger an scx error when any of the non-node aware idle CPU kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled. Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16sched_ext: idle: Per-node idle cpumasksAndrea Righi
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask. Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems. The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node. The split of multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag. By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior. NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks. = Test = Hardware: - System: DGX B200 - CPUs: 224 SMT threads (112 physical cores) - Processor: INTEL(R) XEON(R) PLATINUM 8570 - 2 NUMA nodes Scheduler: - scx_simple [1] (so that we can focus at the built-in idle selection policy and not at the scheduling policy itself) Test: - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs: avg time | stdev ---------+------ before: 52.431s | 2.895 after: 50.342s | 2.895 = Conclusion = Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case. The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%. On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple). Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test. [1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODEAndrea Righi
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks. This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior. Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16sched_ext: idle: Make idle static keys privateAndrea Righi
Make all the static keys used by the idle CPU selection policy private to ext_idle.c. This avoids unnecessary exposure in headers and improves code encapsulation. Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27sched_ext: Move built-in idle CPU selection policy to a separate fileAndrea Righi
As ext.c is becoming quite large, move the idle CPU selection policy to separate files (ext_idle.c / ext_idle.h) for better code readability. Moreover, group together all the idle CPU selection kfunc's to the same btf_kfunc_id_set block. No functional changes, this is purely code reorganization. Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>