Age | Commit message | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Fix and improve BTF deduplication of identical BTF types (Alan
Maguire and Andrii Nakryiko)
- Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and
Alexis Lothoré)
- Support load-acquire and store-release instructions in BPF JIT on
riscv64 (Andrea Parri)
- Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton
Protopopov)
- Streamline allowed helpers across program types (Feng Yang)
- Support atomic update for hashtab of BPF maps (Hou Tao)
- Implement json output for BPF helpers (Ihor Solodrai)
- Several s390 JIT fixes (Ilya Leoshkevich)
- Various sockmap fixes (Jiayuan Chen)
- Support mmap of vmlinux BTF data (Lorenz Bauer)
- Support BPF rbtree traversal and list peeking (Martin KaFai Lau)
- Tests for sockmap/sockhash redirection (Michal Luczaj)
- Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)
- Add support for dma-buf iterators in BPF (T.J. Mercier)
- The verifier support for __bpf_trap() (Yonghong Song)
* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
bpf, arm64: Remove unused-but-set function and variable.
selftests/bpf: Add tests with stack ptr register in conditional jmp
bpf: Do not include stack ptr register in precision backtracking bookkeeping
selftests/bpf: enable many-args tests for arm64
bpf, arm64: Support up to 12 function arguments
bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
bpf: Avoid __bpf_prog_ret0_warn when jit fails
bpftool: Add support for custom BTF path in prog load/loadall
selftests/bpf: Add unit tests with __bpf_trap() kfunc
bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
bpf: Remove special_kfunc_set from verifier
selftests/bpf: Add test for open coded dmabuf_iter
selftests/bpf: Add test for dmabuf_iter
bpf: Add open coded dmabuf iterator
bpf: Add dmabuf iterator
dma-buf: Rename debugfs symbols
bpf: Fix error return value in bpf_copy_from_user_dynptr
libbpf: Use mmap to parse vmlinux BTF from sysfs
selftests: bpf: Add a test for mmapable vmlinux BTF
btf: Allow mmap of vmlinux btf
...
|
|
Relocate scx_kf_allowed_if_unlocked() so it can be used from other
source files (e.g., ext_idle.c).
No functional change.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Naked scx_root dereferences are being used as temporary markers to indicate
that they need to be updated to point to the right scheduler instance.
Explain the situation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
In preparation of hierarchical scheduling support, add @sch to scx_exit()
and friends:
- scx_exit/error() updated to take explicit @sch instead of assuming
scx_root.
- scx_kf_exit/error() added. These are to be used from kfuncs, don't take
@sch and internally determine the scx_sched instance to abort. Currently,
it's always scx_root but once multiple scheduler support is in place, it
will be the scx_sched instance that invoked the kfunc. This simplifies
many callsites and defers scx_sched lookup until error is triggered.
- @sch is propagated to ops_cpu_valid() and ops_sanitize_err(). The CPU
validity conditions in ops_cpu_valid() are factored into __cpu_valid() to
implement kf_cpu_valid() which is the counterpart to scx_kf_exit/error().
- All users are converted. Most conversions are straightforward.
check_rq_for_timeouts() and scx_softlockup() are updated to use explicit
rcu_dereference*(scx_root) for safety as they may execute asynchronous to
the exit path. scx_tick() is also updated to use rcu_dereference(). While
not strictly necessary due to the preceding scx_enabled() test and IRQ
disabled context, this removes the subtlety at no noticeable cost.
No behavior changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
__scx_exit() is the base exit implementation and there are three wrappers on
top of it - scx_exit(), __scx_error() and scx_error(). This is more
confusing than helpful especially given that there are only a couple users
of scx_exit() and __scx_error(). To simplify the situation:
- Make __scx_exit() take va_list and rename it to scx_vexit(). This is to
ease implementing more complex extensions on top.
- Make scx_exit() a varargs wrapper around scx_vexit(). scx_exit() now
takes both @kind and @exit_code. See the sketch after this list.
- Convert existing scx_exit() and __scx_error() users to use the new
scx_exit().
- scx_error() remains unchanged.
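As a rough stand-alone illustration of the scx_vexit()/scx_exit() split
described above (the demo_* names are purely illustrative, not the kernel
code), the pattern is a va_list core with a thin varargs wrapper on top:

#include <stdarg.h>
#include <stdio.h>

/* va_list core, analogous to scx_vexit(): richer front-ends funnel into it */
static void demo_vexit(int kind, long exit_code, const char *fmt, va_list args)
{
    printf("exit: kind=%d code=%ld reason: ", kind, exit_code);
    vprintf(fmt, args);
    putchar('\n');
}

/* varargs wrapper, analogous to scx_exit(): what ordinary callers use */
static void demo_exit(int kind, long exit_code, const char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    demo_vexit(kind, exit_code, fmt, args);
    va_end(args);
}

int main(void)
{
    demo_exit(2, 0, "scheduler unloading (%s)", "example");
    return 0;
}

The point of the split is that any future, more elaborate exit wrapper only
needs to build a va_list once and hand it to the same core.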
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
In preparation of hierarchical scheduling support, make SCX_CALL_OP*() take
explicit @sch instead of assuming scx_root. As scx_root is still the only
scheduler instance, this patch doesn't make any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
- Always cache scx_root into local variable sch before using.
- Don't use scx_root if cached sch is available.
- Wrap !sch test with unlikely().
- Pass @sch into scx_cgroup_init/exit().
No behavior changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
task_storage_{get,delete} has been moved to bpf_base_func_proto.
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20250506061434.94277-3-yangfeng59949@163.com
|
|
To receive 428dc9fc0873 ("sched_ext: bpf_iter_scx_dsq_new() should always
initialize iterator") which conflicts with cdf5a6faa8cf ("sched_ext: Move
dsq_hash into scx_sched"). The conflict is a simple context conflict which
can be resolved by taking the changes from both commits in the right order.
|
|
BPF programs may call next() and destroy() on BPF iterators even after new()
returns an error value (e.g. bpf_for_each() macro ignores error returns from
new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized
state after an error return causing bpf_iter_scx_dsq_next() to dereference
garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that
next() and destroy() become noops.
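The defensive pattern boils down to something like the following stand-alone
sketch (the demo_* names are illustrative, not the ext.c code): new() zeroes
the iterator before doing anything that can fail, so next() on a failed
iterator simply returns NULL.

#include <stddef.h>
#include <stdio.h>

struct demo_iter { int *cur; int *end; };

/* Clear first so next()/destroy() are safe even if init fails and the
 * caller ignores the error return.
 */
static int demo_iter_new(struct demo_iter *it, int *arr, size_t n)
{
    it->cur = NULL;
    it->end = NULL;
    if (!arr)
        return -1;          /* error: iterator stays cleared */
    it->cur = arr;
    it->end = arr + n;
    return 0;
}

static int *demo_iter_next(struct demo_iter *it)
{
    if (!it->cur || it->cur == it->end)
        return NULL;        /* no-op on uninitialized or exhausted iterator */
    return it->cur++;
}

int main(void)
{
    struct demo_iter it;
    int vals[] = { 1, 2, 3 };
    int *p;

    demo_iter_new(&it, NULL, 0);        /* error deliberately ignored */
    while ((p = demo_iter_next(&it)))   /* safe: behaves as an empty iterator */
        printf("%d\n", *p);

    demo_iter_new(&it, vals, 3);
    while ((p = demo_iter_next(&it)))
        printf("%d\n", *p);
    return 0;
}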
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 650ba21b131e ("sched_ext: Implement DSQ iterator")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
A sched_ext scheduler may trigger __scx_exit() from a BPF timer
callback, where scx_root may not be safely dereferenced.
This can lead to a NULL pointer dereference as shown below (triggered by
scx_tickless):
BUG: kernel NULL pointer dereference, address: 0000000000000330
...
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full)
RIP: 0010:__scx_exit+0x2b/0x190
...
Call Trace:
<IRQ>
scx_bpf_get_idle_smtmask+0x59/0x80
bpf_prog_8320d4217989178c_dispatch_all_cpus+0x35/0x1b6
...
bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264
bpf_timer_cb+0x7a/0x140
__hrtimer_run_queues+0x1f9/0x3a0
hrtimer_run_softirq+0x8c/0xd0
handle_softirqs+0xd3/0x3d0
__irq_exit_rcu+0x9a/0xc0
irq_exit_rcu+0xe/0x20
Fix this by checking for a valid scx_root and adding proper RCU
protection.
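The shape of the fix is roughly the following (a non-buildable sketch; the
exact code in __scx_exit() may differ):

    struct scx_sched *sch;

    rcu_read_lock();
    sch = rcu_dereference(scx_root);
    if (unlikely(!sch)) {
        rcu_read_unlock();
        return;
    }
    /* record the exit kind and reason against @sch */
    rcu_read_unlock();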
Fixes: 48e1267773866 ("sched_ext: Introduce scx_sched")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Using a DSQ iterator from a timer callback can trigger the following
lockdep splat when accessing scx_root:
=============================
WARNING: suspicious RCU usage
6.14.0-virtme #1 Not tainted
-----------------------------
kernel/sched/ext.c:6907 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
no locks held by swapper/0/0.
stack backtrace:
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full)
Sched_ext: tickless (enabled+all)
Call Trace:
<IRQ>
dump_stack_lvl+0x6f/0xb0
lockdep_rcu_suspicious.cold+0x4e/0xa3
bpf_iter_scx_dsq_new+0xb1/0xd0
bpf_prog_63f4fd1bccc101e7_dispatch_cpu+0x3e/0x156
bpf_prog_8320d4217989178c_dispatch_all_cpus+0x153/0x1b6
bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264
? hrtimer_run_softirq+0x4f/0xd0
bpf_timer_cb+0x7a/0x140
__hrtimer_run_queues+0x1f9/0x3a0
hrtimer_run_softirq+0x8c/0xd0
handle_softirqs+0xd3/0x3d0
__irq_exit_rcu+0x9a/0xc0
irq_exit_rcu+0xe/0x20
sysvec_apic_timer_interrupt+0x73/0x80
Add a proper dereference check to explicitly validate RCU-safe access to
scx_root from rcu_read_lock() contexts and also from contexts that hold
rcu_read_lock_bh(), such as timer callbacks.
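A sketch of the kind of check being added; the helper name here is made up
and the real definition in ext.c may differ:

/*
 * rcu_dereference_check() already accepts regular rcu_read_lock() sections;
 * the extra condition also admits BH-disabled contexts such as timer
 * callbacks, where only rcu_read_lock_bh() is effectively held.
 */
#define scx_root_rcu_deref() \
    rcu_dereference_check(scx_root, rcu_read_lock_bh_held())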
Fixes: cdf5a6faa8cf0 ("sched_ext: Move dsq_hash into scx_sched")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
With the global states and disable machinery moved into scx_sched,
scx_disable_workfn() can only be scheduled and run for the specific
scheduler instance. This makes it impossible for scx_disable_workfn() to see
SCX_EXIT_NONE. Turn that condition into WARN_ON_ONCE().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
Because disable can be triggered from any place and the scheduler cannot be
trusted, disable path uses an irq_work to bounce and a kthread_work which is
executed on an RT helper kthread to perform disable. These must be per
scheduler instance to guarantee forward progress. Move them into scx_sched.
- If an scx_sched is accessible, its helper kthread is always valid making
the `helper` check in schedule_scx_disable_work() unnecessary. As the
function becomes trivial after the removal of the test, inline it.
- scx_create_rt_helper() has only one user - creation of the disable helper
kthread. Inline it into scx_alloc_and_add_sched().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
The event counters are going to become per scheduler instance. Move
event_stats_cpu into scx_sched.
- [__]scx_add_event() are updated to take @sch. While at it, add missing
parentheses around @cnt expansions (see the example below).
- scx_read_events() is updated to take @sch.
- scx_bpf_events() accesses scx_root under RCU read lock.
v2: - Replace stale scx_bpf_get_event_stat() reference in a comment with
scx_bpf_events().
- Trivial goto label rename.
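On the parenthesization point, here is a small stand-alone example of why
each @cnt expansion needs its own parentheses (the macros are illustrative,
not the actual scx event macros):

#include <stdio.h>

#define ADD_EVENT_BAD(total, cnt)   ((total) += cnt * 2)    /* cnt unparenthesized */
#define ADD_EVENT_GOOD(total, cnt)  ((total) += (cnt) * 2)

int main(void)
{
    long bad = 0, good = 0;

    ADD_EVENT_BAD(bad, 1 + 1);      /* expands to bad += 1 + 1 * 2, adds 3 */
    ADD_EVENT_GOOD(good, 1 + 1);    /* expands to good += (1 + 1) * 2, adds 4 */
    printf("bad=%ld good=%ld\n", bad, good);
    return 0;
}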
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
In preparation of moving event_stats_cpu into scx_sched, factor
scx_read_events() out of scx_bpf_events() and update the in-kernel users -
scx_attr_events_show() and scx_dump_state() - to use scx_read_events()
instead of scx_bpf_events(). No observable behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
In preparation of moving event_stats_cpu into scx_sched, move scx_event_stats
definitions above scx_sched definition. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
Global DSQs are going to become per scheduler instance. Move global_dsqs
into scx_sched. find_global_dsq() already takes a task_struct pointer as an
argument and should later be able to determine the scx_sched to use from
that. For now, assume scx_root.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
User DSQs are going to become per scheduler instance. Move dsq_hash into
scx_sched. This shifts the code that assumes scx_root to be the only
scx_sched instance up the call stack but doesn't remove them yet.
v2: Add missing rcu_read_lock() in scx_bpf_destroy_dsq() as per Andrea.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
More will be moved into scx_sched. Factor out the allocation and kobject
addition path into scx_alloc_and_add_sched().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
create_dsq() is only used by scx_bpf_create_dsq() and the separation gets in
the way of making dsq_hash per scx_sched. Inline it into
scx_bpf_create_dsq(). While at it, add unlikely() around
SCX_DSQ_FLAG_BUILTIN test.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
To prepare for supporting multiple schedulers, make scx_sched allocated
dynamically. scx_sched->kobj is now an embedded field and the kobj's
lifetime determines the lifetime of the containing scx_sched.
- Enable path is updated so that kobj init and addition are performed later.
- scx_sched freeing is initiated in scx_kobj_release() and also goes through
an rcu_work so that scx_root can be accessed from an unsynchronized path -
scx_disable(). See the sketch below.
- sched_ext_ops->priv is added and used to point to scx_sched instance
created for the ops instance. This is used by bpf_scx_unreg() to determine
the scx_sched instance to disable and put.
No behavior changes intended.
v2: Andrea reported kernel oops due to bpf_scx_unreg() trying to deref NULL
scx_root after scheduler init failure. sched_ext_ops->priv added so that
bpf_scx_unreg() can always find the scx_sched instance to unregister
even if it failed early during init.
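Roughly, the release path described above looks like the sketch below (the
rcu_work field and the scx_sched_free_rcu_work() name are assumptions; the
actual code may differ):

static void scx_sched_free_rcu_work(struct work_struct *work)
{
    struct scx_sched *sch = container_of(to_rcu_work(work),
                                         struct scx_sched, rcu_work);

    /* tear down per-scheduler resources and free @sch */
    kfree(sch);
}

static void scx_kobj_release(struct kobject *kobj)
{
    struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj);

    /*
     * Don't free synchronously; bounce through an rcu_work so that paths
     * which read scx_root without synchronization remain safe.
     */
    INIT_RCU_WORK(&sch->rcu_work, scx_sched_free_rcu_work);
    queue_rcu_work(system_unbound_wq, &sch->rcu_work);
}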
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
SCX_HAS_OP() tests scx_root->has_op bitmap. The bitmap is currently in a
statically allocated struct scx_sched and initialized while loading the BPF
scheduler and cleared while unloading, and thus can be tested anytime.
However, scx_root will be switched to dynamic allocation and thus won't
always be dereferenceable.
Most usages of SCX_HAS_OP() are already protected by scx_enabled() either
directly or indirectly (e.g. through a task which is on SCX). However, there
are a couple places that could try to dereference NULL scx_root. Update them
so that scx_root is guaranteed to be valid before SCX_HAS_OP() is called.
- In handle_hotplug(), test whether scx_root is NULL before doing anything
else. This is safe because scx_root updates will be protected by
cpus_read_lock().
- In scx_tg_offline(), test scx_cgroup_enabled before invoking SCX_HAS_OP(),
which should guarantee that scx_root won't turn NULL. This is also in line
with other cgroup operations. As the code path is synchronized against
scx_cgroup_init/exit() through scx_cgroup_rwsem, this shouldn't cause any
behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
To support multiple scheduler instances, collect some of the global
variables that should be specific to a scheduler instance into the new
struct scx_sched. scx_root is the root scheduler instance and points to a
static instance of struct scx_sched. Except for an extra dereference through
the scx_root pointer, this patch makes no functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
|
|
To receive e38be1c7647c ("sched_ext: Fix rq lock state in hotplug ops") to
avoid conflicts with scx_sched related patches pending for for-6.16.
|
|
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume
that the rq involved in the operation is locked, which is not the case
during hotplug, triggering the following warning:
WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340
Fix by not tracking the target rq as locked in the context of
ops.cpu_online() and ops.cpu_offline().
Fixes: 18853ba782bef ("sched_ext: Track currently locked rq")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Some kfuncs specific to the idle CPU selection policy are registered in
both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though
they should only be defined in the latter.
Remove the duplicates from scx_kfunc_ids_any.
Fixes: 337d1b354a297 ("sched_ext: Move built-in idle CPU selection policy to a separate file")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The ops.running() and ops.stopping() callbacks can be invoked from a CPU
other than the one the task is assigned to, particularly when a task
property is changed, as both scx_next_task_scx() and dequeue_task_scx() may
run on CPUs different from the task's target CPU.
This behavior can lead to confusion or incorrect assumptions if not
properly clarified, potentially resulting in bugs (see [1]).
Therefore, update the documentation to clarify this aspect and advise
users to use scx_bpf_task_cpu() to determine the actual CPU the task
will run on or was running on.
[1] https://github.com/sched-ext/scx/pull/1728
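For example, an ops.running() implementation should query the CPU explicitly
(update_cpu_stats() below is a hypothetical scheduler-side helper):

void BPF_STRUCT_OPS(foo_running, struct task_struct *p)
{
    /*
     * This callback may fire on a CPU other than the one @p runs on,
     * so don't use bpf_get_smp_processor_id() for per-task accounting.
     */
    s32 cpu = scx_bpf_task_cpu(p);

    update_cpu_stats(cpu, p);
}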
Cc: Jake Hillion <jake@hillion.co.uk>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
a11d6784d731 ("sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()")
added a call to scx_ops_error() which was renamed to scx_error() in
for-6.16. Fix it up.
|
|
scx_bpf_cpuperf_set() can be used to set a performance target level on
any CPU. However, it doesn't correctly acquire the corresponding rq
lock, which may lead to unsafe behavior and trigger the following
warning, due to the lockdep_assert_rq_held() check:
[ 51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
...
[ 51.713836] Call trace:
[ 51.713837] scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
[ 51.713839] bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
[ 51.713841] bpf__sched_ext_ops_init+0x54/0x8c
[ 51.713843] scx_ops_enable.constprop.0+0x2c0/0x10f0
[ 51.713845] bpf_scx_reg+0x18/0x30
[ 51.713847] bpf_struct_ops_link_create+0x154/0x1b0
[ 51.713849] __sys_bpf+0x1934/0x22a0
Fix by properly acquiring the rq lock when possible or raising an error
if we try to operate on a CPU that is not the one currently locked.
Fixes: d86adb4fc0655 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Some kfuncs provided by sched_ext may need to operate on a struct rq,
but they can be invoked from various contexts, specifically, different
scx callbacks.
While some of these callbacks are invoked with a particular rq already
locked, others are not. This makes it impossible for a kfunc to reliably
determine whether it's safe to access a given rq, triggering potential
bugs or unsafe behaviors, see for example [1].
To address this, track the currently locked rq whenever a sched_ext
callback is invoked via SCX_CALL_OP*().
This allows kfuncs that need to operate on an arbitrary rq to retrieve
the currently locked one and apply the appropriate action as needed.
[1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Add a helper that refills a task with the default slice and updates the
event statistics accordingly.
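A sketch of what such a helper might look like (the name and exact body are
assumptions, not necessarily the ext.c code):

static void refill_task_slice_dfl(struct task_struct *p)
{
    p->scx.slice = SCX_SLICE_DFL;
    __scx_add_event(SCX_EV_REFILL_SLICE_DFL, 1);
}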
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs
when a task is enqueued, which is not accurate. What it actually counts
is refills with the default slice, which can occur either at enqueue or
at pick_task time. Let's rename the event to SCX_EV_REFILL_SLICE_DFL.
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_has_op is used to encode which ops are implemented by the BPF scheduler
into an array of static_keys. While this saves a bit of branching overhead,
that is unlikely to be noticeable compared to the overall cost. As the
global static_keys can't work with the planned hierarchical multiple
scheduler support, replace the static_key array with a bitmap.
In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.
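The general shape of the change is below (the SCX_OPI_* indices and
do_enqueue_callback() are illustrative names, not the actual ext.c
definitions); a DECLARE_BITMAP()/set_bit()/test_bit() scheme replaces the
per-op static_keys:

    DECLARE_BITMAP(has_op, SCX_OPI_END);    /* one bit per sched_ext_ops callback */

    /* while loading the BPF scheduler */
    if (ops->enqueue)
        set_bit(SCX_OPI_ENQUEUE, has_op);

    /* at runtime, a plain bitmap test instead of a static_branch_*() check */
    if (test_bit(SCX_OPI_ENQUEUE, has_op))
        do_enqueue_callback();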
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_ops_allow_queued_wakeup is used to encode SCX_OPS_ALLOW_QUEUED_WAKEUP
into a static_key. The test is gated behind scx_enabled(), and, even when
sched_ext is enabled, it is unlikely that the static_key usage makes any
meaningful difference. It is made to use a static_key mostly because there
was no reason not to. However, global static_keys can't work with the
planned hierarchical multiple scheduler support. Remove the static_key and
instead test SCX_OPS_ALLOW_QUEUED_WAKEUP directly.
In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_ops_cpu_preempt is used to encode whether ops.cpu_acquire/release() are
implemented into a static_key. These tests aren't hot enough for static_key
usage to make any meaningful difference and are made to use a static_key
mostly because there was no reason not to. However, global static_keys can't
work with the planned hierarchical multiple scheduler support. Remove the
static_key and instead use an internal ops flag SCX_OPS_HAS_CPU_PREEMPT to
record and test whether ops.cpu_acquire/release() are implemented.
In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_ops_enq_last/exiting/migration_disabled are used to encode the
corresponding SCX_OPS_ flags into static_keys. These flags aren't hot enough
for static_key usage to make any meaningful difference and are made
static_keys mostly because there was no reason not to. However, global
static_keys can't work with the planned hierarchical multiple scheduler
support. Remove the static_keys and test the ops flags directly.
In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
Purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
Pull for-6.15-fixes to receive:
e776b26e3701 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings")
which conflicts with:
1a7ff7216c8b ("sched_ext: Drop "ops" from scx_ops_enable_state and friends")
The former removes code updated by the latter. Resolved by removing the
updated section.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup
weight support warnings. Now that the warnings are removed, the flag doesn't
do anything. Mark it for deprecation and remove its usage from scx_flatcg.
v2: Actually include the scx_flatcg update.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
sched_ext generates warnings when cpu.weight / cpu.idle are set to
non-default values if the BPF scheduler doesn't implement weight support.
These warnings don't provide much value while adding constant annoyance. A
BPF scheduler may not implement any particular behavior and there's nothing
particularly special about missing cgroup weight support. Drop the warnings.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which
can require large contiguous memory depending on the implementation.
This change prevents allocation failures by allowing the system to fall
back to vmalloc when contiguous memory allocation fails.
Since this buffer is only used for debugging purposes, physical memory
contiguity is not required, making vmalloc a suitable alternative.
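The change boils down to the following pairing (the ei->dump field name is an
assumption about the exit-info structure):

    ei->dump = kvzalloc(exit_dump_len, GFP_KERNEL);  /* falls back to vmalloc for large sizes */
    if (!ei->dump)
        goto err;

    /* later, on teardown */
    kvfree(ei->dump);  /* kvfree() handles both kmalloc- and vmalloc-backed memory */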
Cc: stable@vger.kernel.org
Fixes: 07814a9439a3b0 ("sched_ext: Print debug dump after an error exit")
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply
the built-in idle CPU selection policy to a subset of the allowed CPUs.
This new helper is basically an extension of scx_bpf_select_cpu_dfl().
However, when an idle CPU can't be found, it returns a negative value
instead of @prev_cpu, aligning its behavior more closely with
scx_bpf_pick_idle_cpu().
It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce
strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to
request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying
the built-in selection logic.
With this helper, BPF schedulers can apply the built-in idle CPU
selection policy restricted to any arbitrary subset of CPUs.
Example usage
=============
Possible usage in ops.select_cpu():
s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
{
const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
s32 cpu;
cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
if (cpu >= 0) {
scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
return cpu;
}
return prev_cpu;
}
Results
=======
Load distribution on a 4-socket system with 4 cores per socket, simulated
using virtme-ng, running a modified version of scx_bpfland that uses
scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs:
$ vng --cpu 16,sockets=4,cores=4,threads=1
...
$ stress-ng -c 16
...
$ htop
...
0[ 0.0%] 8[||||||||||||||||||||||||100.0%]
1[ 0.0%] 9[||||||||||||||||||||||||100.0%]
2[ 0.0%] 10[||||||||||||||||||||||||100.0%]
3[ 0.0%] 11[||||||||||||||||||||||||100.0%]
4[ 0.0%] 12[||||||||||||||||||||||||100.0%]
5[ 0.0%] 13[||||||||||||||||||||||||100.0%]
6[ 0.0%] 14[||||||||||||||||||||||||100.0%]
7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across
all the available CPUs.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Modify scx_select_cpu_dfl() to take the allowed cpumask as an explicit
argument, instead of implicitly using @p->cpus_ptr.
This prepares for future changes where arbitrary cpumasks may be passed
to the built-in idle CPU selection policy.
This is a pure refactoring with no functional changes.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.
Drop "ops" from SCX_OPS_TASK_ITER_BATCH.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-acked-by: Andrea Righi <arighi@nvidia.com>
|
|
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.
Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.
Drop "ops" from scx_ops_exit(), scx_ops_error() and friends.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.
Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends. Update
scx_show_state.py accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.
Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled.
Update scx_show_state.py accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.
Drop "ops" from scx_ops_enable_state and friends. Update scx_show_state.py
accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|