2025-04-29  sched_ext: Avoid NULL scx_root deref through SCX_HAS_OP()  (Tejun Heo)
SCX_HAS_OP() tests the scx_root->has_op bitmap. The bitmap currently lives in a statically allocated struct scx_sched, is initialized while loading the BPF scheduler and cleared while unloading, and thus can be tested at any time. However, scx_root will be switched to dynamic allocation and thus won't always be dereferenceable.

Most usages of SCX_HAS_OP() are already protected by scx_enabled(), either directly or indirectly (e.g. through a task which is on SCX). However, a couple of places could try to dereference a NULL scx_root. Update them so that scx_root is guaranteed to be valid before SCX_HAS_OP() is called.

- In handle_hotplug(), test whether scx_root is NULL before doing anything else. This is safe because scx_root updates will be protected by cpus_read_lock().

- In scx_tg_offline(), test scx_cgroup_enabled before invoking SCX_HAS_OP(), which should guarantee that scx_root won't turn NULL. This is also in line with the other cgroup operations. As the code path is synchronized against scx_cgroup_init/exit() through scx_cgroup_rwsem, this shouldn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
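A minimal sketch of the resulting guard ordering in handle_hotplug() (illustrative only, not the exact diff):

        static void handle_hotplug(struct rq *rq, bool online)
        {
                /*
                 * scx_root updates are serialized by cpus_read_lock(), which
                 * the hotplug path holds, so one NULL check up front suffices.
                 */
                if (!scx_root)
                        return;

                /* From here on SCX_HAS_OP() may dereference scx_root safely. */
                if (online && SCX_HAS_OP(cpu_online))
                        /* ... invoke ops.cpu_online() ... */;
        }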
2025-04-29  sched_ext: Introduce scx_sched  (Tejun Heo)
To support multiple scheduler instances, collect some of the global variables that should be specific to a scheduler instance into the new struct scx_sched. scx_root is the root scheduler instance and points to a static instance of struct scx_sched. Except for an extra dereference through the scx_root pointer, this patch makes no functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
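Illustratively, the collected state looks roughly like this (member list abbreviated and partly assumed):

        struct scx_sched {
                struct sched_ext_ops    ops;
                DECLARE_BITMAP(has_op, SCX_OPI_END);    /* which ops are implemented */
                struct scx_exit_info    *exit_info;
                /* ... more per-instance state ... */
        };

        static struct scx_sched __scx_root;             /* still static at this point */
        static struct scx_sched *scx_root = &__scx_root;

Routing all accesses through the scx_root pointer is what later allows the instance to be allocated dynamically.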
2025-04-29  Merge branch 'for-6.15-fixes' into for-6.16  (Tejun Heo)
To receive e38be1c7647c ("sched_ext: Fix rq lock state in hotplug ops") to avoid conflicts with scx_sched related patches pending for for-6.16.
2025-04-29  sched_ext: Fix rq lock state in hotplug ops  (Andrea Righi)
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume that the rq involved in the operation is locked, which is not the case during hotplug, triggering the following warning:

  WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340

Fix by not tracking the target rq as locked in the context of ops.cpu_online() and ops.cpu_offline().

Fixes: 18853ba782bef ("sched_ext: Track currently locked rq")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-25  sched_ext: Remove duplicate BTF_ID_FLAGS definitions  (Andrea Righi)
Some kfuncs specific to the idle CPU selection policy are registered in both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though they should only be defined in the latter. Remove the duplicates from scx_kfunc_ids_any. Fixes: 337d1b354a297 ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-23  sched_ext: Clarify CPU context for running/stopping callbacks  (Andrea Righi)
The ops.running() and ops.stopping() callbacks can be invoked from a CPU other than the one the task is assigned to, particularly when a task property is changed, as both scx_next_task_scx() and dequeue_task_scx() may run on CPUs different from the task's target CPU. This behavior can lead to confusion or incorrect assumptions if not properly clarified, potentially resulting in bugs (see [1]). Therefore, update the documentation to clarify this aspect and advise users to use scx_bpf_task_cpu() to determine the actual CPU the task will run on or was running on. [1] https://github.com/sched-ext/scx/pull/1728 Cc: Jake Hillion <jake@hillion.co.uk> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22  Merge branch 'for-6.15-fixes' into for-6.16  (Tejun Heo)
a11d6784d731 ("sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()") added a call to scx_ops_error() which was renamed to scx_error() in for-6.16. Fix it up.
2025-04-22  sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()  (Andrea Righi)
scx_bpf_cpuperf_set() can be used to set a performance target level on any CPU. However, it doesn't correctly acquire the corresponding rq lock, which may lead to unsafe behavior and trigger the following warning, due to the lockdep_assert_rq_held() check:

  [   51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
  ...
  [   51.713836] Call trace:
  [   51.713837]  scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
  [   51.713839]  bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
  [   51.713841]  bpf__sched_ext_ops_init+0x54/0x8c
  [   51.713843]  scx_ops_enable.constprop.0+0x2c0/0x10f0
  [   51.713845]  bpf_scx_reg+0x18/0x30
  [   51.713847]  bpf_struct_ops_link_create+0x154/0x1b0
  [   51.713849]  __sys_bpf+0x1934/0x22a0

Fix by properly acquiring the rq lock when possible or raising an error if we try to operate on a CPU that is not the one currently locked.

Fixes: d86adb4fc0655 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
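Simplified, the rule the fix implements looks like this (helper names assumed; the locked-rq tracking comes from the patch below):

        struct rq *rq = cpu_rq(cpu);
        struct rq *locked = scx_locked_rq();   /* rq held by the current callback, if any */

        if (!locked) {
                /* No rq is held: take and release it around the update. */
                raw_spin_rq_lock_irqsave(rq, flags);
                /* ... apply the performance target ... */
                raw_spin_rq_unlock_irqrestore(rq, flags);
        } else if (locked != rq) {
                /* Can't touch a remote rq while a different rq is locked. */
                scx_error("...");
        } else {
                /* ... rq already locked by the caller, update directly ... */
        }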
2025-04-22  sched_ext: Track currently locked rq  (Andrea Righi)
Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks. While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors, see for example [1]. To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*(). This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed. [1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
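Conceptually, the tracking wraps every callback invocation; a simplified sketch (names assumed):

        static DEFINE_PER_CPU(struct rq *, scx_locked_rq_state);

        static inline void update_locked_rq(struct rq *rq)
        {
                if (rq)
                        lockdep_assert_rq_held(rq);     /* caller must really hold it */
                __this_cpu_write(scx_locked_rq_state, rq);
        }

        /* kfuncs ask "which rq, if any, is the current callback holding?" */
        static inline struct rq *scx_locked_rq(void)
        {
                return __this_cpu_read(scx_locked_rq_state);
        }

SCX_CALL_OP*() records the rq before dispatching into the BPF scheduler and clears it afterwards, so any kfunc invoked from the callback can make a safe locking decision.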
2025-04-18  sched_ext: add helper for refill task with default slice  (Honglei Wang)
Add a helper for refilling a task with the default slice and updating the event statistics accordingly.

Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
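The helper is essentially the following (a sketch; the event name is the one introduced by the rename below):

        static void refill_task_slice_dfl(struct task_struct *p)
        {
                p->scx.slice = SCX_SLICE_DFL;
                __scx_add_event(SCX_EV_REFILL_SLICE_DFL, 1);
        }

so callers no longer open-code the slice assignment and the event bump separately.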
2025-04-18  sched_ext: change the variable name for slice refill event  (Honglei Wang)
SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs when a task is enqueued, which is not accurate. What it actually counts are refills with the default slice, and these can occur either at enqueue time or in pick_task. Rename the variable to SCX_EV_REFILL_SLICE_DFL.

Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-14  sched_ext: Improve cross-compilation support in Makefile  (yangsonghua)
Modify the tools/sched_ext/Makefile to better handle cross-compilation environments by:

 1. Fix host tools build directory structure by separating obj/ from output (HOST_BUILD_DIR now points to $(OBJ_DIR)/host/obj)
 2. Properly propagate CROSS_COMPILE to libbpf sub-make invocation
 3. Add missing $(HOST_BPFOBJ) build rule with proper host toolchain flags (ARCH=, CROSS_COMPILE=, explicit HOSTCC/HOSTLD)
 4. Consistently quote $(HOSTCC) in bpftool build rule
 5. Change LDFLAGS assignment to += to allow external extensions

The changes ensure proper cross-compilation behavior while maintaining backward compatibility with native builds. Host tools are now correctly built with the host toolchain while target binaries use the cross-toolchain.

Signed-off-by: yangsonghua <yangsonghua@lixiang.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-09  sched_ext: Make scx_has_op a bitmap  (Tejun Heo)
scx_has_op is used to encode which ops are implemented by the BPF scheduler into an array of static_keys. While this saves a bit of branching overhead, that is unlikely to be noticeable compared to the overall cost. As the global static_keys can't work with the planned hierarchical multiple scheduler support, replace the static_key array with a bitmap. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
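The change boils down to the following (names approximate, simplified):

        /* before: one static key per op */
        static struct static_key_false scx_has_op[SCX_OPI_END];

        /* after: a plain bitmap; test_bit() is cheap enough here */
        static DECLARE_BITMAP(scx_has_op, SCX_OPI_END);

        #define SCX_OP_IDX(op)  (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
        #define SCX_HAS_OP(op)  test_bit(SCX_OP_IDX(op), scx_has_op)

Unlike global static keys, the bitmap can later be moved into struct scx_sched, one per scheduler instance.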
2025-04-09  sched_ext: Remove scx_ops_allow_queued_wakeup static_key  (Tejun Heo)
scx_ops_allow_queued_wakeup is used to encode SCX_OPS_ALLOW_QUEUED_WAKEUP into a static_key. The test is gated behind scx_enabled() and, even when sched_ext is enabled, the static_key is unlikely to make any meaningful difference. It was made a static_key mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_key and instead test SCX_OPS_ALLOW_QUEUED_WAKEUP directly.

In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09  sched_ext: Remove scx_ops_cpu_preempt static_key  (Tejun Heo)
scx_ops_cpu_preempt is used to encode whether ops.cpu_acquire/release() are implemented into a static_key. These tests aren't hot enough for static_key usage to make any meaningful difference and are made to use a static_key mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_key and instead use an internal ops flag SCX_OPS_HAS_CPU_PREEMPT to record and test whether ops.cpu_acquire/release() are implemented. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09  sched_ext: Remove scx_ops_enq_* static_keys  (Tejun Heo)
scx_ops_enq_last/exiting/migration_disabled are used to encode the corresponding SCX_OPS_ flags into static_keys. These flags aren't hot enough for static_key usage to make any meaningful difference and are made static_keys mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_keys and test the ops flags directly. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09  sched_ext: Indentation updates  (Tejun Heo)
Purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-08  sched_ext: Merge branch 'for-6.15-fixes' into for-6.16  (Tejun Heo)
Pull for-6.15-fixes to receive:

  e776b26e3701 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings")

which conflicts with:

  1a7ff7216c8b ("sched_ext: Drop "ops" from scx_ops_enable_state and friends")

The former removes code updated by the latter. Resolved by removing the updated section.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08  sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation  (Tejun Heo)
SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup weight support warnings. Now that the warnings are removed, the flag doesn't do anything. Mark it for deprecation and remove its usage from scx_flatcg. v2: Actually include the scx_flatcg update. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-04-08  sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings  (Tejun Heo)
sched_ext generates warnings when cpu.weight / cpu.idle are set to non-default values if the BPF scheduler doesn't implement weight support. These warnings don't provide much value while adding constant annoyance. A BPF scheduler may not implement any particular behavior and there's nothing particularly special about missing cgroup weight support. Drop the warnings. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08  sched_ext: Use kvzalloc for large exit_dump allocation  (Breno Leitao)
Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which can require large contiguous memory depending on the implementation. This change prevents allocation failures by allowing the system to fall back to vmalloc when contiguous memory allocation fails. Since this buffer is only used for debugging purposes, physical memory contiguity is not required, making vmalloc a suitable alternative. Cc: stable@vger.kernel.org Fixes: 07814a9439a3b0 ("sched_ext: Print debug dump after an error exit") Suggested-by: Rik van Riel <riel@surriel.com> Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
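The shape of the change (a sketch; field names approximate):

        ei->dump = kvzalloc(exit_dump_len, GFP_KERNEL);   /* was kzalloc() */
        if (!ei->dump)
                goto err_free;
        /* ... */
        kvfree(ei->dump);   /* kvfree() handles both kmalloc'd and vmalloc'd memory */

kvzalloc() tries kmalloc first and transparently falls back to vmalloc for large sizes, which is fine here since the dump buffer is only read by the kernel and never handed to hardware.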
2025-04-07  selftests/sched_ext: Add test for scx_bpf_select_cpu_and()  (Andrea Righi)
Add a selftest to validate the behavior of the built-in idle CPU selection policy applied to a subset of allowed CPUs, using scx_bpf_select_cpu_and(). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07  sched_ext: idle: Introduce scx_bpf_select_cpu_and()  (Andrea Righi)
Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply the built-in idle CPU selection policy to a subset of allowed CPUs.

This new helper is basically an extension of scx_bpf_select_cpu_dfl(). However, when an idle CPU can't be found, it returns a negative value instead of @prev_cpu, aligning its behavior more closely with scx_bpf_pick_idle_cpu().

It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying the built-in selection logic.

With this helper, BPF schedulers can apply the built-in idle CPU selection policy restricted to any arbitrary subset of CPUs.

Example usage
=============

Possible usage in ops.select_cpu():

        s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
                           s32 prev_cpu, u64 wake_flags)
        {
                const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
                s32 cpu;

                cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
                if (cpu >= 0) {
                        scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
                        return cpu;
                }

                return prev_cpu;
        }

Results
=======

Load distribution on a 4 sockets, 4 cores per socket system, simulated using virtme-ng, running a modified version of scx_bpfland that uses scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs:

  $ vng --cpu 16,sockets=4,cores=4,threads=1
  ...
  $ stress-ng -c 16
  ...
  $ htop
  ...
  0[     0.0%]   8[||||||||||||||||||||||||100.0%]
  1[     0.0%]   9[||||||||||||||||||||||||100.0%]
  2[     0.0%]  10[||||||||||||||||||||||||100.0%]
  3[     0.0%]  11[||||||||||||||||||||||||100.0%]
  4[     0.0%]  12[||||||||||||||||||||||||100.0%]
  5[     0.0%]  13[||||||||||||||||||||||||100.0%]
  6[     0.0%]  14[||||||||||||||||||||||||100.0%]
  7[     0.0%]  15[||||||||||||||||||||||||100.0%]

With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all the available CPUs.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07  sched_ext: idle: Accept an arbitrary cpumask in scx_select_cpu_dfl()  (Andrea Righi)
Many scx schedulers implement their own hard or soft-affinity rules to support topology characteristics, such as heterogeneous architectures (e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on specific properties (e.g., running certain tasks only in a subset of CPUs).

Currently, there is no mechanism that allows applying the built-in idle CPU selection policy to an arbitrary subset of CPUs. As a result, schedulers often implement their own idle CPU selection policies, which are typically similar to one another, leading to a lot of code duplication.

To address this, modify scx_select_cpu_dfl() to accept an arbitrary cpumask that can be used by BPF schedulers to apply the existing built-in idle CPU selection policy to a subset of allowed CPUs.

With this change the idle CPU selection policy becomes the following (see the sketch after this list):

 - always prioritize CPUs from fully idle SMT cores (if SMT is enabled),
 - select the same CPU if it's idle and in the allowed CPUs,
 - select an idle CPU within the same LLC, if the LLC cpumask is a subset of the allowed CPUs,
 - select an idle CPU within the same node, if the node cpumask is a subset of the allowed CPUs,
 - select an idle CPU within the allowed CPUs.

This functionality will be exposed through a dedicated kfunc in a separate patch.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
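A sketch of the resulting cascade (all helper names hypothetical; the fully-idle-SMT pass is omitted for brevity):

        static s32 pick_idle_cpu_in(s32 prev_cpu, const struct cpumask *allowed,
                                    const struct cpumask *llc, const struct cpumask *node)
        {
                s32 cpu;

                /* same CPU, if it's idle and allowed */
                if (cpumask_test_cpu(prev_cpu, allowed) && cpu_was_idle(prev_cpu))
                        return prev_cpu;

                /* same LLC, but only when the LLC lies entirely within @allowed */
                if (cpumask_subset(llc, allowed)) {
                        cpu = pick_any_idle(llc);
                        if (cpu >= 0)
                                return cpu;
                }

                /* same node, under the same containment condition */
                if (cpumask_subset(node, allowed)) {
                        cpu = pick_any_idle(node);
                        if (cpu >= 0)
                                return cpu;
                }

                /* anywhere in @allowed; negative if nothing idle was found */
                return pick_any_idle(allowed);
        }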
2025-04-07  sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()  (Andrea Righi)
Modify scx_select_cpu_dfl() to take the allowed cpumask as an explicit argument, instead of implicitly using @p->cpus_ptr. This prepares for future changes where arbitrary cpumasks may be passed to the built-in idle CPU selection policy. This is a pure refactoring with no functional changes. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07  sched_ext: idle: Extend topology optimizations to all tasks  (Andrea Righi)
The built-in idle selection policy, scx_select_cpu_dfl(), always prioritizes picking idle CPUs within the same LLC or NUMA node, but these optimizations are currently applied only when a task has no CPU affinity constraints. This is done primarily for efficiency, as it avoids the overhead of updating a cpumask every time we need to select an idle CPU (which can be costly in large SMP systems).

However, this approach limits the effectiveness of the built-in idle policy and results in inconsistent behavior, as affinity-restricted tasks don't benefit from topology-aware optimizations.

To address this, modify the policy to apply LLC and NUMA-aware optimizations even when a task is constrained to a subset of CPUs. We can still avoid updating the cpumasks by checking whether the LLC and node CPUs are contained in the subset of allowed CPUs usable by the task, which is true in most cases, i.e. for tasks that don't have affinity constraints. Moreover, use temporary local per-CPU cpumasks to determine the LLC and node subsets, minimizing potential overhead even on large SMP systems (see the sketch below).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
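The containment trick, sketched (mask names assumed):

        const struct cpumask *llc_cpus = llc_span(prev_cpu);
        const struct cpumask *mask;

        if (cpumask_subset(llc_cpus, p->cpus_ptr)) {
                /* common case: the whole LLC is usable, no mask update needed */
                mask = llc_cpus;
        } else {
                /* constrained case: intersect into a per-CPU scratch mask */
                struct cpumask *tmp = this_cpu_cpumask_var_ptr(local_llc_mask);

                cpumask_and(tmp, llc_cpus, p->cpus_ptr);
                mask = tmp;
        }

The cheap cpumask_subset() test keeps the common, unconstrained case free of cpumask writes; only affinity-restricted tasks pay for the intersection.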
2025-04-04  sched_ext: Drop "ops" from SCX_OPS_TASK_ITER_BATCH  (Tejun Heo)
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from SCX_OPS_TASK_ITER_BATCH. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-and-acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04  sched_ext: Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends  (Tejun Heo)
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04  sched_ext: Drop "ops" from scx_ops_exit(), scx_ops_error() and friends  (Tejun Heo)
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_exit(), scx_ops_error() and friends. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04  sched_ext: Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends  (Tejun Heo)
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04  sched_ext: Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled  (Tejun Heo)
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04  sched_ext: Drop "ops" from scx_ops_enable_state and friends  (Tejun Heo)
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_enable_state and friends. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04  Merge tag 'riscv-for-linus-6.15-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux  (Linus Torvalds)
Pull RISC-V updates from Palmer Dabbelt:

 - The sub-architecture selection Kconfig system has been cleaned up, the documentation has been improved, and various detections have been fixed

 - The vector-related extensions dependencies are now validated when parsing from device tree and in the DT bindings

 - Misaligned access probing can be overridden via a kernel command-line parameter, along with various fixes to misaligned access handling

 - Support for relocatable !MMU kernel builds

 - Support for huge pfnmaps, which should improve TLB utilization

 - Support for runtime constants, which improves the d_hash() performance

 - Support for bfloat16, Zicbom, Zaamo, Zalrsc, Zicntr, Zihpm

 - Various fixes, including:
    - We were missing a secondary mmu notifier call when flushing the tlb which is required for IOMMU
    - Fix ftrace panics by saving the registers as expected by ftrace
    - Fix a couple of stimecmp usages related to cpu hotplug
    - purgatory_start is now aligned as per the STVEC requirements
    - A fix for hugetlb when calculating the size of non-present PTEs

* tag 'riscv-for-linus-6.15-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (65 commits)
  riscv: Add norvc after .option arch in runtime const
  riscv: Make sure toolchain supports zba before using zba instructions
  riscv/purgatory: 4B align purgatory_start
  riscv/kexec_file: Handle R_RISCV_64 in purgatory relocator
  selftests: riscv: fix v_exec_initval_nolibc.c
  riscv: Fix hugetlb retrieval of number of ptes in case of !present pte
  riscv: print hartid on bringup
  riscv: Add norvc after .option arch in runtime const
  riscv: Remove CONFIG_PAGE_OFFSET
  riscv: Support CONFIG_RELOCATABLE on riscv32
  asm-generic: Always define Elf_Rel and Elf_Rela
  riscv: Support CONFIG_RELOCATABLE on NOMMU
  riscv: Allow NOMMU kernels to access all of RAM
  riscv: Remove duplicate CONFIG_PAGE_OFFSET definition
  RISC-V: errata: Use medany for relocatable builds
  dt-bindings: riscv: document vector crypto requirements
  dt-bindings: riscv: add vector sub-extension dependencies
  dt-bindings: riscv: d requires f
  RISC-V: add f & d extension validation checks
  RISC-V: add vector crypto extension validation checks
  ...
2025-04-04  Merge tag 'net-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)
Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter.

  Current release - regressions:

   - four fixes for the netdev per-instance locking

  Current release - new code bugs:

   - consolidate more code between existing Rx zero-copy and uring so that the latter doesn't miss / have to duplicate the safety checks

  Previous releases - regressions:

   - ipv6: fix omitted Netlink attributes when using SKIP_STATS

  Previous releases - always broken:

   - net: fix geneve_opt length integer overflow

   - udp: fix multiple wrap arounds of sk->sk_rmem_alloc when it approaches INT_MAX

   - dsa: mvpp2: add a lock to avoid corruption of the shared TCAM

   - dsa: airoha: fix issues with traffic QoS configuration / offload, and flow table offload

  Misc:

   - touch up the Netlink YAML specs of old families to make them usable for user space C codegen"

* tag 'net-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits)
  selftests: net: amt: indicate progress in the stress test
  netlink: specs: rt_route: pull the ifa- prefix out of the names
  netlink: specs: rt_addr: pull the ifa- prefix out of the names
  netlink: specs: rt_addr: fix get multi command name
  netlink: specs: rt_addr: fix the spec format / schema failures
  net: avoid false positive warnings in __net_mp_close_rxq()
  net: move mp dev config validation to __net_mp_open_rxq()
  net: ibmveth: make veth_pool_store stop hanging
  arcnet: Add NULL check in com20020pci_probe()
  ipv6: Do not consider link down nexthops in path selection
  ipv6: Start path selection from the first nexthop
  usbnet:fix NPE during rx_complete
  net: octeontx2: Handle XDP_ABORTED and XDP invalid as XDP_DROP
  net: fix geneve_opt length integer overflow
  io_uring/zcrx: fix selftests w/ updated netdev Python helpers
  selftests: net: use netdevsim in netns test
  docs: net: document netdev notifier expectations
  net: dummy: request ops lock
  netdevsim: add dummy device notifiers
  net: rename rtnl_net_debug to lock_debug
  ...
2025-04-04  Merge tag 'spi-fix-v6.15-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi  (Linus Torvalds)
Pull spi fixes from Mark Brown:
 "A small collection of fixes that came in during the merge window, everything is driver specific with nothing standing out particularly"

* tag 'spi-fix-v6.15-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: bcm2835: Restore native CS probing when pinctrl-bcm2835 is absent
  spi: bcm2835: Do not call gpiod_put() on invalid descriptor
  spi: cadence-qspi: revert "Improve spi memory performance"
  spi: cadence: Fix out-of-bounds array access in cdns_mrvl_xspi_setup_clock()
  spi: fsl-qspi: use devm function instead of driver remove
  spi: SPI_QPIC_SNAND should be tristate and depend on MTD
  spi-rockchip: Fix register out of bounds access
2025-04-04  Merge tag 'soc-drivers-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc  (Linus Torvalds)
Pull more SoC driver updates from Arnd Bergmann:
 "This is the promised follow-up to the soc drivers branch, adding minor updates to omap and freescale drivers.

  Most notably, Ioana Ciornei takes over maintenance of the DPAA bus driver used in some NXP (originally Freescale) chips"

* tag 'soc-drivers-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  bus: fsl-mc: Remove deadcode
  MAINTAINERS: add the linuppc-dev list to the fsl-mc bus entry
  MAINTAINERS: fix nonexistent dtbinding file name
  MAINTAINERS: add myself as maintainer for the fsl-mc bus
  irqdomain: soc: Switch to irq_find_mapping()
  Input: tsc2007 - accept standard properties
2025-04-04  Merge tag 'platform-drivers-x86-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86  (Linus Torvalds)
Pull x86 platform driver fixes from Ilpo Järvinen:

 - thinkpad_acpi:
    - Fix NULL pointer dereferences while probing
    - Disable ACPI fan access for T495* and E560

 - ISST: Correct command storage data length

* tag 'platform-drivers-x86-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  MAINTAINERS: consistently use my dedicated email address
  platform/x86: ISST: Correct command storage data length
  platform/x86: thinkpad_acpi: disable ACPI fan access for T495* and E560
  platform/x86: thinkpad_acpi: Fix NULL pointer dereferences while probing
2025-04-04  selftests: net: amt: indicate progress in the stress test  (Jakub Kicinski)
Our CI expects output from the test at least once every 10 minutes. The AMT test, when running on a debug kernel, is just on the edge of that time for the stress test.

Improve the output:
 - print the name of the test first, before starting it,
 - output a dot every 10% of the way.

Output after:

  TEST: amt discovery                                       [ OK ]
  TEST: IPv4 amt multicast forwarding                       [ OK ]
  TEST: IPv6 amt multicast forwarding                       [ OK ]
  TEST: IPv4 amt traffic forwarding torture ..........      [ OK ]
  TEST: IPv6 amt traffic forwarding torture ..........      [ OK ]

Reviewed-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250403145636.2891166-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04  Merge branch 'netlink-specs-rt_addr-fix-problems-revealed-by-c-codegen'  (Jakub Kicinski)
Jakub Kicinski says:

====================
netlink: specs: rt_addr: fix problems revealed by C codegen

I put together basic YNL C support for classic netlink. This revealed a few problems in the rt_addr spec.

v1: https://lore.kernel.org/20250401012939.2116915-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250403013706.2828322-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04  netlink: specs: rt_route: pull the ifa- prefix out of the names  (Jakub Kicinski)
YAML specs don't normally include the C prefix name in the name of the YAML attr. Remove the ifa- prefix from all attributes in route-attrs and metrics and specify name-prefix instead. This is a bit risky, hopefully there aren't many users out there. Fixes: 023289b4f582 ("doc/netlink: Add spec for rt route messages") Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250403013706.2828322-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04  netlink: specs: rt_addr: pull the ifa- prefix out of the names  (Jakub Kicinski)
YAML specs don't normally include the C prefix name in the name of the YAML attr. Remove the ifa- prefix from all attributes in addr-attrs and specify name-prefix instead. This is a bit risky, hopefully there aren't many users out there. Fixes: dfb0f7d9d979 ("doc/netlink: Add spec for rt addr messages") Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250403013706.2828322-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04  netlink: specs: rt_addr: fix get multi command name  (Jakub Kicinski)
Command names should match C defines, codegens may depend on it. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Fixes: 4f280376e531 ("selftests/net: Add selftest for IPv4 RTM_GETMULTICAST support") Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250403013706.2828322-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04  netlink: specs: rt_addr: fix the spec format / schema failures  (Jakub Kicinski)
The spec is mis-formatted, schema validation says:

  Failed validating 'type' in schema['properties']['operations']['properties']['list']['items']['properties']['dump']['properties']['request']['properties']['value']:
      {'minimum': 0, 'type': 'integer'}

  On instance['operations']['list'][3]['dump']['request']['value']:
      '58 - ifa-family'

The ifa-family clearly wants to be part of an attribute list.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Yuyang Huang <yuyanghuang@google.com>
Fixes: 4f280376e531 ("selftests/net: Add selftest for IPv4 RTM_GETMULTICAST support")
Link: https://patch.msgid.link/20250403013706.2828322-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04  Merge branch 'net-make-memory-provider-install-close-paths-more-common'  (Jakub Kicinski)
Jakub Kicinski says:

====================
net: make memory provider install / close paths more common

We seem to be fixing bugs in config path for devmem which also exist in the io_uring ZC path. Let's try to make the two paths more common, otherwise this is bound to keep happening.

Found by code inspection and compile tested only.

v1: https://lore.kernel.org/20250331194201.2026422-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250403013405.2827250-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04  net: avoid false positive warnings in __net_mp_close_rxq()  (Jakub Kicinski)
Commit under Fixes solved the problem of spurious warnings when we uninstall an MP from a device while it's down. The __net_mp_close_rxq() path, which is used by io_uring, was not fixed. Move the fix over and reuse __net_mp_close_rxq() in the devmem path.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250403013405.2827250-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04  net: move mp dev config validation to __net_mp_open_rxq()  (Jakub Kicinski)
devmem code performs a number of safety checks to avoid having to reimplement all of them in the drivers. Move those to __net_mp_open_rxq() and reuse that function for binding to make sure that io_uring ZC also benefits from them. While at it, rename the queue ID variable to rxq_idx in __net_mp_open_rxq(); we touch most of the relevant lines anyway.

The XArray insertion is reordered after the netdev_rx_queue_restart() call, otherwise we'd need to duplicate the queue index check or risk inserting an invalid pointer. The XArray allocation failures should be extremely rare.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 6e18ed929d3b ("net: add helpers for setting a memory provider on an rx queue")
Link: https://patch.msgid.link/20250403013405.2827250-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04  net: ibmveth: make veth_pool_store stop hanging  (Dave Marquardt)
v2:
 - Created a single error handling unlock and exit in veth_pool_store
 - Greatly expanded commit message with previous explanatory-only text

Summary: Use rtnl_mutex to synchronize veth_pool_store with itself, ibmveth_close and ibmveth_open, preventing multiple calls in a row to napi_disable.

Background: Two (or more) threads could call veth_pool_store through writing to /sys/devices/vio/30000002/pool*/*. You can do this easily with a little shell script. This causes a hang.

I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new kernel. I ran this test again and saw:

  Setting pool0/active to 0
  Setting pool1/active to 1
  [   73.911067][ T4365] ibmveth 30000002 eth0: close starting
  Setting pool1/active to 1
  Setting pool1/active to 0
  [   73.911367][ T4366] ibmveth 30000002 eth0: close starting
  [   73.916056][ T4365] ibmveth 30000002 eth0: close complete
  [   73.916064][ T4365] ibmveth 30000002 eth0: open starting
  [  110.808564][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
  [  230.808495][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
  [  243.683786][  T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
  [  243.683827][  T123]       Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8
  [  243.683833][  T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [  243.683838][  T123] task:stress.sh       state:D stack:28096 pid:4365  tgid:4365  ppid:4364   task_flags:0x400040 flags:0x00042000
  [  243.683852][  T123] Call Trace:
  [  243.683857][  T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
  [  243.683868][  T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
  [  243.683878][  T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
  [  243.683888][  T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
  [  243.683896][  T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
  [  243.683904][  T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
  [  243.683913][  T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
  [  243.683921][  T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
  [  243.683928][  T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
  [  243.683936][  T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
  [  243.683944][  T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
  [  243.683951][  T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
  [  243.683958][  T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
  [  243.683966][  T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
  [  243.683973][  T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
  ...
  [  243.684087][  T123] Showing all locks held in the system:
  [  243.684095][  T123] 1 lock held by khungtaskd/123:
  [  243.684099][  T123]  #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
  [  243.684114][  T123] 4 locks held by stress.sh/4365:
  [  243.684119][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
  [  243.684132][  T123]  #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
  [  243.684143][  T123]  #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
  [  243.684155][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
  [  243.684166][  T123] 5 locks held by stress.sh/4366:
  [  243.684170][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
  [  243.684183][  T123]  #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
  [  243.684194][  T123]  #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
  [  243.684205][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
  [  243.684216][  T123]  #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0

From the ibmveth debug, two threads are calling veth_pool_store, which calls ibmveth_close and ibmveth_open. Here's the sequence:

  T4365              T4366
  -----------------  -----------------
  veth_pool_store    veth_pool_store
   ibmveth_close
                      ibmveth_close
   napi_disable
                      napi_disable
   ibmveth_open
    napi_enable <- HANG

ibmveth_close calls napi_disable at the top and ibmveth_open calls napi_enable at the top.

https://docs.kernel.org/networking/napi.html says:

  The control APIs are not idempotent. Control API calls are safe against concurrent use of datapath APIs but an incorrect sequence of control API calls may result in crashes, deadlocks, or race conditions. For example, calling napi_disable() multiple times in a row will deadlock.

In the normal open and close paths, rtnl_mutex is acquired to prevent other callers. This is missing from veth_pool_store. Using rtnl_mutex in veth_pool_store fixes these hangs (see the sketch below).

Signed-off-by: Dave Marquardt <davemarq@linux.ibm.com>
Fixes: 860f242eb534 ("[PATCH] ibmveth change buffer pools dynamically")
Reviewed-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250402154403.386744-1-davemarq@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
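The resulting shape of veth_pool_store (a sketch, not the exact diff):

        rtnl_lock();    /* serialize with ibmveth_open()/ibmveth_close() and other writers */
        if (netif_running(netdev)) {
                ibmveth_close(netdev);
                /* ... update the buffer pool ... */
                rc = ibmveth_open(netdev);
        }
        rtnl_unlock();

With rtnl_mutex held across the close/open cycle, two writers can no longer interleave, so napi_disable() is never called twice in a row.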
2025-04-04  arcnet: Add NULL check in com20020pci_probe()  (Henry Martin)
devm_kasprintf() returns NULL when memory allocation fails. Currently, com20020pci_probe() does not check for this case, which results in a NULL pointer dereference. Add NULL check after devm_kasprintf() to prevent this issue and ensure no resources are left allocated. Fixes: 6b17a597fc2f ("arcnet: restoring support for multiple Sohard Arcnet cards") Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com> Link: https://patch.msgid.link/20250402135036.44697-1-bsdhenrymartin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
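The fix follows the usual devm_kasprintf() pattern (names and format string illustrative):

        name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "card%d", id);
        if (!name)
                return -ENOMEM; /* devm cleanup releases everything already allocated */

Returning early is safe precisely because, as the commit notes, the devm-managed allocations are released by the driver core, so no resources are left allocated.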
2025-04-04  Merge branch 'ipv6-multipath-routing-fixes'  (Jakub Kicinski)
Ido Schimmel says: ==================== ipv6: Multipath routing fixes This patchset contains two fixes for IPv6 multipath routing. See the commit messages for more details. ==================== Link: https://patch.msgid.link/20250402114224.293392-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-04  ipv6: Do not consider link down nexthops in path selection  (Ido Schimmel)
Nexthops whose link is down are not supposed to be considered during path selection when the "ignore_routes_with_linkdown" sysctl is set. This is done by assigning them a negative region boundary. However, when comparing the computed hash (unsigned) with the region boundary (signed), the negative region boundary is treated as unsigned, resulting in incorrect nexthop selection. Fix by treating the computed hash as signed. Note that the computed hash is always in range of [0, 2^31 - 1]. Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250402114224.293392-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
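The comparison at the heart of the fix, simplified:

        /*
         * hash is u32 but always in [0, 2^31 - 1]; nh_upper_bound is
         * negative for nexthops excluded due to link-down, so a signed
         * comparison can never select them.
         */
        if ((int)hash <= atomic_read(&nh->fib_nh_upper_bound))
                /* this nexthop's region covers the hash: select it */;

Previously the unsigned comparison converted the negative boundary into a huge positive value, making an excluded nexthop look like a match for any hash.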