summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-04-29sched_ext: Factor out scx_alloc_and_add_sched()Tejun Heo
More will be moved into scx_sched. Factor out the allocation and kobject addition path into scx_alloc_and_add_sched(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Inline create_dsq() into scx_bpf_create_dsq()Tejun Heo
create_dsq() is only used by scx_bpf_create_dsq() and the separation gets in the way of making dsq_hash per scx_sched. Inline it into scx_bpf_create_dsq(). While at it, add unlikely() around SCX_DSQ_FLAG_BUILTIN test. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Use dynamic allocation for scx_schedTejun Heo
To prepare for supporting multiple schedulers, make scx_sched allocated dynamically. scx_sched->kobj is now an embedded field and the kobj's lifetime determines the lifetime of the containing scx_sched. - Enable path is updated so that kobj init and addition are performed later. - scx_sched freeing is initiated in scx_kobj_release() and also goes through an rcu_work so that scx_root can be accessed from an unsynchronized path - scx_disable(). - sched_ext_ops->priv is added and used to point to scx_sched instance created for the ops instance. This is used by bpf_scx_unreg() to determine the scx_sched instance to disable and put. No behavior changes intended. v2: Andrea reported kernel oops due to scx_bpf_unreg() trying to deref NULL scx_root after scheduler init failure. sched_ext_ops->priv added so that scx_bpf_unreg() can always find the scx_sched instance to unregister even if it failed early during init. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Avoid NULL scx_root deref through SCX_HAS_OP()Tejun Heo
SCX_HAS_OP() tests scx_root->has_op bitmap. The bitmap is currently in a statically allocated struct scx_sched and initialized while loading the BPF scheduler and cleared while unloading, and thus can be tested anytime. However, scx_root will be switched to dynamic allocation and thus won't always be deferenceable. Most usages of SCX_HAS_OP() are already protected by scx_enabled() either directly or indirectly (e.g. through a task which is on SCX). However, there are a couple places that could try to dereference NULL scx_root. Update them so that scx_root is guaranteed to be valid before SCX_HAS_OP() is called. - In handle_hotplug(), test whether scx_root is NULL before doing anything else. This is safe because scx_root updates will be protected by cpus_read_lock(). - In scx_tg_offline(), test scx_cgroup_enabled before invoking SCX_HAS_OP(), which should guarnatee that scx_root won't turn NULL. This is also in line with other cgroup operations. As the code path is synchronized against scx_cgroup_init/exit() through scx_cgroup_rwsem, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Introduce scx_schedTejun Heo
To support multiple scheduler instances, collect some of the global variables that should be specific to a scheduler instance into the new struct scx_sched. scx_root is the root scheduler instance and points to a static instance of struct scx_sched. Except for an extra dereference through the scx_root pointer, this patch makes no functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29Merge branch 'for-6.15-fixes' into for-6.16Tejun Heo
To receive e38be1c7647c ("sched_ext: Fix rq lock state in hotplug ops") to avoid conflicts with scx_sched related patches pending for for-6.16.
2025-04-29sched_ext: Fix rq lock state in hotplug opsAndrea Righi
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume that the rq involved in the operation is locked, which is not the case during hotplug, triggering the following warning: WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340 Fix by not tracking the target rq as locked in the context of ops.cpu_online() and ops.cpu_offline(). Fixes: 18853ba782bef ("sched_ext: Track currently locked rq") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Tested-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-29entry: Inline syscall_exit_to_user_mode()Charlie Jenkins
Similar to commit 221a164035fd ("entry: Move syscall_enter_from_user_mode() to header file"), move syscall_exit_to_user_mode() to the header file as well. Testing was done with the byte-unixbench syscall benchmark (which calls getpid) and QEMU. On riscv I measured a 7.09246% improvement, on x86 a 2.98843% improvement, on loongarch a 6.07954% improvement, and on s390 a 11.1328% improvement. The Intel bot also reported "kernel test robot noticed a 1.9% improvement of stress-ng.seek.ops_per_sec". Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/all/20250320-riscv_optimize_entry-v6-4-63e187e26041@rivosinc.com Link: https://lore.kernel.org/linux-riscv/202502051555.85ae6844-lkp@intel.com/
2025-04-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc4Alexei Starovoitov
Cross-merge bpf and other fixes after downstream PRs. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-28PM: EM: Fix potential division-by-zero error in em_compute_costs()Yaxiong Tian
When the device is of a non-CPU type, table[i].performance won't be initialized in the previous em_init_performance(), resulting in division by zero when calculating costs in em_compute_costs(). Since the 'cost' algorithm is only used for EAS energy efficiency calculations and is currently not utilized by other device drivers, we should add the _is_cpu_device(dev) check to prevent this division-by-zero issue. Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division") Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://patch.msgid.link/tencent_7F99ED4767C1AF7889D0D8AD50F34859CE06@qq.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-28timekeeping: Prevent coarse clocks going backwardsThomas Gleixner
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time inconsistencies. Lei tracked down that this was being caused by the adjustment: tk->tkr_mono.xtime_nsec -= offset; which is made to compensate for the unaccumulated cycles in offset when the multiplicator is adjusted forward, so that the non-_COARSE clockids don't see inconsistencies. However, the _COARSE clockid getter functions use the adjusted xtime_nsec value directly and do not compensate the negative offset via the clocksource delta multiplied with the new multiplicator. In that case the caller can observe time going backwards in consecutive calls. By design, this negative adjustment should be fine, because the logic run from timekeeping_adjust() is done after it accumulated approximately multiplicator * interval_cycles into xtime_nsec. The accumulated value is always larger then the mult_adj * offset value, which is subtracted from xtime_nsec. Both operations are done together under the tk_core.lock, so the net change to xtime_nsec is always always be positive. However, do_adjtimex() calls into timekeeping_advance() as well, to apply the NTP frequency adjustment immediately. In this case, timekeeping_advance() does not return early when the offset is smaller then interval_cycles. In that case there is no time accumulated into xtime_nsec. But the subsequent call into timekeeping_adjust(), which modifies the multiplicator, subtracts from xtime_nsec to correct for the new multiplicator. Here because there was no accumulation, xtime_nsec becomes smaller than before, which opens a window up to the next accumulation, where the _COARSE clockid getters, which don't compensate for the offset, can observe the inconsistency. This has been tried to be fixed by forwarding the timekeeper in the case that adjtimex() adjusts the multiplier, which resets the offset to zero: 757b000f7b93 ("timekeeping: Fix possible inconsistencies in _COARSE clockids") That works correctly, but unfortunately causes a regression on the adjtimex() side. There are two issues: 1) The forwarding of the base time moves the update out of the original period and establishes a new one. 2) The clearing of the accumulated NTP error is changing the behaviour as well. User-space expects that multiplier/frequency updates are in effect, when the syscall returns, so delaying the update to the next tick is not solving the problem either. Commit 757b000f7b93 was reverted so that the established expectations of user space implementations (ntpd, chronyd) are restored, but that obviously brought the inconsistencies back. One of the initial approaches to fix this was to establish a separate storage for the coarse time getter nanoseconds part by calculating it from the offset. That was dropped on the floor because not having yet another state to maintain was simpler. But given the result of the above exercise, this solution turns out to be the right one. Bring it back in a slightly modified form. Thus introduce timekeeper::coarse_nsec and store that nanoseconds part in it, switch the time getter functions and the VDSO update to use that value. coarse_nsec is set on operations which forward or initialize the timekeeper and after time was accumulated during a tick. If there is no accumulation the timestamp is unchanged. This leaves the adjtimex() behaviour unmodified and prevents coarse time from going backwards. [ jstultz: Simplified the coarse_nsec calculation and kept behavior so coarse clockids aren't adjusted on each inter-tick adjtimex call, slightly reworked the comments and commit message ] Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE") Reported-by: Lei Chen <lei.chen@smartx.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/all/20250419054706.2319105-1-jstultz@google.com Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
2025-04-26Merge tag 'sched-urgent-2025-04-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix sporadic crashes in dequeue_entities() due to ... bad math. [ Arguably if pick_eevdf()/pick_next_entity() was less trusting of complex math being correct it could have de-escalated a crash into a warning, but that's for a different patch ]" * tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
2025-04-26Merge tag 'perf-urgent-2025-04-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc perf events fixes from Ingo Molnar: - Use POLLERR for events in error state, instead of the ambiguous POLLHUP error value - Fix non-sampling (counting) events on certain x86 platforms * tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix non-sampling (counting) events on certain x86 platforms perf/core: Change to POLLERR for pinned events with error
2025-04-26sched/eevdf: Fix se->slice being set to U64_MAX and resulting crashOmar Sandoval
There is a code path in dequeue_entities() that can set the slice of a sched_entity to U64_MAX, which sometimes results in a crash. The offending case is when dequeue_entities() is called to dequeue a delayed group entity, and then the entity's parent's dequeue is delayed. In that case: 1. In the if (entity_is_task(se)) else block at the beginning of dequeue_entities(), slice is set to cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX. 2. The first for_each_sched_entity() loop dequeues the entity. 3. If the entity was its parent's only child, then the next iteration tries to dequeue the parent. 4. If the parent's dequeue needs to be delayed, then it breaks from the first for_each_sched_entity() loop _without updating slice_. 5. The second for_each_sched_entity() loop sets the parent's ->slice to the saved slice, which is still U64_MAX. This throws off subsequent calculations with potentially catastrophic results. A manifestation we saw in production was: 6. In update_entity_lag(), se->slice is used to calculate limit, which ends up as a huge negative number. 7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit is negative, vlag > limit, so se->vlag is set to the same huge negative number. 8. In place_entity(), se->vlag is scaled, which overflows and results in another huge (positive or negative) number. 9. The adjusted lag is subtracted from se->vruntime, which increases or decreases se->vruntime by a huge number. 10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which incorrectly returns false because the vruntime is so far from the other vruntimes on the queue, causing the (vruntime - cfs_rq->min_vruntime) * load calulation to overflow. 11. Nothing appears to be eligible, so pick_eevdf() returns NULL. 12. pick_next_entity() tries to dereference the return value of pick_eevdf() and crashes. Dumping the cfs_rq states from the core dumps with drgn showed tell-tale huge vruntime ranges and bogus vlag values, and I also traced se->slice being set to U64_MAX on live systems (which was usually "benign" since the rest of the runqueue needed to be in a particular state to crash). Fix it in dequeue_entities() by always setting slice from the first non-empty cfs_rq. Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
2025-04-26pidfs: get rid of __pidfd_prepare()Christian Brauner
Fold it into pidfd_prepare() and rename PIDFD_CLONE to PIDFD_STALE to indicate that the passed pid might not have task linkage and no explicit check for that should be performed. Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-3-450a19461e75@kernel.org Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: David Rheinsberg <david@readahead.eu> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-25Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: - Add namespace to BPF internal symbols (Alexei Starovoitov) - Fix possible endless loop in BPF map iteration (Brandon Kammerdiener) - Fix compilation failure for samples/bpf on LoongArch (Haoran Jiang) - Disable a part of sockmap_ktls test (Ihor Solodrai) - Correct typo in __clang_major__ macro (Peilin Ye) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Correct typo in __clang_major__ macro samples/bpf: Fix compilation failure for samples/bpf on LoongArch Fedora bpf: Add namespace to BPF internal symbols selftests/bpf: add test for softlock when modifying hashmap while iterating bpf: fix possible endless loop in BPF map iteration selftests/bpf: Mitigate sockmap_ktls disconnect_after_delete failure
2025-04-25cgroup/rstat: Improve cgroup_rstat_push_children() documentationWaiman Long
The cgroup_rstat_push_children() function converts a set of updated_children lists from different cgroups into a single ordered list of cgroups to be flushed via the rstat_flush_next pointer. The algorithm used isn't that well illustrated and it takes time to grasp what it is doing. Improve the embedded documentation and variable names to better illustrate the transformation process and make the code easier to understand. Also cgroup_rstat_lock must be held for the whole duration from where the rstat_flush_next list is being constructed in cgroup_rstat_push_children() to when it is consumed later in css_rstat_flush(). Otherwise, list corruption can happen leading to system crash as reported in [1]. In this particular case, the branch being used has commit 093c8812de2d ("cgroup: rstat: Cleanup flushing functions and locking") which breaks this rule, but is missing the fix commit 7d6c63c31914 ("cgroup: rstat: call cgroup_rstat_updated_list with cgroup_rstat_lock") that fixes it. This patch has no functional change. [1] https://lore.kernel.org/lkml/BY5PR04MB68495E9E8A46CA9614D62669BCBB2@BY5PR04MB6849.namprd04.prod.outlook.com/ Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-25PM: sleep: Remove unnecessary !!Zihuan Zhang
Since initcall_debug is a bool variable, it is not necessary to convert it to bool with the help of a double logical negation (!!). Remove the redundant operation. Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn> Link: https://patch.msgid.link/20250424060339.73119-1-zhangzihuan@kylinos.cn [ rjw: Changelog rewrite ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-25sched_ext: Remove duplicate BTF_ID_FLAGS definitionsAndrea Righi
Some kfuncs specific to the idle CPU selection policy are registered in both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though they should only be defined in the latter. Remove the duplicates from scx_kfunc_ids_any. Fixes: 337d1b354a297 ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-25Merge tag 'dma-mapping-6.15-2025-04-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-maping fixes from Marek Szyprowski: - avoid unused variable warnings (Arnd Bergmann, Marek Szyprowski) - add runtume warnings and debug messages for devices with limited DMA capabilities (Balbir Singh, Chen-Yu Tsai) * tag 'dma-mapping-6.15-2025-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-coherent: Warn if OF reserved memory is beyond current coherent DMA mask dma-mapping: Fix warning reported for missing prototype dma-mapping: avoid potential unused data compilation warning dma/mapping.c: dev_dbg support for dma_addressing_limited dma/contiguous: avoid warning about unused size_bytes
2025-04-25bpf: Add namespace to BPF internal symbolsAlexei Starovoitov
Add namespace to BPF internal symbols used by light skeleton to prevent abuse and document with the code their allowed usage. Fixes: b1d18a7574d0 ("bpf: Extend sys_bpf commands for bpf_syscall programs.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20250425014542.62385-1-alexei.starovoitov@gmail.com
2025-04-25bpf: fix possible endless loop in BPF map iterationBrandon Kammerdiener
The _safe variant used here gets the next element before running the callback, avoiding the endless loop condition. Signed-off-by: Brandon Kammerdiener <brandon.kammerdiener@intel.com> Link: https://lore.kernel.org/r/20250424153246.141677-2-brandon.kammerdiener@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com>
2025-04-25perf/core: Fix broken throttling when max_samples_per_tick=1Qing Wang
According to the throttling mechanism, the pmu interrupts number can not exceed the max_samples_per_tick in one tick. But this mechanism is ineffective when max_samples_per_tick=1, because the throttling check is skipped during the first interrupt and only performed when the second interrupt arrives. Perhaps this bug may cause little influence in one tick, but if in a larger time scale, the problem can not be underestimated. When max_samples_per_tick = 1: Allowed-interrupts-per-second max-samples-per-second default-HZ ARCH 200 100 100 X86 500 250 250 ARM64 ... Obviously, the pmu interrupt number far exceed the user's expect. Fixes: e050e3f0a71b ("perf: Fix broken interrupt rate throttling") Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250405141635.243786-3-wangqing7171@gmail.com
2025-04-25Merge branch 'perf/urgent'Peter Zijlstra
Merge urgent fixes for dependencies. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2025-04-24cgroup: fix goto ordering in cgroup_init()JP Kobryn
Go to the appropriate section labels when css_rstat_init() or psi_cgroup_alloc() fails. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Fixes: a97915559f5c ("cgroup: change rstat function signatures from cgroup-based to css-based") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc4). This pull includes wireless and a fix to vxlan which isn't in Linus's tree just yet. The latter creates with a silent conflict / build breakage, so merging it now to avoid causing problems. drivers/net/vxlan/vxlan_vnifilter.c 094adad91310 ("vxlan: Use a single lock to protect the FDB table") 087a9eb9e597 ("vxlan: vnifilter: Fix unlocked deletion of default FDB entry") https://lore.kernel.org/20250423145131.513029-1-idosch@nvidia.com No "normal" conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24Merge branch 'kvm-fixes-6.15-rc4' into HEADPaolo Bonzini
* Single fix for broken usage of 'multi-MIDR' infrastructure in PI code, adding an open-coded erratum check for Cavium ThunderX * Bugfixes from a planned posted interrupt rework * Do not use kvm_rip_read() unconditionally to cater for guests with inaccessible register state.
2025-04-24timers: Remove unused __round_jiffies(_up)Dr. David Alan Gilbert
Remove two trivial but long unused functions. __round_jiffies() has been unused since 2008's commit 9c133c469d38 ("Add round_jiffies_up and related routines") __round_jiffies_up() has been unused since 2019's commit 7ae3f6e130e8 ("powerpc/watchdog: Use hrtimers for per-CPU heartbeat") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250418200803.427911-1-linux@treblig.org
2025-04-23sched_ext: Clarify CPU context for running/stopping callbacksAndrea Righi
The ops.running() and ops.stopping() callbacks can be invoked from a CPU other than the one the task is assigned to, particularly when a task property is changed, as both scx_next_task_scx() and dequeue_task_scx() may run on CPUs different from the task's target CPU. This behavior can lead to confusion or incorrect assumptions if not properly clarified, potentially resulting in bugs (see [1]). Therefore, update the documentation to clarify this aspect and advise users to use scx_bpf_task_cpu() to determine the actual CPU the task will run on or was running on. [1] https://github.com/sched-ext/scx/pull/1728 Cc: Jake Hillion <jake@hillion.co.uk> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-23bpf: Allow access to const void pointer arguments in tracing programsKaFai Wan
Adding support to access arguments with const void pointer arguments in tracing programs. Currently we allow tracing programs to access void pointers. If we try to access argument which is pointer to const void like 2nd argument in kfree, verifier will fail to load the program with; 0: R1=ctx() R10=fp0 ; asm volatile ("r2 = *(u64 *)(r1 + 8); "); 0: (79) r2 = *(u64 *)(r1 +8) func 'kfree' arg1 type UNKNOWN is not a struct Changing the is_int_ptr to void and generic integer check and renaming it to is_void_or_int_ptr. Signed-off-by: KaFai Wan <mannkafai@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250423121329.3163461-2-mannkafai@gmail.com
2025-04-23bpf: Streamline allowed helpers between tracing and base setsFeng Yang
Many conditional checks in switch-case are redundant with bpf_base_func_proto and should be removed. Regarding the permission checks bpf_base_func_proto: The permission checks in bpf_prog_load (as outlined below) ensure that the trace has both CAP_BPF and CAP_PERFMON capabilities, thus enabling the use of corresponding prototypes in bpf_base_func_proto without adverse effects. bpf_prog_load ...... bpf_cap = bpf_token_capable(token, CAP_BPF); ...... if (type != BPF_PROG_TYPE_SOCKET_FILTER && type != BPF_PROG_TYPE_CGROUP_SKB && !bpf_cap) goto put_token; ...... if (is_perfmon_prog_type(type) && !bpf_token_capable(token, CAP_PERFMON)) goto put_token; ...... Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20250423073151.297103-1-yangfeng59949@163.com
2025-04-23bpf: Use proper type to calculate bpf_raw_tp_null_args.mask indexShung-Hsi Yu
The calculation of the index used to access the mask field in 'struct bpf_raw_tp_null_args' is done with 'int' type, which could overflow when the tracepoint being attached has more than 8 arguments. While none of the tracepoints mentioned in raw_tp_null_args[] currently have more than 8 arguments, there do exist tracepoints that had more than 8 arguments (e.g. iocost_iocg_forgive_debt), so use the correct type for calculation and avoid Smatch static checker warning. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20250418074946.35569-1-shung-hsi.yu@suse.com Closes: https://lore.kernel.org/r/843a3b94-d53d-42db-93d4-be10a4090146@stanley.mountain/
2025-04-23workqueue: Fix race condition in wq->stats incrementationJiayuan Chen
Fixed a race condition in incrementing wq->stats[PWQ_STAT_COMPLETED] by moving the operation under pool->lock. Reported-by: syzbot+01affb1491750534256d@syzkaller.appspotmail.com Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-23Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio fixes from Michael Tsirkin: "A small number of fixes: - virtgpu is exempt from reset shutdown fow now - a more complete fix is in the works - spec compliance fixes in: - virtio-pci cap commands - vhost_scsi_send_bad_target - virtio console resize - missing locking fix in vhost-scsi - virtio ring - a KCSAN false positive fix - VHOST_*_OWNER documentation fix" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost-scsi: Fix vhost_scsi_send_status() vhost-scsi: Fix vhost_scsi_send_bad_target() vhost-scsi: protect vq->log_used with vq->mutex vhost_task: fix vhost_task_create() documentation virtio_console: fix order of fields cols and rows virtio_console: fix missing byte order handling for cols and rows virtgpu: don't reset on shutdown virtio_ring: Fix data race by tagging event_triggered as racy for KCSAN vhost: fix VHOST_*_OWNER documentation virtio_pci: Use self group type for cap commands
2025-04-23perf/core: Change to POLLERR for pinned events with errorNamhyung Kim
Commit: f4b07fd62d4d11d5 ("perf/core: Use POLLHUP for pinned events in error") started to emit POLLHUP for pinned events in an error state. But the POLLHUP is also used to signal events that the attached task is terminated. To distinguish pinned per-task events in the error state it would need to check if the task is live. Change it to POLLERR to make it clear. Suggested-by: Gabriel Marin <gmx@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250422223318.180343-1-namhyung@kernel.org
2025-04-22Merge branch 'for-6.15-fixes' into for-6.16Tejun Heo
a11d6784d731 ("sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()") added a call to scx_ops_error() which was renamed to scx_error() in for-6.16. Fix it up.
2025-04-22sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()Andrea Righi
scx_bpf_cpuperf_set() can be used to set a performance target level on any CPU. However, it doesn't correctly acquire the corresponding rq lock, which may lead to unsafe behavior and trigger the following warning, due to the lockdep_assert_rq_held() check: [ 51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0 ... [ 51.713836] Call trace: [ 51.713837] scx_bpf_cpuperf_set+0x1a0/0x1e0 (P) [ 51.713839] bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440 [ 51.713841] bpf__sched_ext_ops_init+0x54/0x8c [ 51.713843] scx_ops_enable.constprop.0+0x2c0/0x10f0 [ 51.713845] bpf_scx_reg+0x18/0x30 [ 51.713847] bpf_struct_ops_link_create+0x154/0x1b0 [ 51.713849] __sys_bpf+0x1934/0x22a0 Fix by properly acquiring the rq lock when possible or raising an error if we try to operate on a CPU that is not the one currently locked. Fixes: d86adb4fc0655 ("sched_ext: Add cpuperf support") Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22sched_ext: Track currently locked rqAndrea Righi
Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks. While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors, see for example [1]. To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*(). This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed. [1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22dma-coherent: Warn if OF reserved memory is beyond current coherent DMA maskChen-Yu Tsai
When a reserved memory region described in the device tree is attached to a device, it is expected that the device's limitations are correctly included in that description. However, if the device driver failed to implement DMA address masking or addressing beyond the default 32 bits (on arm64), then bad things could happen because the DMA address was truncated, such as playing back audio with no actual audio coming out, or DMA overwriting random blocks of kernel memory. Check against the coherent DMA mask when the memory regions are attached to the device. Give a warning when the memory region can not be covered by the mask. A warning instead of a hard error was chosen, because it is possible that existing drivers could be working fine even if they forgot to extend the coherent DMA mask. Signed-off-by: Chen-Yu Tsai <wenst@chromium.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250421083930.374173-1-wenst@chromium.org
2025-04-22dma-mapping: Fix warning reported for missing prototypeBalbir Singh
lkp reported a warning about missing prototype for a recent patch. The kernel-doc style comments are out of sync, move them to the right function. Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Christoph Hellwig <hch@lst.de> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202504190615.g9fANxHw-lkp@intel.com/ Signed-off-by: Balbir Singh <balbirs@nvidia.com> [mszyprow: reformatted subject] Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250422114034.3535515-1-balbirs@nvidia.com
2025-04-22PM: sleep: Use two lines for "Restarting..." / "done" messagesAndrew Sayers
Other messages are occasionally printed between these two, for example: [203104.106534] Restarting tasks ... [203104.106559] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915]) [203104.112354] done. This seems to be a timing issue, seen in two of the eleven hibernation exits in my current `dmesg` output. When printed on its own, the "done" message has the default log level. This makes the output of `dmesg --level=warn` quite misleading. Add enough context for the "done" messages to make sense on their own, and use the same log level for all messages. Change the messages to "<event>..." / "Done <event>.", unlike a449dfbfc089 which uses "<event>..." / "<event> completed.". Front-loading the unique part of the message makes it easier to scan the log, and reduces ambiguity for users who aren't confident in their English comprehension. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Andrew Sayers <kernel.org@pileofstuff.org> Link: https://patch.msgid.link/20250411152632.2806038-1-kernel.org@pileofstuff.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-21Merge tag 'sched_ext-for-6.15-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Use kvzalloc() so that large exit_dump buffer allocations don't fail easily - Remove cpu.weight / cpu.idle unimplemented warnings which are more annoying than helpful. This makes SCX_OPS_HAS_CGROUP_WEIGHT unnecessary. Mark it for deprecation * tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings sched_ext: Use kvzalloc for large exit_dump allocation
2025-04-21Merge tag 'cgroup-for-6.15-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix compilation in CONFIG_LOCKDEP && !CONFIG_PROVE_RCU configurations - Allow "cpuset_v2_mode" mount option for "cpuset" filesystem type to make life easier for android * tag 'cgroup-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset-v1: Add missing support for cpuset_v2_mode cgroup: Fix compilation issue due to cgroup_mutex not being exported
2025-04-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-04-17 We've added 12 non-merge commits during the last 9 day(s) which contain a total of 18 files changed, 1748 insertions(+), 19 deletions(-). The main changes are: 1) bpf qdisc support, from Amery Hung. A qdisc can be implemented in bpf struct_ops programs and can be used the same as other existing qdiscs in the "tc qdisc" command. 2) Add xsk tail adjustment tests, from Tushar Vyavahare. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Test attaching bpf qdisc to mq and non root selftests/bpf: Add a bpf fq qdisc to selftest selftests/bpf: Add a basic fifo qdisc test libbpf: Support creating and destroying qdisc bpf: net_sched: Disable attaching bpf qdisc to non root bpf: net_sched: Support updating bstats bpf: net_sched: Add a qdisc watchdog timer bpf: net_sched: Add basic bpf qdisc kfuncs bpf: net_sched: Support implementation of Qdisc_ops in bpf bpf: Prepare to reuse get_ctx_arg_idx selftests/xsk: Add tail adjustment tests and support check selftests/xsk: Add packet stream replacement function ==================== Link: https://patch.msgid.link/20250417184338.3152168-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21cgroup: fix pointer check in css_rstat_init()JP Kobryn
In css_rstat_init() allocations are done for the cgroup's pointers rstat_cpu and rstat_base_cpu. Make sure the allocation checks are consistent with what they are allocating. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3Alexei Starovoitov
Cross-merge bpf and other fixes after downstream PRs. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-19Merge tag 'vfs-6.15-rc3.fixes.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Revert the hfs{plus} deprecation warning that's also included in this pull request. The commit introducing the deprecation warning resides rather early in this branch. So simply dropping it would've rebased all other commits which I decided to avoid. Hence the revert in the same branch [ Background - the deprecation warning discussion resulted in people stepping up, and so hfs{plus} will have a maintainer taking care of it after all.. - Linus ] - Switch CONFIG_SYSFS_SYCALL default to n and decouple from CONFIG_EXPERT - Fix an audit bug caused by changes to our kernel path lookup helpers this cycle. Audit needs the parent path even if the dentry it tried to look up is negative - Ensure that the kernel path lookup helpers leave the passed in path argument clean when they return an error. This is consistent with all our other helpers - Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant information is available to kernel consumers as well - Don't set a timer and call schedule() if the timer will expire immediately in epoll - Make netfs lookup tables with __nonstring * tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: Revert "hfs{plus}: add deprecation warning" fs: move the bdex_statx call to vfs_getattr_nosec netfs: Mark __nonstring lookup tables eventpoll: Set epoll timeout if it's in the future fs: ensure that *path_locked*() helpers leave passed path pristine fs: add kern_path_locked_negative() hfs{plus}: add deprecation warning Kconfig: switch CONFIG_SYSFS_SYCALL default to n
2025-04-19Merge tag 'trace-v6.15-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Initialize hash variables in ftrace subops logic The fix that simplified the ftrace subops logic opened a path where some variables could be used without being initialized, and done subtly where the compiler did not catch it. Initialize those variables to the EMPTY_HASH, which is the default hash. - Reinitialize the hash pointers after they are freed Some of the hash pointers in the subop logic were freed but may still be referenced later. To prevent use-after-free bugs, initialize them back to the EMPTY_HASH. - Free the ftrace hashes when they are replaced The fix that simplified the subops logic updated some hash pointers, but left the original hash that they were pointing to where they are no longer used. This caused a memory leak. Free the hashes that are pointed to by the pointers when they are replaced. - Fix size initialization of ftrace direct function hash The ftrace direct function hash used by BPF initialized the hash size incorrectly. It checked the size of items to a hard coded 32, which made the hash bit size of 5. The hash size is supposed to be limited by the bit size of the hash, as the bitmask is allowed to be greater than 5. Rework the size check to first pass the number of elements to fls() and then compare that to FTRACE_HASH_MAX_BITS before allocating the hash. - Fix format output of ftrace_graph_ent_entry event The field depth of the ftrace_graph_ent_entry event is of size 4 but the output showed it as unsigned long and use "%lu". Change it to unsigned int and use "%u" in the print format that is displayed to user space. - Fix the trace event filter on strings Events can be filtered on numbers or string values. The return value checked from strncpy_from_kernel_nofault() and strncpy_from_user_nofault() was used to determine if reading the strings would fault or not. It would return fault if the value was non zero, which is basically meant that it was always considering the read as a fault. - Add selftest to test trace event string filtering In order to catch the breakage of the string filtering, add a self test to make sure that it continues to work. * tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: selftests: Add testing a user string to filters tracing: Fix filter string testing ftrace: Fix type of ftrace_graph_ent_entry.depth ftrace: fix incorrect hash size in register_ftrace_direct() ftrace: Free ftrace hashes after they are replaced in the subops code ftrace: Reinitialize hash to EMPTY_HASH after freeing ftrace: Initialize variables for ftrace_startup/shutdown_subops()
2025-04-18sched_ext: add helper for refill task with default sliceHonglei Wang
Add helper for refilling task with default slice and event statistics accordingly. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-18sched_ext: change the variable name for slice refill eventHonglei Wang
SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs when the tasks were enqueued, which seems not accurate. What it actually means is the refilling with defalt slice, and this can occur either when enqueue or pick_task. Let's change the variable to SCX_EV_REFILL_SLICE_DFL. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>