|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
performing slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionally making the
huge_zero_page persistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap, ending the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN write-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a write-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improve
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in the relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implements reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khugepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
- Address the inconsistent shutdown sequence of per CPU clockevents on
CPU hotplug, which only removed it from the core but failed to invoke
the actual device driver shutdown callback. This kept the timer
active, which prevented power savings and caused pointless noise in
virtualization.
- Encapsulate the open coded access to the hrtimer clock base, which is
a private implementation detail, so that the implementation can be
changed without breaking a lot of usage sites.
- Enhance the debug output of the clocksource watchdog to provide
better information for analysis.
- The usual set of cleanups and enhancements all over the place
* tag 'timers-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Fix spelling mistakes in comments
clocksource: Print durations for sync check unconditionally
LoongArch: Remove clockevents shutdown call on offlining
tick: Do not set device to detached state in tick_shutdown()
hrtimer: Reorder branches in hrtimer_clockid_to_base()
hrtimer: Remove hrtimer_clock_base::get_time
hrtimer: Use hrtimer_cb_get_time() helper
media: pwm-ir-tx: Avoid direct access to hrtimer clockbase
ALSA: hrtimer: Avoid direct access to hrtimer clockbase
lib: test_objpool: Avoid direct access to hrtimer clockbase
sched/core: Avoid direct access to hrtimer clockbase
timers/itimer: Avoid direct access to hrtimer clockbase
posix-timers: Avoid direct access to hrtimer clockbase
jiffies: Remove obsolete SHIFTED_HZ comment
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Core scheduler changes:
- Make migrate_{en,dis}able() inline, to improve performance
(Menglong Dong)
- Move STDL_INIT() functions out-of-line (Peter Zijlstra)
- Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)
Fair scheduling:
- Defer throttling to when tasks exit to user-space, to reduce the
chance & impact of throttle-preemption with held locks and other
resources (Aaron Lu, Valentin Schneider)
- Get rid of sched_domains_curr_level hack for tl->cpumask(), as the
warning was getting triggered on certain topologies (Peter
Zijlstra)
Misc cleanups & fixes:
- Header cleanups (Menglong Dong)
- Fix race in push_dl_task() (Harshit Agarwal)"
* tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix some typos in include/linux/preempt.h
sched: Make migrate_{en,dis}able() inline
rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h
arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c
sched/fair: Do not balance task to a throttled cfs_rq
sched/fair: Do not special case tasks in throttled hierarchy
sched/fair: update_cfs_group() for throttled cfs_rqs
sched/fair: Propagate load for throttled cfs_rq
sched/fair: Get rid of throttled_lb_pair()
sched/fair: Task based throttle time accounting
sched/fair: Switch to task based throttle model
sched/fair: Implement throttle task work and related helpers
sched/fair: Add related data structure for task based throttle
sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig
sched: Move STDL_INIT() functions out-of-line
sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
sched/deadline: Fix race in push_dl_task()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- Code organization cleanup. Separate internal types and accessors to
ext_internal.h to reduce the size of ext.c and improve
maintainability.
- Prepare for cgroup sub-scheduler support by adding @sch parameter to
various functions and helpers, reorganizing scheduler instance
handling, and dropping obsolete helpers like scx_kf_exit() and
kf_cpu_valid().
- Add new scx_bpf_cpu_curr() and scx_bpf_locked_rq() BPF helpers to
provide safer access patterns with proper RCU protection.
scx_bpf_cpu_rq() is deprecated with warnings due to potential race
conditions.
- Improve debugging with migration-disabled counter in error state
dumps, SCX_EFLAG_INITIALIZED flag, bitfields for warning flags, and
other enhancements to help diagnose issues.
- Use cgroup_lock/unlock() for cgroup synchronization instead of
scx_cgroup_rwsem based synchronization. This is simpler and allows
enable/disable paths to synchronize against cgroup changes
independent of the CPU controller.
- rhashtable_lookup() replacement to avoid redundant RCU locking was
reverted due to RCU usage warnings. Will be redone once rhashtable is
updated to use rcu_dereference_all().
- Other misc updates and fixes including bypass handling improvements,
scx_task_iter_relock() improvements, tools/sched_ext updates, and
compatibility helpers.
* tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (28 commits)
Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()"
sched_ext: Misc updates around scx_sched instance pointer
sched_ext: Drop scx_kf_exit() and scx_kf_error()
sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit()
sched_ext: Drop kf_cpu_valid()
sched_ext: Add the @sch parameter to ext_idle helpers
sched_ext: Add the @sch parameter to __bstr_format()
sched_ext: Separate out scx_kick_cpu() and add @sch to it
tools/sched_ext: scx_qmap: Make debug output quieter by default
sched_ext: Make qmap dump operation non-destructive
sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()
sched_ext: Use bitfields for boolean warning flags
sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq()
sched_ext: Improve SCX_KF_DISPATCH comment
sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
sched_ext: Verify RCU protection in scx_bpf_cpu_curr()
sched_ext: Add migration-disabled counter to error state dump
sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning
tools/sched_ext: Add compat helper for scx_bpf_cpu_curr()
sched_ext: deprecation warn for scx_bpf_cpu_rq()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull copy_process updates from Christian Brauner:
"This contains the changes to enable support for clone3() on nios2
which apparently is still a thing.
The more exciting part of this is that it cleans up the inconsistency
in how the 64-bit flag argument is passed from copy_process() into the
various other copy_*() helpers"
[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]
* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nios2: implement architecture-specific portion of sys_clone3
arch: copy_thread: pass clone_flags as u64
copy_process: pass clone_flags as u64 across calltree
copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix two dl_server regressions: a race that can end up leaving the
dl_server stuck, and a dl_server throttling bug causing lag to fair
tasks"
* tag 'sched-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix dl_server behaviour
sched/deadline: Fix dl_server getting stuck
|
|
For now, migrate_enable and migrate_disable are global, which makes them
become hotspots in some cases. Take BPF for example: the calls to
migrate_enable and migrate_disable in the BPF trampoline can introduce
significant overhead, and the following is the 'perf top' output of the
FENTRY benchmark (./tools/testing/selftests/bpf/bench trig-fentry):
54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k]
bpf_prog_2dcccf652aac1793_bench_trigger_fentry
10.43% [kernel] [k] migrate_enable
10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037
8.06% [kernel] [k] __bpf_prog_exit_recur
4.11% libc.so.6 [.] syscall
2.15% [kernel] [k] entry_SYSCALL_64
1.48% [kernel] [k] memchr_inv
1.32% [kernel] [k] fput
1.16% [kernel] [k] _copy_to_user
0.73% [kernel] [k] bpf_prog_test_run_raw_tp
So in this commit, we make migrate_enable/migrate_disable inline to obtain
better performance. The struct rq is defined internally in
kernel/sched/sched.h, and the field "nr_pinned" is accessed in
migrate_enable/migrate_disable, which makes it hard to make them inline.
Alexei Starovoitov suggested generating the offset of "nr_pinned" in [1],
so we can define migrate_enable/migrate_disable in
include/linux/sched.h and access "this_rq()->nr_pinned" with
"(void *)this_rq() + RQ_nr_pinned".
The offset of "nr_pinned" is generated in include/generated/rq-offsets.h
by kernel/sched/rq-offsets.c.
Generally speaking, we move the definition of migrate_enable and
migrate_disable to include/linux/sched.h from kernel/sched/core.c. The
call to __set_cpus_allowed_ptr() is left in ___migrate_enable().
The "struct rq" is not available in include/linux/sched.h, so we can't
access the "runqueues" with this_cpu_ptr(), as the compilation will fail
in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr():
typeof((ptr) + 0)
So we introduce this_rq_raw() and access the runqueues directly with
arch_raw_cpu_ptr()/PERCPU_PTR().
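As a rough sketch of the mechanism described above (RQ_nr_pinned and
this_rq_raw() are the names used in this message; everything else is
illustrative and the in-tree files may differ in detail), the generator
and the resulting access pattern look roughly like:

  /* kernel/sched/rq-offsets.c (sketch): run through the usual
   * asm-offsets machinery so that RQ_nr_pinned ends up in
   * include/generated/rq-offsets.h */
  #include <linux/kbuild.h>
  #include "sched.h"

  int main(void)
  {
          DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));
          return 0;
  }

  /* include/linux/sched.h (sketch): struct rq is opaque here, so the
   * field is reached via the generated byte offset instead of a member
   * access, e.g. on the inlined migrate_disable() fast path: */
  (*(unsigned int *)((void *)this_rq_raw() + RQ_nr_pinned))++;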
The variable "runqueues" is not visible in the kernel modules, and export
it is not a good idea. As Peter Zijlstra advised in [2], we define and
export migrate_enable/migrate_disable in kernel/sched/core.c too, and use
them for the modules.
Before this patch, the performance of BPF FENTRY is:
fentry : 113.030 ± 0.149M/s
fentry : 112.501 ± 0.187M/s
fentry : 112.828 ± 0.267M/s
fentry : 115.287 ± 0.241M/s
After this patch, the performance of BPF FENTRY increases to:
fentry : 143.644 ± 0.670M/s
fentry : 149.764 ± 0.362M/s
fentry : 149.642 ± 0.156M/s
fentry : 145.263 ± 0.221M/s
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]
|
|
John reported undesirable behaviour with the dl_server since commit:
cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling").
When starving fair tasks on purpose (starting spinning FIFO tasks),
his fair workload, which often goes (briefly) idle, would see its fair
invocations delayed by a second; running one invocation per second was
both unexpected and terribly slow.
The reason this happens is that when dl_se->server_pick_task() returns
NULL, indicating no runnable tasks, it would yield, pushing any later
jobs out a whole period (1 second).
Instead simply stop the server. This should restore behaviour in that
a later wakeup (which restarts the server) will be able to continue
running (subject to the CBS wakeup rules).
Notably, this does not re-introduce the behaviour cccb45d7c4295 set
out to solve: any start/stop cycle is naturally throttled by the timer
period (no active cancel).
Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
|
|
John found it was easy to hit lockup warnings when running locktorture
on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
("sched/deadline: Less agressive dl_server handling").
While debugging it seems there is a chance where we end up with the
dl_server dequeued but with dl_se->dl_server_active still set. This causes
dl_server_start() to return without enqueueing the dl_server, thus it
fails to run when RT tasks starve the CPU.
When this happens, dl_server_timer() catches the
'!dl_se->server_has_tasks(dl_se)' case, which then calls
replenish_dl_entity() and dl_server_stopped() and finally returns
HRTIMER_NO_RESTART.
This ends with no new timer and no enqueue, leaving the dl_server
'dead', allowing starvation.
What should have happened is for the bandwidth timer to start the
zero-laxity timer, which in turn would enqueue the dl_server and cause
dl_se->server_pick_task() to be called -- which will stop the
dl_server if no fair tasks are observed for a whole period.
IOW, it is totally irrelevant if there are fair tasks at the moment of
bandwidth refresh.
This removes all dl_se->server_has_tasks() users, so remove the whole
thing.
Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
|
|
This reverts commit c8191ee8e64a8c5c021a34e32868f2380965e82b which triggers
the following suspicious RCU usage warning:
[ 6.647598] =============================
[ 6.647603] WARNING: suspicious RCU usage
[ 6.647605] 6.17.0-rc7-virtme #1 Not tainted
[ 6.647608] -----------------------------
[ 6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage!
[ 6.647610]
[ 6.647610] other info that might help us debug this:
[ 6.647610]
[ 6.647612]
[ 6.647612] rcu_scheduler_active = 2, debug_locks = 1
[ 6.647613] 1 lock held by swapper/10/0:
[ 6.647614] #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[ 6.647630]
[ 6.647630] stack backtrace:
[ 6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1 PREEMPT(full)
[ 6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all)
[ 6.647648] Call Trace:
[ 6.647652] <IRQ>
[ 6.647655] dump_stack_lvl+0x78/0xe0
[ 6.647665] lockdep_rcu_suspicious+0x14a/0x1b0
[ 6.647672] __rhashtable_lookup.constprop.0+0x1d5/0x250
[ 6.647680] find_dsq_for_dispatch+0xbc/0x190
[ 6.647684] do_enqueue_task+0x25b/0x550
[ 6.647689] enqueue_task_scx+0x21d/0x360
[ 6.647692] ? trace_lock_acquire+0x22/0xb0
[ 6.647695] enqueue_task+0x2e/0xd0
[ 6.647698] ttwu_do_activate+0xa2/0x290
[ 6.647703] sched_ttwu_pending+0xfd/0x250
[ 6.647706] __flush_smp_call_function_queue+0x1cd/0x610
[ 6.647714] __sysvec_call_function_single+0x34/0x150
[ 6.647720] sysvec_call_function_single+0x6e/0x80
[ 6.647726] </IRQ>
[ 6.647726] <TASK>
[ 6.647727] asm_sysvec_call_function_single+0x1a/0x20
Reported-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Pull sched_ext/for-6.17-fixes to receive:
55ed11b181c4 ("sched_ext: idle: Handle migration-disabled tasks in BPF code")
which conflicts with the following commit in for-6.18:
2407bae23d1e ("sched_ext: Add the @sch parameter to ext_idle helpers")
The conflict is a simple context conflict which can be resolved by taking
the updated parts from both commits.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In preparation for multiple scheduler support:
- Add the @sch parameter to find_global_dsq() and refill_task_slice_dfl().
- Restructure scx_allow_ttwu_queue() and make it read scx_root into $sch.
- Make RCU protection in scx_dsq_move() and scx_bpf_dsq_move_to_local()
explicit.
v2: Add scx_root -> sch conversion in scx_allow_ttwu_queue().
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The intention behind scx_kf_exit/error() was that when called from kfuncs,
scx_kf_exit/error() would be able to implicitly determine the scx_sched
instance being operated on and thus wouldn't need the @sch parameter passed
in explicitly. This turned out to be unnecessarily complicated to implement
and did not have enough practical benefits. Replace scx_kf_exit/error() usages
with scx_exit/error() which take an explicit @sch parameter.
- Add the @sch parameter to scx_kf_allowed(), scx_kf_allowed_on_arg_tasks,
mark_direct_dispatch() and other intermediate functions transitively.
- In callers that don't already have @sch available, grab RCU, read
$scx_root, verify it's not NULL and use it.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In preparation for multiple scheduler support, add the @sch parameter to
scx_dsq_insert_preamble/commit() and update the callers to read $scx_root
and pass it in. The passed in @sch parameter is not used yet.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The intention behind kf_cpu_valid() was that when called from kfuncs,
kf_cpu_valid() would be able to implicitly determine the scx_sched instance
being operated on and thus wouldn't need @sch passed in explicitly. This
turned out to be unnecessarily complicated to implement and did not have
justifiable practical benefits. Replace kf_cpu_valid() usages with
ops_cpu_valid() which takes explicit @sch.
Callers which don't have $sch available in the context are updated to read
$scx_root under RCU read lock, verify that it's not NULL and pass it in.
scx_bpf_cpu_rq() is restructured to use guard(rcu)() instead of explicit
rcu_read_[un]lock().
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In preparation for multiple scheduler support, add the @sch parameter to
validate_node(), check_builtin_idle_enabled() and select_cpu_from_kfunc(),
and update their callers to read $scx_root, verify that it's not NULL and
pass it in. The passed in @sch parameter is not used yet.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In preparation for multiple scheduler support, add the @sch parameter to
__bstr_format() and update the callers to read $scx_root, verify that it's
not NULL and pass it in. The passed in @sch parameter is not used yet.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In preparation for multiple scheduler support, separate out scx_kick_cpu()
from scx_bpf_kick_cpu() and add the @sch parameter to it. scx_bpf_kick_cpu()
now acquires an RCU read lock, reads $scx_root, and calls scx_kick_cpu()
with it if non-NULL. The passed in @sch parameter is not used yet.
Internal uses of scx_bpf_kick_cpu() are converted to scx_kick_cpu(). Where
$sch is available, it's used. In the pick_task_scx() path where no
associated scheduler can be identified, $scx_root is used directly. Note
that $scx_root cannot be NULL in this case.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
ops.exit() may be called even if the loading failed before ops.init()
finishes successfully. This is because ops.exit() allows rich exit info
communication. Add SCX_EFLAG_INITIALIZED flag to scx_exit_info.flags to
indicate whether ops.init() finished successfully.
This enables BPF schedulers to distinguish between exit scenarios and
handle cleanup appropriately based on initialization state.
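For illustration only (not part of the patch; assumes the usual
tools/sched_ext BPF scaffolding), a scheduler's ops.exit() could use the
flag roughly like this:

  void BPF_STRUCT_OPS(example_exit, struct scx_exit_info *ei)
  {
          if (!(ei->flags & SCX_EFLAG_INITIALIZED)) {
                  /* ops.init() never completed: skip teardown of state
                   * that is only set up at the end of init */
                  return;
          }
          /* normal cleanup path */
  }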
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Convert warned_zero_slice and warned_deprecated_rq in scx_sched struct to
single-bit bitfields. While this doesn't reduce struct size immediately,
it prepares for future bitfield additions.
v2: Update patch description.
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
task_can_run_on_remote_rq() takes @sch but it is using scx_root when
incrementing SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which is inconsistent and
gets in the way of implementing multiple scheduler support. Use @sch
instead. As currently scx_root is the only possible scheduler instance, this
doesn't cause any behavior changes.
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The find_user_dsq() function is called from contexts that are already
under RCU read lock protection. Switch from rhashtable_lookup_fast() to
rhashtable_lookup() to avoid redundant RCU locking.
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_bpf_cpu_curr() has been introduced to retrieve the current task of a
given runqueue, allowing schedulers to interact with that task.
The kfunc assumes that it is always called in an RCU context, but this
is not always guaranteed and some BPF schedulers can trigger the
following warning:
WARNING: suspicious RCU usage
sched_ext: BPF scheduler "cosmos_1.0.2_gd0e71ca_x86_64_unknown_linux_gnu_debug" enabled
6.17.0-rc1 #1-NixOS Not tainted
-----------------------------
kernel/sched/ext.c:6415 suspicious rcu_dereference_check() usage!
...
Call Trace:
<IRQ>
dump_stack_lvl+0x6f/0xb0
lockdep_rcu_suspicious.cold+0x4e/0x96
scx_bpf_cpu_curr+0x7e/0x80
bpf_prog_c68b2b6b6b1b0ff8_sched_timerfn+0xce/0x1dc
bpf_timer_cb+0x7b/0x130
__hrtimer_run_queues+0x1ea/0x380
hrtimer_run_softirq+0x8c/0xd0
handle_softirqs+0xc9/0x3b0
__irq_exit_rcu+0x96/0xc0
irq_exit_rcu+0xe/0x20
sysvec_apic_timer_interrupt+0x73/0x80
</IRQ>
<TASK>
To address this, mark the kfunc with KF_RCU_PROTECTED, so the verifier
can enforce its usage only inside RCU-protected sections.
Note: this also requires commit 1512231b6cc86 ("bpf: Enforce RCU protection
for KF_RCU_PROTECTED"), currently in bpf-next, for KF_RCU_PROTECTED to
actually be enforced.
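On the registration side this is the standard kfunc-flag pattern; a
sketch (the exact BTF ID set used in kernel/sched/ext.c is an assumption
here):

  BTF_KFUNCS_START(scx_kfunc_ids_any)
  /* ... other kfuncs ... */
  BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_RCU_PROTECTED)
  /* ... */
  BTF_KFUNCS_END(scx_kfunc_ids_any)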
Fixes: 20b158094a1ad ("sched_ext: Introduce scx_bpf_cpu_curr()")
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fix from Tejun Heo:
"This contains a fix for sched_ext idle CPU selection that likely fixes
a substantial performance regression.
The scx_bpf_select_cpu_dfl/and() kfuncs were incorrectly detecting all
tasks as migration-disabled when called outside ops.select_cpu(),
causing them to always return -EBUSY instead of finding idle CPUs.
The fix properly distinguishes between genuinely migration-disabled
tasks vs. the current task whose migration is temporarily disabled by
BPF execution"
* tag 'sched_ext-for-6.17-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: idle: Handle migration-disabled tasks in BPF code
|
|
When scx_bpf_select_cpu_dfl()/and() kfuncs are invoked outside of
ops.select_cpu() we can't rely on @p->migration_disabled to determine if
migration is disabled for the task @p.
In fact, migration is always disabled for the current task while running
BPF code: __bpf_prog_enter() disables migration and __bpf_prog_exit()
re-enables it.
To handle this, when @p->migration_disabled == 1, check whether @p is
the current task. If so, migration was not disabled before entering the
callback; otherwise, migration was disabled.
This ensures correct idle CPU selection in all cases. The behavior of
ops.select_cpu() remains unchanged, because this callback is never
invoked for the current task and migration-disabled tasks are always
excluded.
Example: without this change scx_bpf_select_cpu_and() called from
ops.enqueue() always returns -EBUSY; with this change applied, it
correctly returns idle CPUs.
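A simplified sketch of the check described above (the helper name is
illustrative, not the actual one used in ext_idle.c):

  static bool task_migration_really_disabled(const struct task_struct *p)
  {
          if (!p->migration_disabled)
                  return false;
          /*
           * For the current task, a count of 1 only reflects the
           * implicit migrate_disable() done by __bpf_prog_enter();
           * migration was not disabled before entering the callback.
           */
          if (p == current && p->migration_disabled == 1)
                  return false;
          return true;
  }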
Fixes: 06efc9fe0b8de ("sched_ext: idle: Handle migration-disabled tasks in idle selection")
Cc: stable@vger.kernel.org # v6.16+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Include the task's migration-disabled counter when dumping task state
during an error exit.
This can help diagnose cases where tasks can get stuck, because they're
unable to migrate elsewhere.
tj: s/nomig/no_mig/ for readability and consistency with other keys.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED
- Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
which was causing issues with per-CPU task scheduling and reenqueuing
behavior
* tag 'sched_ext-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED
Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
|
|
While collecting SCX related fields in struct task_group into struct
scx_task_group, 6e6558a6bc41 ("sched_ext, sched/core: Factor out struct
scx_task_group") forgot update tg->scx_weight usage in tg_weight(), which
leads to build failure when CONFIG_FAIR_GROUP_SCHED is disabled but
CONFIG_EXT_GROUP_SCHED is enabled. Fix it.
Fixes: 6e6558a6bc41 ("sched_ext, sched/core: Factor out struct scx_task_group")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509170230.MwZsJSWa-lkp@intel.com/
Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
CPU is taken by a higher scheduling class to give tasks queued to the
CPU's local DSQ a chance to be migrated somewhere else, instead of
waiting indefinitely for that CPU to become available again.
In doing so, we decided to skip migration-disabled tasks, under the
assumption that they cannot be migrated anyway.
However, when a higher scheduling class preempts a CPU, the running task
is always inserted at the head of the local DSQ as a migration-disabled
task. This means it is always skipped by scx_bpf_reenqueue_local(), and
ends up being confined to the same CPU even if that CPU is heavily
contended by other higher scheduling class tasks.
As an example, let's consider the following scenario:
$ schedtool -a 0,1, -e yes > /dev/null
$ sudo schedtool -F -p 99 -a 0, -e \
stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000
The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
(SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
task initially runs on CPU0, it will remain there because it always sees
CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
utilization while CPU1 stays idle:
0[||||||||||||||||||||||100.0%] 8[ 0.0%]
1[ 0.0%] 9[ 0.0%]
2[ 0.0%] 10[ 0.0%]
3[ 0.0%] 11[ 0.0%]
4[ 0.0%] 12[ 0.0%]
5[ 0.0%] 13[ 0.0%]
6[ 0.0%] 14[ 0.0%]
7[ 0.0%] 15[ 0.0%]
PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command
1067 root RT 0 R 0 99.0 0.2 0:31.16 stress-ng-cpu [run]
975 arighi 20 0 R 0 1.0 0.0 0:26.32 yes
By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
this case) via ops.enqueue(), leading to better CPU utilization:
0[||||||||||||||||||||||100.0%] 8[ 0.0%]
1[||||||||||||||||||||||100.0%] 9[ 0.0%]
2[ 0.0%] 10[ 0.0%]
3[ 0.0%] 11[ 0.0%]
4[ 0.0%] 12[ 0.0%]
5[ 0.0%] 13[ 0.0%]
6[ 0.0%] 14[ 0.0%]
7[ 0.0%] 15[ 0.0%]
PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command
577 root RT 0 R 0 100.0 0.2 0:23.17 stress-ng-cpu [run]
555 arighi 20 0 R 1 100.0 0.0 0:28.67 yes
It's debatable whether per-CPU tasks should be re-enqueued as well, but
doing so is probably safer: the scheduler can recognize re-enqueued
tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
either put them back at the head of the local DSQ or let another task
attempt to take the CPU.
This also prevents giving per-CPU tasks an implicit priority boost,
which would otherwise make them more likely to reclaim CPUs preempted by
higher scheduling classes.
Fixes: 97e13ecb02668 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When doing load balance and the target cfs_rq is in a throttled hierarchy,
whether to allow balancing there is a question.
The good side of allowing balancing is: if the target CPU is idle or less
loaded and the task being balanced is holding some kernel resources,
then it seems a good idea to balance the task there and let the task get
the CPU earlier and release kernel resources sooner. The bad part is: if
the task is not holding any kernel resources, then the balance seems not
that useful.
While theoretically it's debatable, a performance test[0] involving
200 cgroups, each running hackbench (20 senders, 20 receivers) in
pipe mode, showed a performance degradation on AMD Genoa when allowing
load balance to throttled cfs_rqs. Analysis[1] showed hackbench doesn't
like task migration across the LLC boundary. For this reason, add a check
in can_migrate_task() to forbid balancing to a cfs_rq that is in a
throttled hierarchy. This reduced task migration a lot and restored
performance.
[0]: https://lore.kernel.org/lkml/20250822110701.GB289@bytedance/
[1]: https://lore.kernel.org/lkml/20250903101102.GB42@bytedance/
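A minimal sketch of the check (illustrative, not the exact hunk; the
destination cfs_rq lookup in particular is an assumption):

  /* in can_migrate_task(): don't balance a task into a throttled
   * hierarchy on the destination CPU */
  if (throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
          return 0;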
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
|
|
With the introduction of the task based throttle model, a task in a
throttled hierarchy is allowed to continue to run till it gets throttled
on its ret2user path.
For this reason, remove those throttled_hierarchy() checks in the
following functions so that those tasks can get their turn as normal
tasks: dequeue_entities(), check_preempt_wakeup_fair() and
yield_to_task_fair().
The benefit of doing it this way is: if those tasks get the chance to
run earlier and if they hold any kernel resources, they can release
those resources earlier. The downside is, if they don't hold any kernel
resources, all they can do is throttle themselves on their way back to
user space, so the favor of letting them run seems not that useful; and
for check_preempt_wakeup_fair(), that favor may be bad for curr.
K Prateek Nayak pointed out that prio_changed_fair() can send a throttled
task to check_preempt_wakeup_fair(); further tests showed the affinity
change path from move_queued_task() can also send a throttled task to
check_preempt_wakeup_fair(), which is why that function keeps a
task_is_throttled() check.
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
With task based throttle model, tasks in a throttled hierarchy are
allowed to continue to run if they are running in kernel mode. For this
reason, PELT clock is not stopped for these cfs_rqs in throttled
hierarchy when they still have tasks running or queued.
Since the PELT clock is not stopped, whether to allow update_cfs_group()
to do its job for cfs_rqs which are in a throttled hierarchy but still
have tasks running/queued is a question.
The good side is, continuing to run update_cfs_group() keeps these
cfs_rq entities' weights up to date, and an up-to-date weight can be
useful to derive an accurate load for the CPU as well as ensure fairness
if multiple tasks of different cgroups are running on the same CPU.
OTOH, as Benjamin Segall pointed out: when unthrottle comes around the most
likely correct distribution is the distribution we had at the time of
throttle.
In reality, either way may not matter that much if tasks in throttled
hierarchy don't run in kernel mode for too long. But in case that
happens, letting these cfs_rq entities have an up-to-date weight seems a
good thing to do.
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Before the task based throttle model, load propagation would stop at a
throttled cfs_rq and the propagation would happen at unthrottle time via
update_load_avg().
Now that there is no update_load_avg() on unthrottle for throttled
cfs_rq and all load tracking is done by task related operations, let the
propagation happen immediately.
While at it, add a comment to explain why cfs_rqs that are not affected
by throttle have to be added to leaf cfs_rq list in
propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
("sched/fair: Fix unfairness caused by missing load decay").
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
|
|
Replace repeated (20 - PAGE_SHIFT) calculations with standard macros:
- MB_TO_PAGES(mb) converts MB to page count
- PAGES_TO_MB(pages) converts pages to MB
No functional change.
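For reference, a sketch of what such helpers boil down to (the in-tree
definitions may differ in detail):

  #define MB_TO_PAGES(mb)    ((mb) << (20 - PAGE_SHIFT))
  #define PAGES_TO_MB(pages) ((pages) >> (20 - PAGE_SHIFT))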
[akpm@linux-foundation.org: remove arc's private PAGES_TO_MB, remove its unused PAGES_TO_KB]
[akpm@linux-foundation.org: don't include mm.h due to include file ordering mess]
Link: https://lkml.kernel.org/r/20250718024134.1304745-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lai jiangshan <jiangshanlai@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Goto-san reported confusing pgpromote statistics where the
pgpromote_success count significantly exceeded pgpromote_candidate.
On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid # terminate the 1st memhog
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing
After a few seconds, we observed `pgpromote_candidate < pgpromote_success`
$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 0
In this scenario, after terminating the first memhog, the conditions for
pgdat_free_space_enough() are quickly met, which triggers promotion.
However, these migrated pages are only counted in PGPROMOTE_SUCCESS,
not in PGPROMOTE_CANDIDATE.
To resolve these confusing statistics, introduce PGPROMOTE_CANDIDATE_NRL
to count these otherwise-missed promotion pages. They are deliberately not
counted into PGPROMOTE_CANDIDATE, to avoid changing the existing algorithm
or performance of the promotion rate limit.
Link: https://lkml.kernel.org/r/20250901090122.124262-1-ruansy.fnst@fujitsu.com
Link: https://lkml.kernel.org/r/20250729035101.1601407-1-ruansy.fnst@fujitsu.com
Fixes: c6833e10008f ("memory tiering: rate limit NUMA migration throughput")
Co-developed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com>
Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.
Switch to the equivalent helper.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-3-3ae822e5bfbd@linutronix.de
|
|
When printing the deprecation warning for scx_bpf_cpu_rq(), we may hit a
NULL pointer dereference if the kfunc is called before a BPF scheduler
is fully attached, for example, when invoked from a BPF timer or during
ops.init():
[ 50.752775] BUG: kernel NULL pointer dereference, address: 0000000000000331
...
[ 50.764205] RIP: 0010:scx_bpf_cpu_rq+0x30/0xa0
...
[ 50.787661] Call Trace:
[ 50.788398] <TASK>
[ 50.789061] bpf_prog_08f7fd2dcb187aaf_wakeup_timerfn+0x75/0x1a8
[ 50.792477] bpf_timer_cb+0x7e/0x140
[ 50.796003] hrtimer_run_softirq+0x91/0xe0
[ 50.796952] handle_softirqs+0xce/0x3c0
[ 50.799087] run_ksoftirqd+0x3e/0x70
[ 50.800197] smpboot_thread_fn+0x133/0x290
[ 50.802320] kthread+0x115/0x220
[ 50.804984] ret_from_fork+0x17a/0x1d0
[ 50.806920] ret_from_fork_asm+0x1a/0x30
[ 50.807799] </TASK>
Fix this by only printing the warning once the scheduler is fully
registered.
Fixes: 5c48d88fe0049 ("sched_ext: deprecation warn for scx_bpf_cpu_rq()")
Cc: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_bpf_cpu_rq() works on an unlocked rq, which generally isn't safe.
For the common use-cases, scx_bpf_locked_rq() and
scx_bpf_cpu_curr() work, so add a deprecation warning
to scx_bpf_cpu_rq() so it can eventually be removed.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr
task of a remote rq without assuming its lock is held.
Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr
(e.g. to see if it should be preempted). This is problematic because
scx_bpf_cpu_rq() provides access to all fields of struct rq, most of
which aren't safe to use without holding the associated rq lock.
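A hedged usage sketch (illustrative BPF-side code, not from the patch;
assumes the usual vmlinux.h/common.bpf.h scaffolding): checking a remote
CPU's current task under RCU, e.g. to decide whether to kick that CPU:

  static bool remote_cpu_runs_idle(s32 cpu)
  {
          struct task_struct *curr;
          bool idle = false;

          bpf_rcu_read_lock();
          curr = scx_bpf_cpu_curr(cpu);
          /* the per-CPU idle task has pid 0 */
          if (curr && curr->pid == 0)
                  idle = true;
          bpf_rcu_read_unlock();

          return idle;
  }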
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Most fields of the rq returned by scx_bpf_cpu_rq() assume that its rq
lock is held, and they become meaningless without it.
Make a safer version of scx_bpf_cpu_rq() that only returns an rq
if we hold the rq lock of that rq.
Also mark the new scx_bpf_locked_rq() as possibly returning NULL, as
scx_bpf_cpu_rq() should have been too.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
SCX hooks into CPU cgroup controller operations and read-locks
scx_cgroup_rwsem to exclude them while enabling and disabling schedulers.
While this works, it's unnecessarily complicated given that
cgroup_[un]lock() are available and thus the cgroup operations can be locked
out that way.
Drop scx_cgroup_rwsem locking from the tg on/offline and cgroup [can_]attach
operations. Instead, grab cgroup_lock() from scx_cgroup_lock(). Drop
scx_cgroup_finish_attach() which is no longer necessary. Drop the now
unnecessary rcu locking and css ref bumping in scx_cgroup_init() and
scx_cgroup_exit().
As scx_cgroup_set_weight/bandwidth() paths aren't protected by
cgroup_lock(), rename scx_cgroup_rwsem to scx_cgroup_ops_rwsem and retain
the locking there.
This is overall simpler and will also allow enable/disable paths to
synchronize against cgroup changes independent of the CPU controller.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_sched.event_stats_cpu is the set of percpu counters used to track
stats. Introduce struct scx_sched_pcpu and move the counters inside. This
will ease adding more per-cpu fields. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
There currently isn't a place for SCX-internal types and accessors to
be shared between ext.c and ext_idle.c. Create kernel/sched/ext_internal.h
and move internal type and accessor definitions there. This trims ext.c a
bit and makes future additions easier. Pure code reorganization. No
functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_enable() turns on the bypass mode while enable is in progress. If
enabling fails, it turns off the bypass mode and then triggers scx_error().
scx_error() will trigger scx_disable_workfn() which will turn on the bypass
mode again and unload the failed scheduler.
This moves the system out of bypass mode between the enable error path and
the disable path, which is unnecessary and can be brittle - e.g. the thread
running scx_enable() may already be on the failed scheduler and can be
switched out before it triggers scx_error() leading to a stall. The watchdog
would eventually kick in, so the situation isn't critical but is still
suboptimal.
There is nothing to be gained by turning off the bypass mode between
scx_enable() failure and scx_disable_workfn(). Keep bypass on.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
During tasks iteration, the locks can be dropped using
scx_task_iter_unlock() to perform e.g. sleepable allocations. Afterwards,
scx_task_iter_relock() has to be called prior to other iteration operations,
which is error-prone. This can be easily automated by tracking whether
scx_tasks_lock is held in scx_task_iter and re-acquiring when necessary. It
already tracks whether the task's rq is locked after all.
- Add scx_task_iter->list_locked which remembers whether scx_tasks_lock is
held.
- Rename scx_task_iter->locked to scx_task_iter->locked_task to better
distinguish it from ->list_locked.
- Replace scx_task_iter_relock() with __scx_task_iter_maybe_relock() which
is automatically called by scx_task_iter_next() and scx_task_iter_stop().
- Drop explicit scx_task_iter_relock() calls.
The resulting behavior should be equivalent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
sched_numa_find_nth_cpu() uses a bsearch to look for the 'closest'
CPU in sched_domains_numa_masks and the given cpus mask. However, they
might not intersect if all CPUs in the cpus mask are offline. bsearch
will return NULL in that case; bail out instead of dereferencing a
bogus pointer.
The previous behaviour led to this bug when using maxcpus=4 on an
rk3399 (LLLLbb) (i.e. booting with all big CPUs offline):
[ 1.422922] Unable to handle kernel paging request at virtual address ffffff8000000000
[ 1.423635] Mem abort info:
[ 1.423889] ESR = 0x0000000096000006
[ 1.424227] EC = 0x25: DABT (current EL), IL = 32 bits
[ 1.424715] SET = 0, FnV = 0
[ 1.424995] EA = 0, S1PTW = 0
[ 1.425279] FSC = 0x06: level 2 translation fault
[ 1.425735] Data abort info:
[ 1.425998] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[ 1.426499] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 1.426952] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 1.427428] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000004a9f000
[ 1.428038] [ffffff8000000000] pgd=18000000f7fff403, p4d=18000000f7fff403, pud=18000000f7fff403, pmd=0000000000000000
[ 1.429014] Internal error: Oops: 0000000096000006 [#1] SMP
[ 1.429525] Modules linked in:
[ 1.429813] CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-rc4-dirty #343 PREEMPT
[ 1.430559] Hardware name: Pine64 RockPro64 v2.1 (DT)
[ 1.431012] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 1.431634] pc : sched_numa_find_nth_cpu+0x2a0/0x488
[ 1.432094] lr : sched_numa_find_nth_cpu+0x284/0x488
[ 1.432543] sp : ffffffc084e1b960
[ 1.432843] x29: ffffffc084e1b960 x28: ffffff80078a8800 x27: ffffffc0846eb1d0
[ 1.433495] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 1.434144] x23: 0000000000000000 x22: fffffffffff7f093 x21: ffffffc081de6378
[ 1.434792] x20: 0000000000000000 x19: 0000000ffff7f093 x18: 00000000ffffffff
[ 1.435441] x17: 3030303866666666 x16: 66663d736b73616d x15: ffffffc104e1b5b7
[ 1.436091] x14: 0000000000000000 x13: ffffffc084712860 x12: 0000000000000372
[ 1.436739] x11: 0000000000000126 x10: ffffffc08476a860 x9 : ffffffc084712860
[ 1.437389] x8 : 00000000ffffefff x7 : ffffffc08476a860 x6 : 0000000000000000
[ 1.438036] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
[ 1.438683] x2 : 0000000000000000 x1 : ffffffc0846eb000 x0 : ffffff8000407b68
[ 1.439332] Call trace:
[ 1.439559] sched_numa_find_nth_cpu+0x2a0/0x488 (P)
[ 1.440016] smp_call_function_any+0xc8/0xd0
[ 1.440416] armv8_pmu_init+0x58/0x27c
[ 1.440770] armv8_cortex_a72_pmu_init+0x20/0x2c
[ 1.441199] arm_pmu_device_probe+0x1e4/0x5e8
[ 1.441603] armv8_pmu_device_probe+0x1c/0x28
[ 1.442007] platform_probe+0x5c/0xac
[ 1.442347] really_probe+0xbc/0x298
[ 1.442683] __driver_probe_device+0x78/0x12c
[ 1.443087] driver_probe_device+0xdc/0x160
[ 1.443475] __driver_attach+0x94/0x19c
[ 1.443833] bus_for_each_dev+0x74/0xd4
[ 1.444190] driver_attach+0x24/0x30
[ 1.444525] bus_add_driver+0xe4/0x208
[ 1.444874] driver_register+0x60/0x128
[ 1.445233] __platform_driver_register+0x24/0x30
[ 1.445662] armv8_pmu_driver_init+0x28/0x4c
[ 1.446059] do_one_initcall+0x44/0x25c
[ 1.446416] kernel_init_freeable+0x1dc/0x3bc
[ 1.446820] kernel_init+0x20/0x1d8
[ 1.447151] ret_from_fork+0x10/0x20
[ 1.447493] Code: 90022e21 f000e5f5 910de2b5 2a1703e2 (f8767803)
[ 1.448040] ---[ end trace 0000000000000000 ]---
[ 1.448483] note: swapper/0[1] exited with preempt_count 1
[ 1.449047] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 1.449741] SMP: stopping secondary CPUs
[ 1.450105] Kernel Offset: disabled
[ 1.450419] CPU features: 0x000000,00080000,20002001,0400421b
[ 1.450935] Memory Limit: none
[ 1.451217] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
Yury: with the fix, the function returns cpu == nr_cpu_ids, and later in
smp_call_function_any ->
smp_call_function_single ->
generic_exec_single
we test the cpu for '>= nr_cpu_ids' and return -ENXIO. So everything is
handled correctly.
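A simplified sketch of the resulting guard (variable and label names
approximate the upstream function and are not the exact hunk):

  hop_masks = bsearch(&k, k.masks, sched_domains_numa_levels,
                      sizeof(k.masks[0]), hop_cmp);
  if (!hop_masks)
          /* no intersection: unwind and return nr_cpu_ids, which
           * callers treat as "no CPU" */
          goto unlock;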
Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
|
|
Now that throttled tasks are dequeued and cannot stay on the rq's cfs_tasks
list, there is no need to take special care of these throttled tasks
anymore in load balance.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-6-ziqianlu@bytedance.com
|
|
With the task based throttle model, the previous way of checking a cfs_rq's
nr_queued to decide whether throttled time should be accounted doesn't work
as expected, e.g. when a cfs_rq which has a single task is throttled,
that task could later block in kernel mode instead of being dequeued onto
the limbo list, and accounting this as throttled time is not accurate.
Rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.
Note that there will be a time gap between when a cfs_rq is throttled
and when a task in its hierarchy is actually throttled. This accounting
mechanism only starts accounting in the latter case.
Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # accounting mechanism
Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com> # simplify implementation
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-5-ziqianlu@bytedance.com
|
|
In the current throttle model, when a cfs_rq is throttled, its entity is
dequeued from the cpu's rq, making tasks attached to it unable to run,
thus achieving the throttle target.
This has a drawback though: assume a task is a reader of a percpu_rwsem
and is waiting. When it gets woken, it cannot run till its task group's
next period comes, which can be a relatively long time. A waiting writer
will have to wait longer due to this, and it also makes further readers
build up and eventually trigger task hung warnings.
To improve this situation, change the throttle model to task based, i.e.
when a cfs_rq is throttled, record its throttled status but do not remove
it from the cpu's rq. Instead, for tasks that belong to this cfs_rq, when
they get picked, add a task work to them so that when they return
to user, they can be dequeued there. In this way, throttled tasks will
not hold any kernel resources. And on unthrottle, enqueue back those
tasks so they can continue to run.
A throttled cfs_rq's PELT clock is handled differently now: previously the
cfs_rq's PELT clock was stopped once it entered the throttled state, but
since tasks (in kernel mode) can now continue to run, change the behaviour
to stop the PELT clock when the throttled cfs_rq has no tasks left.
Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-4-ziqianlu@bytedance.com
|
|
Implement the throttle_cfs_rq_work() task work which gets executed on the
task's ret2user path, where the task is dequeued and marked as throttled.
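A rough sketch of the plumbing this describes (the task_struct field and
setup-helper names are illustrative; only throttle_cfs_rq_work() is named
by the patch):

  static void throttle_cfs_rq_work(struct callback_head *work)
  {
          /* runs on the return-to-user path: dequeue the task and park
           * it on its cfs_rq's limbo list until unthrottle */
  }

  static void task_throttle_setup_work(struct task_struct *p)
  {
          /* normally initialized once per task; shown inline for
           * brevity (sched_throttle_work is an illustrative field) */
          init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
          /* execute on the next return to user space */
          task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
  }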
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-3-ziqianlu@bytedance.com
|