path: root/kernel/sched
Age    Commit message    Author
14 daysMerge tag 'mm-stable-2025-05-31-14-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architectures must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invocation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleanups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscall arguments. At this time we can alter only "system call information that are used by strace system call tampering", namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecessary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. 
- "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. 
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...
2025-05-28Merge tag 'bpf-next-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Fix and improve BTF deduplication of identical BTF types (Alan Maguire and Andrii Nakryiko) - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and Alexis Lothoré) - Support load-acquire and store-release instructions in BPF JIT on riscv64 (Andrea Parri) - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton Protopopov) - Streamline allowed helpers across program types (Feng Yang) - Support atomic update for hashtab of BPF maps (Hou Tao) - Implement json output for BPF helpers (Ihor Solodrai) - Several s390 JIT fixes (Ilya Leoshkevich) - Various sockmap fixes (Jiayuan Chen) - Support mmap of vmlinux BTF data (Lorenz Bauer) - Support BPF rbtree traversal and list peeking (Martin KaFai Lau) - Tests for sockmap/sockhash redirection (Michal Luczaj) - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko) - Add support for dma-buf iterators in BPF (T.J. Mercier) - The verifier support for __bpf_trap() (Yonghong Song) * tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits) bpf, arm64: Remove unused-but-set function and variable. selftests/bpf: Add tests with stack ptr register in conditional jmp bpf: Do not include stack ptr register in precision backtracking bookkeeping selftests/bpf: enable many-args tests for arm64 bpf, arm64: Support up to 12 function arguments bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem() bpf: Avoid __bpf_prog_ret0_warn when jit fails bpftool: Add support for custom BTF path in prog load/loadall selftests/bpf: Add unit tests with __bpf_trap() kfunc bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable bpf: Remove special_kfunc_set from verifier selftests/bpf: Add test for open coded dmabuf_iter selftests/bpf: Add test for dmabuf_iter bpf: Add open coded dmabuf iterator bpf: Add dmabuf iterator dma-buf: Rename debugfs symbols bpf: Fix error return value in bpf_copy_from_user_dynptr libbpf: Use mmap to parse vmlinux BTF from sysfs selftests: bpf: Add a test for mmapable vmlinux BTF btf: Allow mmap of vmlinux btf ...
2025-05-27Merge tag 'sched_ext-for-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - More in-kernel idle CPU selection improvements. Expand topology awareness coverage and add scx_bpf_select_cpu_and() to allow more flexibility. The idle CPU selection kfuncs can now be called from unlocked contexts too. - A bunch of reorganization changes to lay the foundation for multiple hierarchical scheduler support. This isn't ready yet and the included changes don't make meaningful behavior differences. One notable change is replacing some static_key tests with dynamic tests as the test results may differ depending on the scheduler instance. This isn't expected to cause a meaningful performance difference. - Other minor and doc updates. - There were multiple patches in for-6.15-fixes which conflicted with changes in for-6.16. for-6.15-fixes were pulled three times into for-6.16 to resolve the conflicts. * tag 'sched_ext-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (49 commits) sched_ext: Call ops.update_idle() after updating builtin idle bits sched_ext, docs: convert mentions of "CFS" to "fair-class scheduler" selftests/sched_ext: Update test enq_select_cpu_fails sched_ext: idle: Consolidate default idle CPU selection kfuncs selftests/sched_ext: Add test for scx_bpf_select_cpu_and() via test_run sched_ext: idle: Allow scx_bpf_select_cpu_and() from unlocked context sched_ext: idle: Validate locking correctness in scx_bpf_select_cpu_and() sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c sched_ext, docs: add label sched_ext: Explain the temporary situation around scx_root dereferences sched_ext: Add @sch to SCX_CALL_OP*() sched_ext: Cleanup [__]scx_exit/error*() sched_ext: Add @sch to SCX_CALL_OP*() sched_ext: Clean up scx_root usages Documentation: scheduler: Changed lowercase acronyms to uppercase sched_ext: Avoid NULL scx_root deref in __scx_exit() sched_ext: Add RCU protection to scx_root in DSQ iterator sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn() sched_ext: Move disable machinery into scx_sched sched_ext: Move event_stats_cpu into scx_sched ...
2025-05-27Merge tag 'pm-6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "Once again, the changes are dominated by cpufreq updates, but this time the majority of them are cpufreq core changes, mostly related to the introduction of policy locking guards and __free() usage, and fixes related to boost handling. Still, there is also a significant update of the intel_pstate driver making it register an energy model when running on a hybrid platform, which is used for enabling energy-aware scheduling (EAS) if the driver operates in the passive mode (and schedutil is used as the cpufreq governor for all CPUs which is the passive mode default). There are some amd-pstate driver updates too, for good measure, including the "Requested CPU Min frequency" BIOS option support and new online/offline callbacks. In the cpuidle space, the most significant change is the addition of a C1 demotion on/off sysfs knob to intel_idle which should help some users to configure their systems more precisely. There is also the conversion of the PSCI cpuidle driver to a faux device one and there are two small updates of cpuidle governors. Device power management is also modified quite a bit, especially the handling of devices with asynchronous suspend and resume enabled during system transitions. They are now going to be handled more asynchronously during suspend transitions and somewhat less aggressively during resume transitions. Apart from the above, the operating performance points (OPP) library is now going to use mutex locking guards and scope-based cleanup helpers and there is the usual bunch of assorted fixes and code cleanups. Specifics: - Fix potential division-by-zero error in em_compute_costs() (Yaxiong Tian) - Fix typos in energy model documentation and example driver code (Moon Hee Lee, Atul Kumar Pant) - Rearrange the energy model management code and add a new function for adjusting a CPU energy model after adjusting the capacity of the given CPU to it (Rafael Wysocki) - Refactor cpufreq_online(), add and use cpufreq policy locking guards, use __free() in policy reference counting, and clean up core cpufreq code on top of that (Rafael Wysocki) - Fix boost handling on CPU suspend/resume and sysfs updates (Viresh Kumar) - Fix des_perf clamping with max_perf in amd_pstate_update() (Dhananjay Ugwekar) - Add offline, online and suspend callbacks to the amd-pstate driver, rename and use the existing amd_pstate_epp callbacks in it (Dhananjay Ugwekar) - Add support for the "Requested CPU Min frequency" BIOS option to the amd-pstate driver (Dhananjay Ugwekar) - Reset amd-pstate driver mode after running selftests (Swapnil Sapkal) - Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan Chancellor) - Add helper for governor checks to the schedutil cpufreq governor and move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki) - Populate the cpu_capacity sysfs entries from the intel_pstate driver after registering asym capacity support (Ricardo Neri) - Add support for enabling Energy-aware scheduling (EAS) to the intel_pstate driver when operating in the passive mode on a hybrid platform (Rafael Wysocki) - Drop redundant cpus_read_lock() from store_local_boost() in the cpufreq core (Seyediman Seyedarab) - Replace sscanf() with kstrtouint() in the cpufreq code and use a symbol instead of a raw number in it (Bowen Yu) - Add support for autonomous CPU performance state selection to the CPPC cpufreq driver (Lifeng Zheng) - OPP: Add dev_pm_opp_set_level() (Praveen Talari) - 
Introduce scope-based cleanup headers and mutex locking guards in OPP core (Viresh Kumar) - Switch OPP to use kmemdup_array() (Zhang Enpei) - Optimize bucket assignment when next_timer_ns equals KTIME_MAX in the menu cpuidle governor (Zhongqiu Han) - Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla) - Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem Bityutskiy) - Fix typos in two comments in the teo cpuidle governor (Atul Kumar Pant) - Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja Kalla) - Move debug runtime PM attributes to runtime_attrs[] (Rafael Wysocki) - Add new devm_ functions for enabling runtime PM and runtime PM reference counting (Bence Csókás) - Remove size arguments from strscpy() calls in the hibernation core code (Thorsten Blum) - Adjust the handling of devices with asynchronous suspend enabled during system suspend and resume to start resuming them immediately after resuming their parents and to start suspending such a device immediately after suspending its first child (Rafael Wysocki) - Adjust messages printed during tasks freezing to avoid using pr_cont() (Andrew Sayers, Paul Menzel) - Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan Zhang) - Add missing wakeup source attribute relax_count to sysfs and remove the space character at the end of the string produced by pm_show_wakelocks() (Zijun Hu) - Add configurable pm_test delay for hibernation (Zihuan Zhang) - Disable asynchronous suspend in ucsi_ccg_probe() to prevent the cypd4226 device on Tegra boards from suspending prematurely (Jon Hunter) - Unbreak printing PM debug messages during hibernation and clean up some related code (Rafael Wysocki) - Add a systemd service to run cpupower and change cpupower binding's Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli)" * tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits) cpufreq: CPPC: Add support for autonomous selection cpufreq: Update sscanf() to kstrtouint() cpufreq: Replace magic number OPP: switch to use kmemdup_array() PM: freezer: Rewrite restarting tasks log to remove stray *done.* PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn() cpufreq: drop redundant cpus_read_lock() from store_local_boost() cpupower: do not install files to /etc/default/ cpupower: do not call systemctl at install time cpupower: do not write DESTDIR to cpupower.service PM: sleep: Introduce pm_sleep_transition_in_progress() cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver() cpufreq: intel_pstate: Document hybrid processor support cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache cpufreq: intel_pstate: EAS support for hybrid platforms PM: EM: Introduce em_adjust_cpu_capacity() PM: EM: Move CPU capacity check to em_adjust_new_capacity() PM: EM: Documentation: Fix typos in example driver code cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas() PM: sleep: Introduce pm_suspend_in_progress() ...
2025-05-26Merge tag 'sched-core-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core & fair scheduler changes: - Tweak wait_task_inactive() to force dequeue sched_delayed tasks (John Stultz) - Adhere to place_entity() constraints (Peter Zijlstra) - Allow decaying util_est when util_avg > CPU capacity (Pierre Gondois) - Fix up wake_up_sync() vs DELAYED_DEQUEUE (Xuewen Yan) Energy management: - Introduce sched_update_asym_prefer_cpu() (K Prateek Nayak) - cpufreq/amd-pstate: Update asym_prefer_cpu when core rankings change (K Prateek Nayak) - Align uclamp and util_est and call before freq update (Xuewen Yan) CPU isolation: - Make use of more than one housekeeping CPU (Phil Auld) RT scheduler: - Fix race in push_rt_task() (Harshit Agarwal) - Add kernel cmdline option for rt_group_sched (Michal Koutný) Scheduler topology support: - Improve topology_span_sane speed (Steve Wahl) Scheduler debugging: - Move and extend the sched_process_exit() tracepoint (Andrii Nakryiko) - Add RT_GROUP WARN checks for non-root task_groups (Michal Koutný) - Fix trace_sched_switch(.prev_state) (Peter Zijlstra) - Untangle cond_resched() and live-patching (Peter Zijlstra) Fixes and cleanups: - Misc fixes and cleanups (K Prateek Nayak, Michal Koutný, Peter Zijlstra, Xuewen Yan)" * tag 'sched-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits) sched/uclamp: Align uclamp and util_est and call before freq update sched/util_est: Simplify condition for util_est_{en,de}queue() sched/fair: Fixup wake_up_sync() vs DELAYED_DEQUEUE sched,livepatch: Untangle cond_resched() and live-patching sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks sched/fair: Adhere to place_entity() constraints sched/debug: Print the local group's asym_prefer_cpu cpufreq/amd-pstate: Update asym_prefer_cpu when core rankings change sched/topology: Introduce sched_update_asym_prefer_cpu() sched/fair: Use READ_ONCE() to read sg->asym_prefer_cpu sched/isolation: Make use of more than one housekeeping cpu sched/rt: Fix race in push_rt_task sched: Add annotations to RT_GROUP_SCHED fields sched: Add RT_GROUP WARN checks for non-root task_groups sched: Do not construct nor expose RT_GROUP_SCHED structures if disabled sched: Bypass bandwitdh checks with runtime disabled RT_GROUP_SCHED sched: Skip non-root task_groups with disabled RT_GROUP_SCHED sched: Add commadline option for RT_GROUP_SCHED toggling sched: Always initialize rt_rq's task_group sched: Remove unneeed macro wrap ...
2025-05-26Merge branch 'pm-cpufreq'Rafael J. Wysocki
Merge cpufreq updates for 6.16-rc1: - Refactor cpufreq_online(), add and use cpufreq policy locking guards, use __free() in policy reference counting, and clean up core cpufreq code on top of that (Rafael Wysocki). - Fix boost handling on CPU suspend/resume and sysfs updates (Viresh Kumar). - Fix des_perf clamping with max_perf in amd_pstate_update() (Dhananjay Ugwekar). - Add offline, online and suspend callbacks to the amd-pstate driver, rename and use the existing amd_pstate_epp callbacks in it (Dhananjay Ugwekar). - Add support for the "Requested CPU Min frequency" BIOS option to the amd-pstate driver (Dhananjay Ugwekar). - Reset amd-pstate driver mode after running selftests (Swapnil Sapkal). - Add helper for governor checks to the schedutil cpufreq governor and move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki). - Populate the cpu_capacity sysfs entries from the intel_pstate driver after registering asym capacity support (Ricardo Neri). - Add support for enabling Energy-aware scheduling (EAS) to the intel_pstate driver when operating in the passive mode on a hybrid platform (Rafael Wysocki). - Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan Chancellor). - Drop redundant cpus_read_lock() from store_local_boost() in the cpufreq core (Seyediman Seyedarab). - Replace sscanf() with kstrtouint() in the cpufreq code and use a symbol instead of a raw number in it (Bowen Yu). - Add support for autonomous CPU performance state selection to the CPPC cpufreq driver (Lifeng Zheng). * pm-cpufreq: (31 commits) cpufreq: CPPC: Add support for autonomous selection cpufreq: Update sscanf() to kstrtouint() cpufreq: Replace magic number cpufreq: drop redundant cpus_read_lock() from store_local_boost() cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver() cpufreq: intel_pstate: Document hybrid processor support cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache cpufreq: intel_pstate: EAS support for hybrid platforms cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas() cpufreq: intel_pstate: Populate the cpu_capacity sysfs entries arch_topology: Relocate cpu_scale to topology.[h|c] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq cpufreq/sched: schedutil: Add helper for governor checks amd-pstate-ut: Reset amd-pstate driver mode after running selftests cpufreq/amd-pstate: Add support for the "Requested CPU Min frequency" BIOS option cpufreq/amd-pstate: Add offline, online and suspend callbacks for amd_pstate_driver cpufreq: Force sync policy boost with global boost on sysfs update cpufreq: Preserve policy's boost state after resume cpufreq: Introduce policy_set_boost() cpufreq: Don't unnecessarily call set_boost() ...
2025-05-22sched_ext: Call ops.update_idle() after updating builtin idle bitsTejun Heo
BPF schedulers that use both builtin CPU idle mechanism and ops.update_idle() may want to use the latter to create interlocking between ops.enqueue() and CPU idle transitions so that either ops.enqueue() sees the idle bit or ops.update_idle() sees the task queued somewhere. This can prevent race conditions where CPUs go idle while tasks are waiting in DSQs. For such interlocking to work, ops.update_idle() must be called after builtin CPU masks are updated. Relocate the invocation. Currently, there are no ordering requirements on transitions from idle and this relocation isn't expected to make meaningful differences in that direction. This also makes the ops.update_idle() behavior semantically consistent: any action performed in this callback should be able to override the builtin idle state, not the other way around. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-and-tested-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
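A rough illustration of the resulting ordering (a sketch only; function and macro names are approximate and not the exact ext.c code): the builtin idle masks are updated first, and only then is the BPF scheduler notified, so ops.update_idle() always observes the final builtin state.

    static void update_builtin_idle_and_notify(struct rq *rq, int cpu, bool idle)
    {
            /* 1) publish the builtin idle state */
            if (idle)
                    cpumask_set_cpu(cpu, idle_cpumask);
            else
                    cpumask_clear_cpu(cpu, idle_cpumask);

            /*
             * 2) only now invoke the callback, so anything it does (e.g.
             *    clearing the idle bit again) overrides the builtin state
             *    instead of being overwritten by it.
             */
            if (SCX_HAS_OP(update_idle))
                    SCX_CALL_OP(SCX_KF_REST, update_idle, cpu, idle);
    }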
2025-05-21sched_ext: idle: Consolidate default idle CPU selection kfuncsAndrea Righi
There is no reason to restrict scx_bpf_select_cpu_dfl() invocations to ops.select_cpu() while allowing scx_bpf_select_cpu_and() to be used from multiple contexts, as both provide equivalent functionality, with the latter simply accepting an additional "allowed" cpumask. Therefore, unify the two APIs, enabling both kfuncs to be used from ops.select_cpu(), ops.enqueue(), and unlocked contexts (e.g., via BPF test_run). This allows schedulers to implement a consistent idle CPU selection policy and helps reduce code duplication. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-21sched/uclamp: Align uclamp and util_est and call before freq updateXuewen Yan
The commit dfa0a574cbc47 ("sched/uclamg: Handle delayed dequeue") added the sched_delayed check to prevent double uclamp_dec/inc. However, it put the uclamp_rq_inc() after enqueue_task(). This may lead to the following issue: when a task with uclamp goes through enqueue_task() and could trigger a cpufreq update, its uclamp won't even be considered in that cpufreq update. Only after enqueue will the uclamp be added to the rq buckets, and cpufreq will only pick it up at the next update. This could cause a delay in frequency updating. It may affect performance (uclamp_min > 0) or power (uclamp_max < 1024). So, just like util_est, put the uclamp_rq_inc() before enqueue_task(). And for sched_delayed tasks, as with util_est, use the sched_delayed flag to prevent incrementing a sched_delayed task's uclamp, and use the ENQUEUE_DELAYED flag to allow incrementing the uclamp of a sched_delayed task that is being woken up. Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20250417043457.10632-3-xuewen.yan@unisoc.com
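A minimal sketch of the resulting ordering in enqueue_task() (simplified; the exact signatures and flag handling in the real patch differ):

    static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
    {
            /* ... */
            /*
             * Account uclamp before the class enqueue so that a cpufreq update
             * triggered from within enqueue already sees this task's clamps.
             * Skip delayed tasks unless they are being woken (ENQUEUE_DELAYED).
             */
            if (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))
                    uclamp_rq_inc(rq, p);

            p->sched_class->enqueue_task(rq, p, flags);
            /* ... */
    }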
2025-05-21sched/util_est: Simplify condition for util_est_{en,de}queue()Xuewen Yan
To prevent double enqueue/dequeue of the util-est for sched_delayed tasks, commit 729288bc6856 ("kernel/sched: Fix util_est accounting for DELAY_DEQUEUE") added the corresponding check. This check excludes double en/dequeue during task migration and priority changes. In fact, these conditions can be simplified. For util_est_dequeue, we know that the sched_delayed flag is set in dequeue_entity. When the task is sleeping, we need to call util_est_dequeue to subtract util-est from the cfs_rq. At this point, sched_delayed has not yet been set. If we find that sched_delayed is already set, it indicates that dequeue_task_fair has already been called once for this task. In this case, there is no need to call util_est_dequeue again. Therefore, simply checking the sched_delayed flag should be sufficient to prevent unnecessary util_est updates during the dequeue. For util_est_enqueue, our goal is to add the util_est to the cfs_rq when a task is enqueued. However, we don't want to add the util_est of a sched_delayed task to the cfs_rq because the task is sleeping. Therefore, we can exclude the util_est_enqueue for sched_delayed tasks by checking the sched_delayed flag. However, when waking up a delayed task, the sched_delayed flag is cleared after util_est_enqueue. As a result, if we only check the sched_delayed flag, we would miss the util_est_enqueue. Since waking up a sched_delayed task calls enqueue_task with the ENQUEUE_DELAYED flag, we can determine whether to call util_est_enqueue by checking if the enqueue flags contain ENQUEUE_DELAYED. Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20250417043457.10632-2-xuewen.yan@unisoc.com
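In other words, the checks reduce to roughly the following in the fair-class enqueue/dequeue paths (a sketch, not the verbatim patch):

    /*
     * Enqueue: account util_est unless this is a delayed task that is not
     * currently being woken up (ENQUEUE_DELAYED marks the wakeup case).
     */
    if (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))
            util_est_enqueue(&rq->cfs, p);

    /*
     * Dequeue: a task that already has sched_delayed set has been through
     * dequeue_task_fair() once, so don't subtract its util_est again.
     */
    if (!p->se.sched_delayed)
            util_est_dequeue(&rq->cfs, p);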
2025-05-21sched/fair: Fixup wake_up_sync() vs DELAYED_DEQUEUEXuewen Yan
The delayed-dequeue feature keeps a sleeping task enqueued until its lag has elapsed. As a result, it also stays visible in rq->nr_running. So in wake_affine_idle(), we should use the number of really running tasks in the rq to check whether we should place the woken-up task on the current CPU. Also add a helper function to return the number of delayed-dequeue tasks. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-and-tested-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250303105241.17251-2-xuewen.yan@unisoc.com
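Conceptually the change looks like this in wake_affine_idle() (a sketch; helper and field names are illustrative and may differ from the actual patch):

    /* tasks that are only on the rq because their dequeue is delayed */
    static unsigned int rq_nr_delayed(struct rq *rq)
    {
            return rq->cfs.h_nr_delayed;    /* illustrative field name */
    }

    /*
     * wake_affine_idle(): treat this_cpu as "only running the waker" when
     * every other enqueued task is merely a delayed-dequeue leftover.
     */
    if (sync && cpu_rq(this_cpu)->nr_running - rq_nr_delayed(cpu_rq(this_cpu)) == 1)
            return this_cpu;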
2025-05-20sched_ext: idle: Allow scx_bpf_select_cpu_and() from unlocked contextAndrea Righi
Allow scx_bpf_select_cpu_and() to be used from an unlocked context, in addition to ops.enqueue() or ops.select_cpu(). This enables schedulers, including user-space ones, to implement a consistent idle CPU selection policy and helps reduce code duplication. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-20sched_ext: idle: Validate locking correctness in scx_bpf_select_cpu_and()Andrea Righi
Validate locking correctness when accessing p->nr_cpus_allowed and p->cpus_ptr inside scx_bpf_select_cpu_and(): if the rq lock is held, access is safe; otherwise, require that p->pi_lock is held. This allows to catch potential unsafe calls to scx_bpf_select_cpu_and(). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
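The validation amounts to something like the following (a sketch; scx_locked_rq() stands in for however the currently held rq is tracked):

    struct rq *rq = scx_locked_rq();        /* rq currently locked by this CPU, if any */

    if (!rq || rq != task_rq(p)) {
            /*
             * Not called with the task's rq held, so the caller must hold
             * p->pi_lock for p->cpus_ptr and p->nr_cpus_allowed to be stable.
             */
            lockdep_assert_held(&p->pi_lock);
    }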
2025-05-20sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.cAndrea Righi
Relocate the scx_kf_allowed_if_unlocked(), so it can be used from other source files (e.g., ext_idle.c). No functional change. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-14sched_ext: Explain the temporary situation around scx_root dereferencesTejun Heo
Naked scx_root dereferences are being used as temporary markers to indicate that they need to be updated to point to the right scheduler instance. Explain the situation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Add @sch to scx_exit() and friendsTejun Heo
In preparation of hierarchical scheduling support, add @sch to scx_exit() and friends: - scx_exit/error() updated to take explicit @sch instead of assuming scx_root. - scx_kf_exit/error() added. These are to be used from kfuncs, don't take @sch and internally determine the scx_sched instance to abort. Currently, it's always scx_root but once multiple scheduler support is in place, it will be the scx_sched instance that invoked the kfunc. This simplifies many callsites and defers scx_sched lookup until error is triggered. - @sch is propagated to ops_cpu_valid() and ops_sanitize_err(). The CPU validity conditions in ops_cpu_valid() are factored into __cpu_valid() to implement kf_cpu_valid() which is the counterpart to scx_kf_exit/error(). - All users are converted. Most conversions are straightforward. check_rq_for_timeouts() and scx_softlockup() are updated to use explicit rcu_dereference*(scx_root) for safety as they may execute asynchronous to the exit path. scx_tick() is also updated to use rcu_dereference(). While not strictly necessary due to the preceding scx_enabled() test and IRQ disabled context, this removes the subtlety at no noticeable cost. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Cleanup [__]scx_exit/error*()Tejun Heo
__scx_exit() is the base exit implementation and there are three wrappers on top of it - scx_exit(), __scx_error() and scx_error(). This is more confusing than helpful especially given that there are only a couple users of scx_exit() and __scx_error(). To simplify the situation: - Make __scx_exit() take va_list and rename it to scx_vexit(). This is to ease implementing more complex extensions on top. - Make scx_exit() a varargs wrapper around __scx_exit(). scx_exit() now takes both @kind and @exit_code. - Convert existing scx_exit() and __scx_error() users to use the new scx_exit(). - scx_error() remains unchanged. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
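The split follows the usual vprintf-style pattern, roughly as below (a sketch; the parameter list is approximate):

    static void scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
                          s64 exit_code, const char *fmt, va_list args)
    {
            /* base implementation: record kind/exit_code, format the message,
             * and kick off the disable path */
    }

    static __printf(4, 5) void scx_exit(struct scx_sched *sch,
                                        enum scx_exit_kind kind, s64 exit_code,
                                        const char *fmt, ...)
    {
            va_list args;

            va_start(args, fmt);
            scx_vexit(sch, kind, exit_code, fmt, args);
            va_end(args);
    }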
2025-05-14sched_ext: Add @sch to SCX_CALL_OP*()Tejun Heo
In preparation of hierarchical scheduling support, make SCX_CALL_OP*() take explicit @sch instead of assuming scx_root. As scx_root is still the only scheduler instance, this patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Clean up scx_root usagesTejun Heo
- Always cache scx_root into local variable sch before using. - Don't use scx_root if cached sch is available. - Wrap !sch test with unlikely(). - Pass @scx into scx_cgroup_init/exit(). No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched,livepatch: Untangle cond_resched() and live-patchingPeter Zijlstra
With the goal of deprecating / removing VOLUNTARY preempt, live-patch needs to stop relying on cond_resched() to make forward progress. Instead, rely on schedule() with TASK_FREEZABLE set. Just like live-patching, the freezer needs to be able to stop tasks in a safe / known state. [bigeasy: use likely() in __klp_sched_try_switch() and update comments] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Tested-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20250509113659.wkP_HJ5z@linutronix.de
2025-05-13Merge Energy Model management code changes for 6.16Rafael J. Wysocki
2025-05-12sched/numa: add tracepoint that tracks the skipping of numa balancing due to ↵Libo Chen
cpuset memory pinning Unlike sched_skip_vma_numa tracepoint which tracks skipped VMAs, this tracks the task subjected to cpuset.mems pinning and prints out its allowed memory node mask. Link: https://lkml.kernel.org/r/20250424024523.2298272-3-libo.chen@oracle.com Signed-off-by: Libo Chen <libo.chen@oracle.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Chen Yu <yu.c.chen@intel.com> Cc: Chris Hyser <chris.hyser@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Koutný <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Srikanth Aithal <sraithal@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12sched/numa: skip VMA scanning on memory pinned to one NUMA node via cpuset.memsLibo Chen
Patch series "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems", v5. This patch (of 2): When the memory of the current task is pinned to one NUMA node by cgroup, there is no point in continuing the rest of VMA scanning and hinting page faults as they will just be overhead. With this change, there will be no more unnecessary PTE updates or page faults in this scenario. We have seen up to a 6x improvement on a typical java workload running on VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket AARCH64 system. With the same pinning, on a 18-cores-per-socket Intel platform, we have seen 20% improvment in a microbench that creates a 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB pages in a fixed number of loops. Link: https://lkml.kernel.org/r/20250424024523.2298272-1-libo.chen@oracle.com Link: https://lkml.kernel.org/r/20250424024523.2298272-2-libo.chen@oracle.com Signed-off-by: Libo Chen <libo.chen@oracle.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Srikanth Aithal <sraithal@amd.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Chris Hyser <chris.hyser@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Koutný <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12Merge tag 'sched_ext-for-6.15-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "A little bit invasive for rc6 but they're important fixes, pass tests fine and won't break anything outside sched_ext: - scx_bpf_cpuperf_set() calls internal functions that require the rq to be locked. It assumed that the BPF caller has rq locked but that's not always true. Fix it by tracking whether rq is currently held by the CPU and grabbing it if necessary - bpf_iter_scx_dsq_new() was leaving the DSQ iterator in an uninitialized state after an error. However, next() and destroy() can be called on an iterator which failed initialization and thus they always need to be initialized even after an init error. Fix by always initializing the iterator - Remove duplicate BTF_ID_FLAGS() entries" * tag 'sched_ext-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator sched_ext: Fix rq lock state in hotplug ops sched_ext: Remove duplicate BTF_ID_FLAGS definitions sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() sched_ext: Track currently locked rq
2025-05-09sched_ext: Remove bpf_scx_get_func_protoFeng Yang
task_storage_{get,delete} has been moved to bpf_base_func_proto. Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20250506061434.94277-3-yangfeng59949@163.com
2025-05-07cpufreq/sched: Move cpufreq-specific EAS checks to cpufreqRafael J. Wysocki
Doing cpufreq-specific EAS checks that require accessing policy internals directly from sched_is_eas_possible() is a bit unfortunate, so introduce cpufreq_ready_for_eas() in cpufreq, move those checks into that new function and make sched_is_eas_possible() call it. While at it, address a possible race between the EAS governor check and governor change by doing the former under the policy rwsem. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://patch.msgid.link/2317800.iZASKD2KPV@rjwysocki.net
2025-05-07cpufreq/sched: schedutil: Add helper for governor checksRafael J. Wysocki
Add a helper for checking if schedutil is the current governor for a given cpufreq policy and use it in sched_is_eas_possible() to avoid accessing cpufreq policy internals directly from there. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://patch.msgid.link/3365956.44csPzL39Z@rjwysocki.net
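A helper of this shape is enough for the EAS-side check (a sketch assuming schedutil's governor object is visible on the cpufreq side; the real helper name may differ):

    /* cpufreq_schedutil.c: report whether this policy is run by schedutil */
    bool sugov_is_governor(struct cpufreq_policy *policy)
    {
            return policy->governor == &schedutil_gov;
    }

    /* sched_is_eas_possible() can then ask cpufreq instead of poking at
     * policy internals directly: */
    if (!sugov_is_governor(policy))
            return false;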
2025-05-07Merge branch 'for-6.15-fixes' into for-6.16Tejun Heo
To receive 428dc9fc0873 ("sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator") which conflicts with cdf5a6faa8cf ("sched_ext: Move dsq_hash into scx_sched"). The conflict is a simple context conflict which can be resolved by taking changes from both changes in the right order.
2025-05-07sched_ext: bpf_iter_scx_dsq_new() should always initialize iteratorTejun Heo
BPF programs may call next() and destroy() on BPF iterators even after new() returns an error value (e.g. bpf_for_each() macro ignores error returns from new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized state after an error return causing bpf_iter_scx_dsq_next() to dereference garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that next() and destroy() become noops. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 650ba21b131e ("sched_ext: Implement DSQ iterator") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com>
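The fix boils down to clearing the iterator before any error return, roughly as follows (simplified; the flag check shown is illustrative):

    __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it,
                                         u64 dsq_id, u64 flags)
    {
            struct bpf_iter_scx_dsq_kern *kit = (void *)it;

            /*
             * Zero the DSQ pointer first so that next()/destroy() invoked
             * after a failed new() see kit->dsq == NULL and become no-ops.
             */
            kit->dsq = NULL;

            if (flags & ~SCX_DSQ_ITER_REV)          /* illustrative flag check */
                    return -EINVAL;

            /* ... look up the DSQ and finish initialization ... */
            return 0;
    }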
2025-04-30sched_ext: Avoid NULL scx_root deref in __scx_exit()Andrea Righi
A sched_ext scheduler may trigger __scx_exit() from a BPF timer callback, where scx_root may not be safely dereferenced. This can lead to a NULL pointer dereference as shown below (triggered by scx_tickless): BUG: kernel NULL pointer dereference, address: 0000000000000330 ... CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full) RIP: 0010:__scx_exit+0x2b/0x190 ... Call Trace: <IRQ> scx_bpf_get_idle_smtmask+0x59/0x80 bpf_prog_8320d4217989178c_dispatch_all_cpus+0x35/0x1b6 ... bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264 bpf_timer_cb+0x7a/0x140 __hrtimer_run_queues+0x1f9/0x3a0 hrtimer_run_softirq+0x8c/0xd0 handle_softirqs+0xd3/0x3d0 __irq_exit_rcu+0x9a/0xc0 irq_exit_rcu+0xe/0x20 Fix this by checking for a valid scx_root and adding proper RCU protection. Fixes: 48e1267773866 ("sched_ext: Introduce scx_sched") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-30sched_ext: Add RCU protection to scx_root in DSQ iteratorAndrea Righi
Using a DSQ iterators from a timer callback can trigger the following lockdep splat when accessing scx_root: ============================= WARNING: suspicious RCU usage 6.14.0-virtme #1 Not tainted ----------------------------- kernel/sched/ext.c:6907 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by swapper/0/0. stack backtrace: CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full) Sched_ext: tickless (enabled+all) Call Trace: <IRQ> dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x4e/0xa3 bpf_iter_scx_dsq_new+0xb1/0xd0 bpf_prog_63f4fd1bccc101e7_dispatch_cpu+0x3e/0x156 bpf_prog_8320d4217989178c_dispatch_all_cpus+0x153/0x1b6 bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264 ? hrtimer_run_softirq+0x4f/0xd0 bpf_timer_cb+0x7a/0x140 __hrtimer_run_queues+0x1f9/0x3a0 hrtimer_run_softirq+0x8c/0xd0 handle_softirqs+0xd3/0x3d0 __irq_exit_rcu+0x9a/0xc0 irq_exit_rcu+0xe/0x20 sysvec_apic_timer_interrupt+0x73/0x80 Add a proper dereference check to explicitly validate RCU-safe access to scx_root from rcu_read_lock() contexts and also from contexts that hold rcu_read_lock_bh(), such as timer callbacks. Fixes: cdf5a6faa8cf0 ("sched_ext: Move dsq_hash into scx_sched") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
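The access pattern becomes roughly the following (a sketch):

    struct scx_sched *sch;

    /*
     * scx_root may be read from regular RCU read-side sections and from
     * softirq/timer context, so also accept rcu_read_lock_bh() here
     * (rcu_dereference_check() already accepts plain rcu_read_lock()).
     */
    sch = rcu_dereference_check(scx_root, rcu_read_lock_bh_held());
    if (unlikely(!sch))
            return;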
2025-04-30sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasksJohn Stultz
It was reported that in 6.12, smpboot_create_threads() was taking much longer than in 6.6. I narrowed down the call path to: smpboot_create_threads() -> kthread_create_on_cpu() -> kthread_bind() -> __kthread_bind_mask() -> wait_task_inactive() Where in wait_task_inactive() we were regularly hitting the queued case, which sets a 1 tick timeout, which, when called multiple times in a row, quickly accumulates into a long delay. I noticed disabling the DELAY_DEQUEUE sched feature recovered the performance, and it seems the newly created tasks are usually sched_delayed and left on the runqueue. So in wait_task_inactive(), when we see that p->se.sched_delayed is set, manually dequeue the sched_delayed task with DEQUEUE_DELAYED, so we don't have to constantly wait a tick. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: peter-yc.chang@mediatek.com Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250429150736.3778580-1-jstultz@google.com
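The gist of the fix in wait_task_inactive(), with locking and the surrounding retry loop omitted (a sketch):

    /*
     * A delayed-dequeue task is still on the runqueue even though it is
     * sleeping; force it off instead of retrying with 1-tick timeouts.
     * Assumes the task's rq lock is held at this point.
     */
    if (READ_ONCE(p->se.sched_delayed))
            dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);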
2025-04-29sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn()Tejun Heo
With the global states and disable machinery moved into scx_sched, scx_disable_workfn() can only be scheduled and run for the specific scheduler instance. This makes it impossible for scx_disable_workfn() to see SCX_EXIT_NONE. Turn that condition into WARN_ON_ONCE(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Move disable machinery into scx_schedTejun Heo
Because disable can be triggered from any place and the scheduler cannot be trusted, disable path uses an irq_work to bounce and a kthread_work which is executed on an RT helper kthread to perform disable. These must be per scheduler instance to guarantee forward progress. Move them into scx_sched. - If an scx_sched is accessible, its helper kthread is always valid making the `helper` check in schedule_scx_disable_work() unnecessary. As the function becomes trivial after the removal of the test, inline it. - scx_create_rt_helper() has only one user - creation of the disable helper kthread. Inline it into scx_alloc_and_add_sched(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Move event_stats_cpu into scx_schedTejun Heo
The event counters are going to become per scheduler instance. Move event_stats_cpu into scx_sched. - [__]scx_add_event() are updated to take @sch. While at it, add missing parentheses around @cnt expansions. - scx_read_events() is updated to take @sch. - scx_bpf_events() accesses scx_root under RCU read lock. v2: - Replace stale scx_bpf_get_event_stat() reference in a comment with scx_bpf_events(). - Trivial goto label rename. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Factor out scx_read_events()Tejun Heo
In preparation for moving event_stats_cpu into scx_sched, factor scx_read_events() out of scx_bpf_events() and update the in-kernel users - scx_attr_events_show() and scx_dump_state() - to use scx_read_events() instead of scx_bpf_events(). No observable behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Relocate scx_event_stats definitionTejun Heo
In preparation for moving event_stats_cpu into scx_sched, move the scx_event_stats definitions above the scx_sched definition. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Move global_dsqs into scx_schedTejun Heo
Global DSQs are going to become per scheduler instance. Move global_dsqs into scx_sched. find_global_dsq() already takes a task_struct pointer as an argument and should later be able to determine the scx_sched to use from that. For now, assume scx_root. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Move dsq_hash into scx_schedTejun Heo
User DSQs are going to become per scheduler instance. Move dsq_hash into scx_sched. This shifts the code that assumes scx_root to be the only scx_sched instance up the call stack but doesn't remove them yet. v2: Add missing rcu_read_lock() in scx_bpf_destroy_dsq() as per Andrea. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Factor out scx_alloc_and_add_sched()Tejun Heo
More will be moved into scx_sched. Factor out the allocation and kobject addition path into scx_alloc_and_add_sched(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Inline create_dsq() into scx_bpf_create_dsq()Tejun Heo
create_dsq() is only used by scx_bpf_create_dsq() and the separation gets in the way of making dsq_hash per scx_sched. Inline it into scx_bpf_create_dsq(). While at it, add unlikely() around SCX_DSQ_FLAG_BUILTIN test. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Use dynamic allocation for scx_schedTejun Heo
To prepare for supporting multiple schedulers, make scx_sched allocated dynamically. scx_sched->kobj is now an embedded field and the kobj's lifetime determines the lifetime of the containing scx_sched. - Enable path is updated so that kobj init and addition are performed later. - scx_sched freeing is initiated in scx_kobj_release() and also goes through an rcu_work so that scx_root can be accessed from an unsynchronized path - scx_disable(). - sched_ext_ops->priv is added and used to point to scx_sched instance created for the ops instance. This is used by bpf_scx_unreg() to determine the scx_sched instance to disable and put. No behavior changes intended. v2: Andrea reported kernel oops due to scx_bpf_unreg() trying to deref NULL scx_root after scheduler init failure. sched_ext_ops->priv added so that scx_bpf_unreg() can always find the scx_sched instance to unregister even if it failed early during init. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Avoid NULL scx_root deref through SCX_HAS_OP()Tejun Heo
SCX_HAS_OP() tests the scx_root->has_op bitmap. The bitmap is currently in a statically allocated struct scx_sched and initialized while loading the BPF scheduler and cleared while unloading, and thus can be tested anytime. However, scx_root will be switched to dynamic allocation and thus won't always be dereferenceable. Most usages of SCX_HAS_OP() are already protected by scx_enabled() either directly or indirectly (e.g. through a task which is on SCX). However, there are a couple of places that could try to dereference NULL scx_root. Update them so that scx_root is guaranteed to be valid before SCX_HAS_OP() is called. - In handle_hotplug(), test whether scx_root is NULL before doing anything else. This is safe because scx_root updates will be protected by cpus_read_lock(). - In scx_tg_offline(), test scx_cgroup_enabled before invoking SCX_HAS_OP(), which should guarantee that scx_root won't turn NULL. This is also in line with other cgroup operations. As the code path is synchronized against scx_cgroup_init/exit() through scx_cgroup_rwsem, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29sched_ext: Introduce scx_schedTejun Heo
To support multiple scheduler instances, collect some of the global variables that should be specific to a scheduler instance into the new struct scx_sched. scx_root is the root scheduler instance and points to a static instance of struct scx_sched. Except for an extra dereference through the scx_root pointer, this patch makes no functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29Merge branch 'for-6.15-fixes' into for-6.16Tejun Heo
To receive e38be1c7647c ("sched_ext: Fix rq lock state in hotplug ops") to avoid conflicts with scx_sched related patches pending for for-6.16.
2025-04-29sched_ext: Fix rq lock state in hotplug opsAndrea Righi
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume that the rq involved in the operation is locked, which is not the case during hotplug, triggering the following warning: WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340 Fix by not tracking the target rq as locked in the context of ops.cpu_online() and ops.cpu_offline(). Fixes: 18853ba782bef ("sched_ext: Track currently locked rq") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Tested-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-26Merge tag 'sched-urgent-2025-04-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix sporadic crashes in dequeue_entities() due to ... bad math. [ Arguably if pick_eevdf()/pick_next_entity() was less trusting of complex math being correct it could have de-escalated a crash into a warning, but that's for a different patch ]" * tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
2025-04-26sched/eevdf: Fix se->slice being set to U64_MAX and resulting crashOmar Sandoval
There is a code path in dequeue_entities() that can set the slice of a sched_entity to U64_MAX, which sometimes results in a crash. The offending case is when dequeue_entities() is called to dequeue a delayed group entity, and then the entity's parent's dequeue is delayed. In that case: 1. In the if (entity_is_task(se)) else block at the beginning of dequeue_entities(), slice is set to cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX. 2. The first for_each_sched_entity() loop dequeues the entity. 3. If the entity was its parent's only child, then the next iteration tries to dequeue the parent. 4. If the parent's dequeue needs to be delayed, then it breaks from the first for_each_sched_entity() loop _without updating slice_. 5. The second for_each_sched_entity() loop sets the parent's ->slice to the saved slice, which is still U64_MAX. This throws off subsequent calculations with potentially catastrophic results. A manifestation we saw in production was: 6. In update_entity_lag(), se->slice is used to calculate limit, which ends up as a huge negative number. 7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit is negative, vlag > limit, so se->vlag is set to the same huge negative number. 8. In place_entity(), se->vlag is scaled, which overflows and results in another huge (positive or negative) number. 9. The adjusted lag is subtracted from se->vruntime, which increases or decreases se->vruntime by a huge number. 10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which incorrectly returns false because the vruntime is so far from the other vruntimes on the queue, causing the (vruntime - cfs_rq->min_vruntime) * load calculation to overflow. 11. Nothing appears to be eligible, so pick_eevdf() returns NULL. 12. pick_next_entity() tries to dereference the return value of pick_eevdf() and crashes. Dumping the cfs_rq states from the core dumps with drgn showed tell-tale huge vruntime ranges and bogus vlag values, and I also traced se->slice being set to U64_MAX on live systems (which was usually "benign" since the rest of the runqueue needed to be in a particular state to crash). Fix it in dequeue_entities() by always setting slice from the first non-empty cfs_rq. Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
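The fix keeps refreshing slice from each non-empty cfs_rq while walking up the hierarchy, so a delayed, empty group entity can never leave it at U64_MAX. Schematically (a sketch; field names approximate):

    u64 slice = 0;

    for_each_sched_entity(se) {
            struct cfs_rq *cfs_rq = cfs_rq_of(se);

            /*
             * Take the slice from the first (lowest) non-empty cfs_rq and
             * keep that value if the levels above turn out to be empty.
             */
            if (cfs_rq->nr_queued)
                    slice = cfs_rq_min_slice(cfs_rq);

            /* ... dequeue se; break here if the parent's dequeue must be
             * delayed - 'slice' is no longer left at U64_MAX ... */
    }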
2025-04-25sched_ext: Remove duplicate BTF_ID_FLAGS definitionsAndrea Righi
Some kfuncs specific to the idle CPU selection policy are registered in both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though they should only be defined in the latter. Remove the duplicates from scx_kfunc_ids_any. Fixes: 337d1b354a297 ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-23sched_ext: Clarify CPU context for running/stopping callbacksAndrea Righi
The ops.running() and ops.stopping() callbacks can be invoked from a CPU other than the one the task is assigned to, particularly when a task property is changed, as both scx_next_task_scx() and dequeue_task_scx() may run on CPUs different from the task's target CPU. This behavior can lead to confusion or incorrect assumptions if not properly clarified, potentially resulting in bugs (see [1]). Therefore, update the documentation to clarify this aspect and advise users to use scx_bpf_task_cpu() to determine the actual CPU the task will run on or was running on. [1] https://github.com/sched-ext/scx/pull/1728 Cc: Jake Hillion <jake@hillion.co.uk> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>