summaryrefslogtreecommitdiff
path: root/kernel/sched/sched.h
AgeCommit message (Collapse)Author
2022-06-13sched: Fix balance_push() vs __sched_setscheduler()Peter Zijlstra
The purpose of balance_push() is to act as a filter on task selection in the case of CPU hotplug, specifically when taking the CPU out. It does this by (ab)using the balance callback infrastructure, with the express purpose of keeping all the unlikely/odd cases in a single place. In order to serve its purpose, the balance_push_callback needs to be (exclusively) on the callback list at all times (noting that the callback always places itself back on the list the moment it runs, also noting that when the CPU goes down, regular balancing concerns are moot, so ignoring them is fine). And here-in lies the problem, __sched_setscheduler()'s use of splice_balance_callbacks() takes the callbacks off the list across a lock-break, making it possible for, an interleaving, __schedule() to see an empty list and not get filtered. Fixes: ae7927023243 ("sched: Optimize finish_lock_switch()") Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Link: https://lkml.kernel.org/r/20220519134706.GH2578@worktop.programming.kicks-ass.net
2022-05-26Merge tag 'sysctl-5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "For two kernel releases now kernel/sysctl.c has been being cleaned up slowly, since the tables were grossly long, sprinkled with tons of #ifdefs and all this caused merge conflicts with one susbystem or another. This tree was put together to help try to avoid conflicts with these cleanups going on different trees at time. So nothing exciting on this pull request, just cleanups. Thanks a lot to the Uniontech and Huawei folks for doing some of this nasty work" * tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (28 commits) sched: Fix build warning without CONFIG_SYSCTL reboot: Fix build warning without CONFIG_SYSCTL kernel/kexec_core: move kexec_core sysctls into its own file sysctl: minor cleanup in new_dir() ftrace: fix building with SYSCTL=y but DYNAMIC_FTRACE=n fs/proc: Introduce list_for_each_table_entry for proc sysctl mm: fix unused variable kernel warning when SYSCTL=n latencytop: move sysctl to its own file ftrace: fix building with SYSCTL=n but DYNAMIC_FTRACE=y ftrace: Fix build warning ftrace: move sysctl_ftrace_enabled to ftrace.c kernel/do_mount_initrd: move real_root_dev sysctls to its own file kernel/delayacct: move delayacct sysctls to its own file kernel/acct: move acct sysctls to its own file kernel/panic: move panic sysctls to its own file kernel/lockdep: move lockdep sysctls to its own file mm: move page-writeback sysctls to their own file mm: move oom_kill sysctls to their own file kernel/reboot: move reboot sysctls to its own file sched: Move energy_aware sysctls to topology.c ...
2022-05-19sched: Reverse sched_class layoutPeter Zijlstra
Because GCC-12 is fully stupid about array bounds and it's just really hard to get a solid array definition from a linker script, flip the array order to avoid needing negative offsets :-/ This makes the whole relational pointer magic a little less obvious, but alas. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/YoOLLmLG7HRTXeEm@hirez.programming.kicks-ass.net
2022-05-11sched/core: Avoid obvious double update_rq_clock warningHao Jia
When we use raw_spin_rq_lock() to acquire the rq lock and have to update the rq clock while holding the lock, the kernel may issue a WARN_DOUBLE_CLOCK warning. Since we directly use raw_spin_rq_lock() to acquire rq lock instead of rq_lock(), there is no corresponding change to rq->clock_update_flags. In particular, we have obtained the rq lock of other CPUs, the rq->clock_update_flags of this CPU may be RQCF_UPDATED at this time, and then calling update_rq_clock() will trigger the WARN_DOUBLE_CLOCK warning. So we need to clear RQCF_UPDATED of rq->clock_update_flags to avoid the WARN_DOUBLE_CLOCK warning. For the sched_rt_period_timer() and migrate_task_rq_dl() cases we simply replace raw_spin_rq_lock()/raw_spin_rq_unlock() with rq_lock()/rq_unlock(). For the {pull,push}_{rt,dl}_task() cases, we add the double_rq_clock_clear_update() function to clear RQCF_UPDATED of rq->clock_update_flags, and call double_rq_clock_clear_update() before double_lock_balance()/double_rq_lock() returns to avoid the WARN_DOUBLE_CLOCK warning. Some call trace reports: Call Trace 1: <IRQ> sched_rt_period_timer+0x10f/0x3a0 ? enqueue_top_rt_rq+0x110/0x110 __hrtimer_run_queues+0x1a9/0x490 hrtimer_interrupt+0x10b/0x240 __sysvec_apic_timer_interrupt+0x8a/0x250 sysvec_apic_timer_interrupt+0x9a/0xd0 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x12/0x20 Call Trace 2: <TASK> activate_task+0x8b/0x110 push_rt_task.part.108+0x241/0x2c0 push_rt_tasks+0x15/0x30 finish_task_switch+0xaa/0x2e0 ? __switch_to+0x134/0x420 __schedule+0x343/0x8e0 ? hrtimer_start_range_ns+0x101/0x340 schedule+0x4e/0xb0 do_nanosleep+0x8e/0x160 hrtimer_nanosleep+0x89/0x120 ? hrtimer_init_sleeper+0x90/0x90 __x64_sys_nanosleep+0x96/0xd0 do_syscall_64+0x34/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Call Trace 3: <TASK> deactivate_task+0x93/0xe0 pull_rt_task+0x33e/0x400 balance_rt+0x7e/0x90 __schedule+0x62f/0x8e0 do_task_dead+0x3f/0x50 do_exit+0x7b8/0xbb0 do_group_exit+0x2d/0x90 get_signal+0x9df/0x9e0 ? preempt_count_add+0x56/0xa0 ? __remove_hrtimer+0x35/0x70 arch_do_signal_or_restart+0x36/0x720 ? nanosleep_copyout+0x39/0x50 ? do_nanosleep+0x131/0x160 ? audit_filter_inodes+0xf5/0x120 exit_to_user_mode_prepare+0x10f/0x1e0 syscall_exit_to_user_mode+0x17/0x30 do_syscall_64+0x40/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Call Trace 4: update_rq_clock+0x128/0x1a0 migrate_task_rq_dl+0xec/0x310 set_task_cpu+0x84/0x1e4 try_to_wake_up+0x1d8/0x5c0 wake_up_process+0x1c/0x30 hrtimer_wakeup+0x24/0x3c __hrtimer_run_queues+0x114/0x270 hrtimer_interrupt+0xe8/0x244 arch_timer_handler_phys+0x30/0x50 handle_percpu_devid_irq+0x88/0x140 generic_handle_domain_irq+0x40/0x60 gic_handle_irq+0x48/0xe0 call_on_irq_stack+0x2c/0x60 do_interrupt_handler+0x80/0x84 Steps to reproduce: 1. Enable CONFIG_SCHED_DEBUG when compiling the kernel 2. echo 1 > /sys/kernel/debug/clear_warn_once echo "WARN_DOUBLE_CLOCK" > /sys/kernel/debug/sched/features echo "NO_RT_PUSH_IPI" > /sys/kernel/debug/sched/features 3. Run some rt/dl tasks that periodically work and sleep, e.g. Create 2*n rt or dl (90% running) tasks via rt-app (on a system with n CPUs), and Dietmar Eggemann reports Call Trace 4 when running on PREEMPT_RT kernel. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20220430085843.62939-2-jiahao.os@bytedance.com
2022-05-06Merge tag 'v5.18-rc5' into sched/core to pull in fixes & to resolve a conflictIngo Molnar
- sched/core is on a pretty old -rc1 base - refresh it to include recent fixes. - this also allows up to resolve a (trivial) .mailmap conflict Conflicts: .mailmap Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-05-01sched: Fix missing prototype warningsThomas Gleixner
A W=1 build emits more than a dozen missing prototype warnings related to scheduler and scheduler specific includes. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220413133024.249118058@linutronix.de
2022-04-22sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rqChengming Zhou
Since commit 23127296889f ("sched/fair: Update scale invariance of PELT") change to use rq_clock_pelt() instead of rq_clock_task(), we should also use rq_clock_pelt() for throttled_clock_task_time and throttled_clock_task accounting to get correct cfs_rq_clock_pelt() of throttled cfs_rq. And rename throttled_clock_task(_time) to be clock_pelt rather than clock_task. Fixes: 23127296889f ("sched/fair: Update scale invariance of PELT") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220408115309.81603-1-zhouchengming@bytedance.com
2022-04-06sched: Move rr_timeslice sysctls to rt.cZhen Ni
move rr_timeslice sysctls to rt.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06sched: Move rt_period/runtime sysctls to rt.cZhen Ni
move rt_period/runtime sysctls to rt.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06sched: Move child_runs_first sysctls to fair.cZhen Ni
move child_runs_first sysctls to fair.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-05sched/core: Fix forceidle balancingPeter Zijlstra
Steve reported that ChromeOS encounters the forceidle balancer being ran from rt_mutex_setprio()'s balance_callback() invocation and explodes. Now, the forceidle balancer gets queued every time the idle task gets selected, set_next_task(), which is strictly too often. rt_mutex_setprio() also uses set_next_task() in the 'change' pattern: queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */ running = task_current(rq, p); /* rq->curr == p */ if (queued) dequeue_task(...); if (running) put_prev_task(...); /* change task properties */ if (queued) enqueue_task(...); if (running) set_next_task(...); However, rt_mutex_setprio() will explicitly not run this pattern on the idle task (since priority boosting the idle task is quite insane). Most other 'change' pattern users are pidhash based and would also not apply to idle. Also, the change pattern doesn't contain a __balance_callback() invocation and hence we could have an out-of-band balance-callback, which *should* trigger the WARN in rq_pin_lock() (which guards against this exact anti-pattern). So while none of that explains how this happens, it does indicate that having it in set_next_task() might not be the most robust option. Instead, explicitly queue the forceidle balancer from pick_next_task() when it does indeed result in forceidle selection. Having it here, ensures it can only be triggered under the __schedule() rq->lock instance, and hence must be ran from that context. This also happens to clean up the code a little, so win-win. Fixes: d2dfa17bc7de ("sched: Trivial forced-newidle balancer") Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: T.J. Alumbaugh <talumbau@chromium.org> Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-ass.net
2022-03-15Merge branch 'sched/fast-headers' into sched/coreIngo Molnar
Merge the scheduler build speedup of the fast-headers tree. Cumulative scheduler (kernel/sched/) build time speedup on a Linux distribution's config, which enables all scheduler features, compared to the vanilla kernel: _____________________________________________________________________________ | | Vanilla kernel (v5.13-rc7): |_____________________________________________________________________________ | | Performance counter stats for 'make -j96 kernel/sched/' (3 runs): | | 126,975,564,374 instructions # 1.45 insn per cycle ( +- 0.00% ) | 87,637,847,671 cycles # 3.959 GHz ( +- 0.30% ) | 22,136.96 msec cpu-clock # 7.499 CPUs utilized ( +- 0.29% ) | | 2.9520 +- 0.0169 seconds time elapsed ( +- 0.57% ) |_____________________________________________________________________________ | | Patched kernel: |_____________________________________________________________________________ | | Performance counter stats for 'make -j96 kernel/sched/' (3 runs): | | 50,420,496,914 instructions # 1.47 insn per cycle ( +- 0.00% ) | 34,234,322,038 cycles # 3.946 GHz ( +- 0.31% ) | 8,675.81 msec cpu-clock # 3.053 CPUs utilized ( +- 0.45% ) | | 2.8420 +- 0.0181 seconds time elapsed ( +- 0.64% ) |_____________________________________________________________________________ Summary: - CPU time used to build the scheduler dropped by -60.9%, a reduction from 22.1 clock-seconds to 8.7 clock-seconds. - Wall-clock time to build the scheduler dropped by -3.9%, a reduction from 2.95 seconds to 2.84 seconds. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-08sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()Dietmar Eggemann
Both functions are doing almost the same, that is checking if admission control is still respected. With exclusive cpusets, dl_task_can_attach() checks if the destination cpuset (i.e. its root domain) has enough CPU capacity to accommodate the task. dl_cpu_busy() checks if there is enough CPU capacity in the cpuset in case the CPU is hot-plugged out. dl_task_can_attach() is used to check if a task can be admitted while dl_cpu_busy() is used to check if a CPU can be hotplugged out. Make dl_cpu_busy() able to deal with a task and use it instead of dl_task_can_attach() in task_can_attach(). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-4-dietmar.eggemann@arm.com
2022-03-08sched/deadline: Move bandwidth mgmt and reclaim functions into sched class ↵Dietmar Eggemann
source file Move the deadline bandwidth management (admission control) functions __dl_add(), __dl_sub() and __dl_overflow() as well as the bandwidth reclaim function __dl_update() from private task scheduler header file to the deadline sched class source file. The functions are only used internally so they don't have to be exported. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-3-dietmar.eggemann@arm.com
2022-03-08sched/deadline: Remove unused def_dl_bandwidthDietmar Eggemann
Since commit 1724813d9f2c ("sched/deadline: Remove the sysctl_sched_dl knobs") the default deadline bandwidth control structure has no purpose. Remove it. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-2-dietmar.eggemann@arm.com
2022-02-23sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h ↵Ingo Molnar
dependencies Remove all headers, except the ones required to make this header build standalone. Also include stats.h in sched.h explicitly - dependencies already require this. Summary of the build speedup gained through the last ~15 scheduler build & header dependency patches: Cumulative scheduler (kernel/sched/) build time speedup on a Linux distribution's config, which enables all scheduler features, compared to the vanilla kernel: _____________________________________________________________________________ | | Vanilla kernel (v5.13-rc7): |_____________________________________________________________________________ | | Performance counter stats for 'make -j96 kernel/sched/' (3 runs): | | 126,975,564,374 instructions # 1.45 insn per cycle ( +- 0.00% ) | 87,637,847,671 cycles # 3.959 GHz ( +- 0.30% ) | 22,136.96 msec cpu-clock # 7.499 CPUs utilized ( +- 0.29% ) | | 2.9520 +- 0.0169 seconds time elapsed ( +- 0.57% ) |_____________________________________________________________________________ | | Patched kernel: |_____________________________________________________________________________ | | Performance counter stats for 'make -j96 kernel/sched/' (3 runs): | | 50,420,496,914 instructions # 1.47 insn per cycle ( +- 0.00% ) | 34,234,322,038 cycles # 3.946 GHz ( +- 0.31% ) | 8,675.81 msec cpu-clock # 3.053 CPUs utilized ( +- 0.45% ) | | 2.8420 +- 0.0181 seconds time elapsed ( +- 0.64% ) |_____________________________________________________________________________ Summary: - CPU time used to build the scheduler dropped by -60.9%, a reduction from 22.1 clock-seconds to 8.7 clock-seconds. - Wall-clock time to build the scheduler dropped by -3.9%, a reduction from 2.95 seconds to 2.84 seconds. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23sched/headers: Standardize kernel/sched/sched.h header dependenciesIngo Molnar
kernel/sched/sched.h is a weird mix of ad-hoc headers included in the middle of the header. Two of them rely on being included in the middle of kernel/sched/sched.h, due to definitions they require: - "stat.h" needs the rq definitions. - "autogroup.h" needs the task_group definition. Move the inclusion of these two files out of kernel/sched/sched.h, and include them in all files that require them. Move of the rest of the header dependencies to the top of the kernel/sched/sched.h file. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c ↵Ingo Molnar
files there Similarly to kernel/sched/build_utility.c, collect all 'scheduling policy' related source code files into kernel/sched/build_policy.c: kernel/sched/idle.c kernel/sched/rt.c kernel/sched/cpudeadline.c kernel/sched/pelt.c kernel/sched/cputime.c kernel/sched/deadline.c With the exception of fair.c, which we continue to build as a separate file for build efficiency and parallelism reasons. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c ↵Ingo Molnar
files there Collect all utility functionality source code files into a single kernel/sched/build_utility.c file, via #include-ing the .c files: kernel/sched/clock.c kernel/sched/completion.c kernel/sched/loadavg.c kernel/sched/swait.c kernel/sched/wait_bit.c kernel/sched/wait.c CONFIG_CPU_FREQ: kernel/sched/cpufreq.c CONFIG_CPU_FREQ_GOV_SCHEDUTIL: kernel/sched/cpufreq_schedutil.c CONFIG_CGROUP_CPUACCT: kernel/sched/cpuacct.c CONFIG_SCHED_DEBUG: kernel/sched/debug.c CONFIG_SCHEDSTATS: kernel/sched/stats.c CONFIG_SMP: kernel/sched/cpupri.c kernel/sched/stop_task.c kernel/sched/topology.c CONFIG_SCHED_CORE: kernel/sched/core_sched.c CONFIG_PSI: kernel/sched/psi.c CONFIG_MEMBARRIER: kernel/sched/membarrier.c CONFIG_CPU_ISOLATION: kernel/sched/isolation.c CONFIG_SCHED_AUTOGROUP: kernel/sched/autogroup.c The goal is to amortize the 60+ KLOC header bloat from over a dozen build units into a single build unit. The build time of build_utility.c also roughly matches the build time of core.c and fair.c - allowing better load-balancing of scheduler-only rebuilds. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23sched/headers: Add header guard to kernel/sched/sched.hIngo Molnar
Use the canonical header guard naming of the full path to the header. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-16sched/numa: Fix NUMA topology for systems with CPU-less nodesHuang Ying
The NUMA topology parameters (sched_numa_topology_type, sched_domains_numa_levels, and sched_max_numa_distance, etc.) identified by scheduler may be wrong for systems with CPU-less nodes. For example, the ACPI SLIT of a system with CPU-less persistent memory (Intel Optane DCPMM) nodes is as follows, [000h 0000 4] Signature : "SLIT" [System Locality Information Table] [004h 0004 4] Table Length : 0000042C [008h 0008 1] Revision : 01 [009h 0009 1] Checksum : 59 [00Ah 0010 6] Oem ID : "XXXX" [010h 0016 8] Oem Table ID : "XXXXXXX" [018h 0024 4] Oem Revision : 00000001 [01Ch 0028 4] Asl Compiler ID : "INTL" [020h 0032 4] Asl Compiler Revision : 20091013 [024h 0036 8] Localities : 0000000000000004 [02Ch 0044 4] Locality 0 : 0A 15 11 1C [030h 0048 4] Locality 1 : 15 0A 1C 11 [034h 0052 4] Locality 2 : 11 1C 0A 1C [038h 0056 4] Locality 3 : 1C 11 1C 0A While the `numactl -H` output is as follows, available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 0 size: 64136 MB node 0 free: 5981 MB node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 1 size: 64466 MB node 1 free: 10415 MB node 2 cpus: node 2 size: 253952 MB node 2 free: 253920 MB node 3 cpus: node 3 size: 253952 MB node 3 free: 253951 MB node distances: node 0 1 2 3 0: 10 21 17 28 1: 21 10 28 17 2: 17 28 10 28 3: 28 17 28 10 In this system, there are only 2 sockets. In each memory controller, both DRAM and PMEM DIMMs are installed. Although the physical NUMA topology is simple, the logical NUMA topology becomes a little complex. Because both the distance(0, 1) and distance (1, 3) are less than the distance (0, 3), it appears that node 1 sits between node 0 and node 3. And the whole system appears to be a glueless mesh NUMA topology type. But it's definitely not, there is even no CPU in node 3. This isn't a practical problem now yet. Because the PMEM nodes (node 2 and node 3 in example system) are offlined by default during system boot. So init_numa_topology_type() called during system boot will ignore them and set sched_numa_topology_type to NUMA_DIRECT. And init_numa_topology_type() is only called at runtime when a CPU of a never-onlined-before node gets plugged in. And there's no CPU in the PMEM nodes. But it appears better to fix this to make the code more robust. To test the potential problem. We have used a debug patch to call init_numa_topology_type() when the PMEM node is onlined (in __set_migration_target_nodes()). With that, the NUMA parameters identified by scheduler is as follows, sched_numa_topology_type: NUMA_GLUELESS_MESH sched_domains_numa_levels: 4 sched_max_numa_distance: 28 To fix the issue, the CPU-less nodes are ignored when the NUMA topology parameters are identified. Because a node may become CPU-less or not at run time because of CPU hotplug, the NUMA topology parameters need to be re-initialized at runtime for CPU hotplug too. With the patch, the NUMA parameters identified for the example system above is as follows, sched_numa_topology_type: NUMA_DIRECT sched_domains_numa_levels: 2 sched_max_numa_distance: 21 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com
2022-01-27sched/sugov: Ignore 'busy' filter when rq is capped by uclamp_maxQais Yousef
sugov_update_single_{freq, perf}() contains a 'busy' filter that ensures we don't bring the frqeuency down if there's no idle time (CPU is busy). The problem is that with uclamp_max we will have scenarios where a busy task is capped to run at a lower frequency and this filter prevents applying the capping when this task starts running. We handle this by skipping the filter when uclamp is enabled and the rq is being capped by uclamp_max. We introduce a new function uclamp_rq_is_capped() to help detecting when this capping is taking effect. Some code shuffling was required to allow using cpu_util_{cfs, rt}() in this new function. On 2 Core SMT2 Intel laptop I see: Without this patch: uclampset -M 0 sysbench --test=cpu --threads = 4 run produces a score of ~3200 consistently. Which is the highest possible. Compiling the kernel also results in frequency running at max 3.1GHz all the time - running uclampset -M 400 to cap it has no effect without this patch. With this patch: uclampset -M 0 sysbench --test=cpu --threads = 4 run produces a score of ~1100 with some outliers in ~1700. Uclamp max aggregates the performance requirements, so having high values sometimes is expected if some other task happens to require that frequency starts running at the same time. When compiling the kernel with uclampset -M 400 I can see the frequencies mostly in the ~2GHz region. Helpful to conserve power and prevent heating when not plugged in. Fixes: 982d9cdc22c9 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211216225320.2957053-2-qais.yousef@arm.com
2021-12-11sched/fair: Replace CFS internal cpu_util() with cpu_util_cfs()Dietmar Eggemann
cpu_util_cfs() was created by commit d4edd662ac16 ("sched/cpufreq: Use the DEADLINE utilization signal") to enable the access to CPU utilization from the Schedutil CPUfreq governor. Commit a07630b8b2c1 ("sched/cpufreq/schedutil: Use util_est for OPP selection") added util_est support later. The only thing cpu_util() is doing on top of what cpu_util_cfs() already does is to clamp the return value to the [0..capacity_orig] capacity range of the CPU. Integrating this into cpu_util_cfs() is not harming the existing users (Schedutil and CPUfreq cooling (latter via sched_cpu_util() wrapper)). For straightforwardness, prefer to keep using `int cpu` as the function parameter over using `struct rq *rq` which might avoid some calls to cpu_rq(cpu) -> per_cpu(runqueues, cpu) -> RELOC_HIDE(). Update cfs_util()'s documentation and reuse it for cpu_util_cfs(). Remove cpu_util(). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20211118164240.623551-1-dietmar.eggemann@arm.com
2021-11-17sched/core: Forced idle accountingJosh Don
Adds accounting for "forced idle" time, which is time where a cookie'd task forces its SMT sibling to idle, despite the presence of runnable tasks. Forced idle time is one means to measure the cost of enabling core scheduling (ie. the capacity lost due to the need to force idle). Forced idle time is attributed to the thread responsible for causing the forced idle. A few details: - Forced idle time is displayed via /proc/PID/sched. It also requires that schedstats is enabled. - Forced idle is only accounted when a sibling hyperthread is held idle despite the presence of runnable tasks. No time is charged if a sibling is idle but has no runnable tasks. - Tasks with 0 cookie are never charged forced idle. - For SMT > 2, we scale the amount of forced idle charged based on the number of forced idle siblings. Additionally, we split the time up and evenly charge it to all running tasks, as each is equally responsible for the forced idle. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211018203428.2025792-1-joshdon@google.com
2021-11-11sched/fair: Prevent dead task groups from regaining cfs_rq'sMathias Krause
Kevin is reporting crashes which point to a use-after-free of a cfs_rq in update_blocked_averages(). Initial debugging revealed that we've live cfs_rq's (on_list=1) in an about to be kfree()'d task group in free_fair_sched_group(). However, it was unclear how that can happen. His kernel config happened to lead to a layout of struct sched_entity that put the 'my_q' member directly into the middle of the object which makes it incidentally overlap with SLUB's freelist pointer. That, in combination with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a reliable access violation in form of a #GP which made the UAF fail fast. Michal seems to have run into the same issue[1]. He already correctly diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") is causing the preconditions for the UAF to happen by re-adding cfs_rq's also to task groups that have no more running tasks, i.e. also to dead ones. His analysis, however, misses the real root cause and it cannot be seen from the crash backtrace only, as the real offender is tg_unthrottle_up() getting called via sched_cfs_period_timer() via the timer interrupt at an inconvenient time. When unregister_fair_sched_group() unlinks all cfs_rq's from the dying task group, it doesn't protect itself from getting interrupted. If the timer interrupt triggers while we iterate over all CPUs or after unregister_fair_sched_group() has finished but prior to unlinking the task group, sched_cfs_period_timer() will execute and walk the list of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the dying task group. These will later -- in free_fair_sched_group() -- be kfree()'ed while still being linked, leading to the fireworks Kevin and Michal are seeing. To fix this race, ensure the dying task group gets unlinked first. However, simply switching the order of unregistering and unlinking the task group isn't sufficient, as concurrent RCU walkers might still see it, as can be seen below: CPU1: CPU2: : timer IRQ: : do_sched_cfs_period_timer(): : : : distribute_cfs_runtime(): : rcu_read_lock(); : : : unthrottle_cfs_rq(): sched_offline_group(): : : walk_tg_tree_from(…,tg_unthrottle_up,…): list_del_rcu(&tg->list); : (1) : list_for_each_entry_rcu(child, &parent->children, siblings) : : (2) list_del_rcu(&tg->siblings); : : tg_unthrottle_up(): unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; : : list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); : : : : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running) (3) : list_add_leaf_cfs_rq(cfs_rq); : : : : : : : : : : (4) : rcu_read_unlock(); CPU 2 walks the task group list in parallel to sched_offline_group(), specifically, it'll read the soon to be unlinked task group entry at (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of these at (3), which is the cause of the UAF later on. To prevent this additional race from happening, we need to wait until walk_tg_tree_from() has finished traversing the task groups, i.e. after the RCU read critical section ends in (4). Afterwards we're safe to call unregister_fair_sched_group(), as each new walk won't see the dying task group any more. On top of that, we need to wait yet another RCU grace period after unregister_fair_sched_group() to ensure print_cfs_stats(), which might run concurrently, always sees valid objects, i.e. not already free'd ones. This patch survives Michal's reproducer[2] for 8h+ now, which used to trigger within minutes before. [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/ [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/ Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") [peterz: shuffle code around a bit] Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-11-01Merge tag 'cpu-to-thread_info-v5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull thread_info update to move 'cpu' back from task_struct from Kees Cook: "Cross-architecture update to move task_struct::cpu back into thread_info on arm64, x86, s390, powerpc, and riscv. All Acked by arch maintainers. Quoting Ard Biesheuvel: 'Move task_struct::cpu back into thread_info Keeping CPU in task_struct is problematic for architectures that define raw_smp_processor_id() in terms of this field, as it requires linux/sched.h to be included, which causes a lot of pain in terms of circular dependencies (aka 'header soup') This series moves it back into thread_info (where it came from) for all architectures that enable THREAD_INFO_IN_TASK, addressing the header soup issue as well as some pointless differences in the implementations of task_cpu() and set_task_cpu()'" * tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: riscv: rely on core code to keep thread_info::cpu updated powerpc: smp: remove hack to obtain offset of task_struct::cpu sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y powerpc: add CPU field to struct thread_info s390: add CPU field to struct thread_info x86: add CPU field to struct thread_info arm64: add CPU field to struct thread_info
2021-11-01Merge tag 'sched-core-2021-11-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Thomas Gleixner: - Revert the printk format based wchan() symbol resolution as it can leak the raw value in case that the symbol is not resolvable. - Make wchan() more robust and work with all kind of unwinders by enforcing that the task stays blocked while unwinding is in progress. - Prevent sched_fork() from accessing an invalid sched_task_group - Improve asymmetric packing logic - Extend scheduler statistics to RT and DL scheduling classes and add statistics for bandwith burst to the SCHED_FAIR class. - Properly account SCHED_IDLE entities - Prevent a potential deadlock when initial priority is assigned to a newly created kthread. A recent change to plug a race between cpuset and __sched_setscheduler() introduced a new lock dependency which is now triggered. Break the lock dependency chain by moving the priority assignment to the thread function. - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems. - Improve idle balancing in general and especially for NOHZ enabled systems. - Provide proper interfaces for live patching so it does not have to fiddle with scheduler internals. - Add cluster aware scheduling support. - A small set of tweaks for RT (irqwork, wait_task_inactive(), various scheduler options and delaying mmdrop) - The usual small tweaks and improvements all over the place * tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits) sched/fair: Cleanup newidle_balance sched/fair: Remove sysctl_sched_migration_cost condition sched/fair: Wait before decaying max_newidle_lb_cost sched/fair: Skip update_blocked_averages if we are defering load balance sched/fair: Account update_blocked_averages in newidle_balance cost x86: Fix __get_wchan() for !STACKTRACE sched,x86: Fix L2 cache mask sched/core: Remove rq_relock() sched: Improve wake_up_all_idle_cpus() take #2 irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support. sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ sched: Add cluster scheduler level for x86 sched: Add cluster scheduler level in core and related Kconfig for ARM64 topology: Represent clusters of CPUs within a die sched: Disable -Wunused-but-set-variable sched: Add wrapper for get_wchan() to keep task blocked x86: Fix get_wchan() to support the ORC unwinder proc: Use task_is_running() for wchan in /proc/$pid/stat ...
2021-10-22sched/core: Remove rq_relock()Peng Wang
After the removal of migrate_tasks(), there is no user of rq_relock() left, so remove it. Signed-off-by: Peng Wang <rocking@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/449948fdf9be4764b3929c52572917dd25eef758.1634611953.git.rocking@linux.alibaba.com
2021-10-18sched: move the <linux/blkdev.h> include out of kernel/sched/sched.hChristoph Hellwig
Only core.c needs blkdev.h, so move the #include statement there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-05sched: Make cookie functions staticShaokun Zhang
Make cookie functions static as these are no longer invoked directly by other code. No functional change intended. Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210922085735.52812-1-zhangshaokun@hisilicon.com
2021-10-05sched/topology: Introduce sched_group::flagsRicardo Neri
There exist situations in which the load balance needs to know the properties of the CPUs in a scheduling group. When using asymmetric packing, for instance, the load balancer needs to know not only the state of dst_cpu but also of its SMT siblings, if any. Use the flags of the child scheduling domains to initialize scheduling group flags. This will reflect the properties of the CPUs in the group. A subsequent changeset will make use of these new flags. No functional changes are introduced. Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-3-ricardo.neri-calderon@linux.intel.com
2021-10-05sched: Remove unused inline function __rq_clock_broken()YueHaibing
These is no caller in tree since commit 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210914095244.52780-1-yuehaibing@huawei.com
2021-10-05sched/fair: Add cfs bandwidth burst statisticsHuaixin Chang
Two new statistics are introduced to show the internal of burst feature and explain why burst helps or not. nr_bursts: number of periods bandwidth burst occurs burst_time: cumulative wall-time (in nanoseconds) that any cpus has used above quota in respective periods Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210830032215.16302-2-changhuaixin@linux.alibaba.com
2021-10-05sched: reduce sched slice for SCHED_IDLE entitiesJosh Don
Use a small, non-scaled min granularity for SCHED_IDLE entities, when competing with normal entities. This reduces the latency of getting a normal entity back on cpu, at the expense of increased context switch frequency of SCHED_IDLE entities. The benefit of this change is to reduce the round-robin latency for normal entities when competing with a SCHED_IDLE entity. Example: on a machine with HZ=1000, spawned two threads, one of which is SCHED_IDLE, and affined to one cpu. Without this patch, the SCHED_IDLE thread runs for 4ms then waits for 1.4s. With this patch, it runs for 1ms and waits 340ms (as it round-robins with the other thread). Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210820010403.946838-4-joshdon@google.com
2021-10-05sched: Account number of SCHED_IDLE entities on each cfs_rqJosh Don
Adds cfs_rq->idle_nr_running, which accounts the number of idle entities directly enqueued on the cfs_rq. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210820010403.946838-3-joshdon@google.com
2021-10-05sched/fair: Add NOHZ balancer flag for nohz.next_balance updatesValentin Schneider
A following patch will trigger NOHZ idle balances as a means to update nohz.next_balance. Vincent noted that blocked load updates can have non-negligible overhead, which should be avoided if the intent is to only update nohz.next_balance. Add a new NOHZ balance kick flag, NOHZ_NEXT_KICK. Gate NOHZ blocked load update by the presence of NOHZ_STATS_KICK - currently all NOHZ balance kicks will have the NOHZ_STATS_KICK flag set, so no change in behaviour is expected. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210823111700.2842997-2-valentin.schneider@arm.com
2021-09-30sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=yArd Biesheuvel
THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this causes some issues on architectures that define raw_smp_processor_id() in terms of this field, due to the fact that #include'ing linux/sched.h to get at struct task_struct is problematic in terms of circular dependencies. Given that thread_info and task_struct are the same data structure anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having access to the type definition of struct thread_info is sufficient to reference the CPU number of the current task. Note that this requires THREAD_INFO_IN_TASK's definition of the task_thread_info() helper to be updated, as task_cpu() takes a pointer-to-const, whereas task_thread_info() (which is used to generate lvalues as well), needs a non-const pointer. So make it a macro instead. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
2021-08-30Merge tag 'sched-core-2021-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - The biggest change in this cycle is scheduler support for asymmetric scheduling affinity, to support the execution of legacy 32-bit tasks on AArch32 systems that also have 64-bit-only CPUs. Architectures can fill in this functionality by defining their own task_cpu_possible_mask(p). When this is done, the scheduler will make sure the task will only be scheduled on CPUs that support it. (The actual arm64 specific changes are not part of this tree.) For other architectures there will be no change in functionality. - Add cgroup SCHED_IDLE support - Increase node-distance flexibility & delay determining it until a CPU is brought online. (This enables platforms where node distance isn't final until the CPU is only.) - Deadline scheduler enhancements & fixes - Misc fixes & cleanups. * tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) eventfd: Make signal recursion protection a task bit sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case sched: Introduce dl_task_check_affinity() to check proposed affinity sched: Allow task CPU affinity to be restricted on asymmetric systems sched: Split the guts of sched_setaffinity() into a helper function sched: Introduce task_struct::user_cpus_ptr to track requested affinity sched: Reject CPU affinity changes based on task_cpu_possible_mask() cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq() cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 sched: Introduce task_cpu_possible_mask() to limit fallback rq selection sched: Cgroup SCHED_IDLE support sched/topology: Skip updating masks for non-online nodes sched: Replace deprecated CPU-hotplug functions. sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS sched: Fix UCLAMP_FLAG_IDLE setting sched/deadline: Fix missing clock update in migrate_task_rq_dl() sched/fair: Avoid a second scan of target in select_idle_cpu sched/fair: Use prev instead of new target as recent_used_cpu sched: Don't report SCHED_FLAG_SUGOV in sched_getattr() ...
2021-08-26sched: Fix get_push_task() vs migrate_disable()Sebastian Andrzej Siewior
push_rt_task() attempts to move the currently running task away if the next runnable task has migration disabled and therefore is pinned on the current CPU. The current task is retrieved via get_push_task() which only checks for nr_cpus_allowed == 1, but does not check whether the task has migration disabled and therefore cannot be moved either. The consequence is a pointless invocation of the migration thread which correctly observes that the task cannot be moved. Return NULL if the task has migration disabled and cannot be moved to another CPU. Fixes: a7c81556ec4d3 ("sched: Fix migrate_disable() vs rt/dl balancing") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210826133738.yiotqbtdaxzjsnfj@linutronix.de
2021-08-20sched: Allow task CPU affinity to be restricted on asymmetric systemsWill Deacon
Asymmetric systems may not offer the same level of userspace ISA support across all CPUs, meaning that some applications cannot be executed by some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do not feature support for 32-bit applications on both clusters. Although userspace can carefully manage the affinity masks for such tasks, one place where it is particularly problematic is execve() because the CPU on which the execve() is occurring may be incompatible with the new application image. In such a situation, it is desirable to restrict the affinity mask of the task and ensure that the new image is entered on a compatible CPU. From userspace's point of view, this looks the same as if the incompatible CPUs have been hotplugged off in the task's affinity mask. Similarly, if a subsequent execve() reverts to a compatible image, then the old affinity is restored if it is still valid. In preparation for restricting the affinity mask for compat tasks on arm64 systems without uniform support for 32-bit applications, introduce {force,relax}_compatible_cpus_allowed_ptr(), which respectively restrict and restore the affinity mask for a task based on the compatible CPUs. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20210730112443.23245-9-will@kernel.org
2021-08-20sched: Cgroup SCHED_IDLE supportJosh Don
This extends SCHED_IDLE to cgroups. Interface: cgroup/cpu.idle. 0: default behavior 1: SCHED_IDLE Extending SCHED_IDLE to cgroups means that we incorporate the existing aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its descendant threads towards the idle_h_nr_running count of all of its ancestor cgroups. Thus, sched_idle_rq() will work properly. Additionally, SCHED_IDLE cgroups are configured with minimum weight. There are two key differences between the per-task and per-cgroup SCHED_IDLE interface: - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to maintain their relative weights. The entity that is "idle" is the cgroup, not the tasks themselves. - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption decision is not made by comparing the current task with the woken task, but rather by comparing their matching sched_entity. A typical use-case for this is a user that creates an idle and a non-idle subtree. The non-idle subtree will dominate competition vs the idle subtree, but the idle subtree will still be high priority vs other users on the system. The latter is accomplished via comparing matching sched_entity in the waken preemption path (this could also be improved by making the sched_idle_rq() decision dependent on the perspective of a specific task). For now, we maintain the existing SCHED_IDLE semantics. Future patches may make improvements that extend how we treat SCHED_IDLE entities. The per-task_group idle field is an integer that currently only holds either a 0 or a 1. This is explicitly typed as an integer to allow for further extensions to this API. For example, a negative value may indicate a highly latency-sensitive cgroup that should be preferred for preemption/placement/etc. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210730020019.1487127-2-joshdon@google.com
2021-08-20sched: Fix Core-wide rq->lock for uninitialized CPUsPeter Zijlstra
Eugene tripped over the case where rq_lock(), as called in a for_each_possible_cpu() loop came apart because rq->core hadn't been setup yet. This is a somewhat unusual, but valid case. Rework things such that rq->core is initialized to point at itself. IOW initialize each CPU as a single threaded Core. CPU online will then join the new CPU (thread) to an existing Core where needed. For completeness sake, have CPU offline fully undo the state so as to not presume the topology will match the next time it comes online. Fixes: 9edeaea1bc45 ("sched: Core-wide rq->lock") Reported-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Tested-by: Eugene Syromiatnikov <esyr@redhat.com> Link: https://lkml.kernel.org/r/YR473ZGeKqMs6kw+@hirez.programming.kicks-ass.net
2021-08-04sched/deadline: Fix reset_on_fork reporting of DL tasksQuentin Perret
It is possible for sched_getattr() to incorrectly report the state of the reset_on_fork flag when called on a deadline task. Indeed, if the flag was set on a deadline task using sched_setattr() with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then p->sched_reset_on_fork will be set, but __setscheduler() will bail out early, which means that the dl_se->flags will not get updated by __setscheduler_params()->__setparam_dl(). Consequently, if sched_getattr() is then called on the task, __getparam_dl() will override kattr.sched_flags with the now out-of-date copy in dl_se->flags and report the stale value to userspace. To fix this, make sure to only copy the flags that are relevant to sched_deadline to and from the dl_se->flags field. Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.com
2021-07-02sched/uclamp: Ignore max aggregation if rq is idleXuewen Yan
When a task wakes up on an idle rq, uclamp_rq_util_with() would max aggregate with rq value. But since there is no task enqueued yet, the values are stale based on the last task that was running. When the new task actually wakes up and enqueued, then the rq uclamp values should reflect that of the newly woken up task effective uclamp values. This is a problem particularly for uclamp_max because it default to 1024. If a task p with uclamp_max = 512 wakes up, then max aggregation would ignore the capping that should apply when this task is enqueued, which is wrong. Fix that by ignoring max aggregation if the rq is idle since in that case the effective uclamp value of the rq will be the ones of the task that will wake up. Fixes: 9d20ad7dfc9a ("sched/uclamp: Add uclamp_util_with()") Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> [qias: Changelog] Reviewed-by: Qais Yousef <qais.yousef@arm.com> Link: https://lore.kernel.org/r/20210630141204.8197-1-xuewen.yan94@gmail.com
2021-06-28sched/sysctl: Move extern sysctl declarations to sched.hHailong Liu
Since commit '8a99b6833c88(sched: Move SCHED_DEBUG sysctl to debugfs)', SCHED_DEBUG sysctls are moved to debugfs, so these extern sysctls in include/linux/sched/sysctl.h are no longer needed for sysctl.c, even some are no longer needed. So move those extern sysctls that needed by kernel/sched/debug.c to kernel/sched/sched.h, and remove others that are no longer needed. Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210606115451.26745-1-liuhailongg6@163.com
2021-06-24sched/fair: Introduce the burstable CFS controllerHuaixin Chang
The CFS bandwidth controller limits CPU requests of a task group to quota during each period. However, parallel workloads might be bursty so that they get throttled even when their average utilization is under quota. And they are latency sensitive at the same time so that throttling them is undesired. We borrow time now against our future underrun, at the cost of increased interference against the other system users. All nicely bounded. Traditional (UP-EDF) bandwidth control is something like: (U = \Sum u_i) <= 1 This guaranteeds both that every deadline is met and that the system is stable. After all, if U were > 1, then for every second of walltime, we'd have to run more than a second of program time, and obviously miss our deadline, but the next deadline will be further out still, there is never time to catch up, unbounded fail. This work observes that a workload doesn't always executes the full quota; this enables one to describe u_i as a statistical distribution. For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100) (the traditional WCET). This effectively allows u to be smaller, increasing the efficiency (we can pack more tasks in the system), but at the cost of missing deadlines when all the odds line up. However, it does maintain stability, since every overrun must be paired with an underrun as long as our x is above the average. That is, suppose we have 2 tasks, both specify a p(95) value, then we have a p(95)*p(95) = 90.25% chance both tasks are within their quota and everything is good. At the same time we have a p(5)p(5) = 0.25% chance both tasks will exceed their quota at the same time (guaranteed deadline fail). Somewhere in between there's a threshold where one exceeds and the other doesn't underrun enough to compensate; this depends on the specific CDFs. At the same time, we can say that the worst case deadline miss, will be \Sum e_i; that is, there is a bounded tardiness (under the assumption that x+e is indeed WCET). The benefit of burst is seen when testing with schbench. Default value of kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used. mkdir /sys/fs/cgroup/cpu/test echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us ./schbench -m 1 -t 3 -r 20 -c 80000 -R 10 The average CPU usage is at 80%. I run this for 10 times, and got long tail latency for 6 times and got throttled for 8 times. Tail latencies are shown below, and it wasn't the worst case. Latency percentiles (usec) 50.0000th: 19872 75.0000th: 21344 90.0000th: 22176 95.0000th: 22496 *99.0000th: 22752 99.5000th: 22752 99.9000th: 22752 min=0, max=22727 rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44% The interferenece when using burst is valued by the possibilities for missing the deadline and the average WCET. Test results showed that when there many cgroups or CPU is under utilized, the interference is limited. More details are shown in: https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/ Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
2021-06-17sched/fair: Age the average idle timePeter Zijlstra
This is a partial forward-port of Peter Ziljstra's work first posted at: https://lore.kernel.org/lkml/20180530142236.667774973@infradead.org/ Currently select_idle_cpu()'s proportional scheme uses the average idle time *for when we are idle*, that is temporally challenged. When a CPU is not at all idle, we'll happily continue using whatever value we did see when the CPU goes idle. To fix this, introduce a separate average idle and age it (the existing value still makes sense for things like new-idle balancing, which happens when we do go idle). The overall goal is to not spend more time scanning for idle CPUs than we're idle for. Otherwise we're inhibiting work. This means that we need to consider the cost over all the wake-ups between consecutive idle periods. To track this, the scan cost is subtracted from the estimated average idle time. The impact of this patch is related to workloads that have domains that are fully busy or overloaded. Without the patch, the scan depth may be too high because a CPU is not reaching idle. Due to the nature of the patch, this is a regression magnet. It potentially wins when domains are almost fully busy or overloaded -- at that point searches are likely to fail but idle is not being aged as CPUs are active so search depth is too large and useless. It will potentially show regressions when there are idle CPUs and a deep search is beneficial. This tbench result on a 2-socket broadwell machine partially illustates the problem 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Hmean 1 445.02 ( 0.00%) 451.36 * 1.42%* Hmean 2 830.69 ( 0.00%) 846.03 * 1.85%* Hmean 4 1350.80 ( 0.00%) 1505.56 * 11.46%* Hmean 8 2888.88 ( 0.00%) 2586.40 * -10.47%* Hmean 16 5248.18 ( 0.00%) 5305.26 * 1.09%* Hmean 32 8914.03 ( 0.00%) 9191.35 * 3.11%* Hmean 64 10663.10 ( 0.00%) 10192.65 * -4.41%* Hmean 128 18043.89 ( 0.00%) 18478.92 * 2.41%* Hmean 256 16530.89 ( 0.00%) 17637.16 * 6.69%* Hmean 320 16451.13 ( 0.00%) 17270.97 * 4.98%* Note that 8 was a regression point where a deeper search would have helped but it gains for high thread counts when searches are useless. Hackbench is a more extreme example although not perfect as the tasks idle rapidly hackbench-process-pipes 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Amean 1 0.3950 ( 0.00%) 0.3887 ( 1.60%) Amean 4 0.9450 ( 0.00%) 0.9677 ( -2.40%) Amean 7 1.4737 ( 0.00%) 1.4890 ( -1.04%) Amean 12 2.3507 ( 0.00%) 2.3360 * 0.62%* Amean 21 4.0807 ( 0.00%) 4.0993 * -0.46%* Amean 30 5.6820 ( 0.00%) 5.7510 * -1.21%* Amean 48 8.7913 ( 0.00%) 8.7383 ( 0.60%) Amean 79 14.3880 ( 0.00%) 13.9343 * 3.15%* Amean 110 21.2233 ( 0.00%) 19.4263 * 8.47%* Amean 141 28.2930 ( 0.00%) 25.1003 * 11.28%* Amean 172 34.7570 ( 0.00%) 30.7527 * 11.52%* Amean 203 41.0083 ( 0.00%) 36.4267 * 11.17%* Amean 234 47.7133 ( 0.00%) 42.0623 * 11.84%* Amean 265 53.0353 ( 0.00%) 47.7720 * 9.92%* Amean 296 60.0170 ( 0.00%) 53.4273 * 10.98%* Stddev 1 0.0052 ( 0.00%) 0.0025 ( 51.57%) Stddev 4 0.0357 ( 0.00%) 0.0370 ( -3.75%) Stddev 7 0.0190 ( 0.00%) 0.0298 ( -56.64%) Stddev 12 0.0064 ( 0.00%) 0.0095 ( -48.38%) Stddev 21 0.0065 ( 0.00%) 0.0097 ( -49.28%) Stddev 30 0.0185 ( 0.00%) 0.0295 ( -59.54%) Stddev 48 0.0559 ( 0.00%) 0.0168 ( 69.92%) Stddev 79 0.1559 ( 0.00%) 0.0278 ( 82.17%) Stddev 110 1.1728 ( 0.00%) 0.0532 ( 95.47%) Stddev 141 0.7867 ( 0.00%) 0.0968 ( 87.69%) Stddev 172 1.0255 ( 0.00%) 0.0420 ( 95.91%) Stddev 203 0.8106 ( 0.00%) 0.1384 ( 82.92%) Stddev 234 1.1949 ( 0.00%) 0.1328 ( 88.89%) Stddev 265 0.9231 ( 0.00%) 0.0820 ( 91.11%) Stddev 296 1.0456 ( 0.00%) 0.1327 ( 87.31%) Again, higher thread counts benefit and the standard deviation shows that results are also a lot more stable when the idle time is aged. The patch potentially matters when a socket was multiple LLCs as the maximum search depth is lower. However, some of the test results were suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and other results were not dramatically different to other mcahines. Given the nature of the patch, Peter's full series is not being forward ported as each part should stand on its own. Preferably they would be merged at different times to reduce the risk of false bisections. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210615111611.GH30378@techsingularity.net
2021-05-12sched: Make multiple runqueue task counters 32-bitAlexey Dobriyan
Make: struct dl_rq::dl_nr_migratory struct dl_rq::dl_nr_running struct rt_rq::rt_nr_boosted struct rt_rq::rt_nr_migratory struct rt_rq::rt_nr_total struct rq::nr_uninterruptible 32-bit. If total number of tasks can't exceed 2**32 (and less due to futex pid limits), then per-runqueue counters can't as well. This patchset has been sponsored by REX Prefix Eradication Society. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-4-adobriyan@gmail.com
2021-05-12sched: Trivial core scheduling cookie managementPeter Zijlstra
In order to not have to use pid_struct, create a new, smaller, structure to manage task cookies for core scheduling. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.919768100@infradead.org
2021-05-12sched: Migration changes for core schedulingAubrey Li
- Don't migrate if there is a cookie mismatch Load balance tries to move task from busiest CPU to the destination CPU. When core scheduling is enabled, if the task's cookie does not match with the destination CPU's core cookie, this task may be skipped by this CPU. This mitigates the forced idle time on the destination CPU. - Select cookie matched idle CPU In the fast path of task wakeup, select the first cookie matched idle CPU instead of the first idle CPU. - Find cookie matched idlest CPU In the slow path of task wakeup, find the idlest CPU whose core cookie matches with task's cookie Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.860083871@infradead.org