From d83a7cb375eec21f04c83542395d08b2f6641da2 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Mon, 13 Feb 2017 19:42:40 -0600 Subject: livepatch: change to a per-task consistency model Change livepatch to use a basic per-task consistency model. This is the foundation which will eventually enable us to patch those ~10% of security patches which change function or data semantics. This is the biggest remaining piece needed to make livepatch more generally useful. This code stems from the design proposal made by Vojtech [1] in November 2014. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task consistency and syscall barrier switching combined with kpatch's stack trace switching. There are also a number of fallback options which make it quite flexible. Patches are applied on a per-task basis, when the task is deemed safe to switch over. When a patch is enabled, livepatch enters into a transition state where tasks are converging to the patched state. Usually this transition state can complete in a few seconds. The same sequence occurs when a patch is disabled, except the tasks converge from the patched state to the unpatched state. An interrupt handler inherits the patched state of the task it interrupts. The same is true for forked tasks: the child inherits the patched state of the parent. Livepatch uses several complementary approaches to determine when it's safe to patch tasks: 1. The first and most effective approach is stack checking of sleeping tasks. If no affected functions are on the stack of a given task, the task is patched. In most cases this will patch most or all of the tasks on the first try. Otherwise it'll keep trying periodically. This option is only available if the architecture has reliable stacks (HAVE_RELIABLE_STACKTRACE). 2. The second approach, if needed, is kernel exit switching. A task is switched when it returns to user space from a system call, a user space IRQ, or a signal. It's useful in the following cases: a) Patching I/O-bound user tasks which are sleeping on an affected function. In this case you have to send SIGSTOP and SIGCONT to force it to exit the kernel and be patched. b) Patching CPU-bound user tasks. If the task is highly CPU-bound then it will get patched the next time it gets interrupted by an IRQ. c) In the future it could be useful for applying patches for architectures which don't yet have HAVE_RELIABLE_STACKTRACE. In this case you would have to signal most of the tasks on the system. However this isn't supported yet because there's currently no way to patch kthreads without HAVE_RELIABLE_STACKTRACE. 3. For idle "swapper" tasks, since they don't ever exit the kernel, they instead have a klp_update_patch_state() call in the idle loop which allows them to be patched before the CPU enters the idle state. (Note there's not yet such an approach for kthreads.) All the above approaches may be skipped by setting the 'immediate' flag in the 'klp_patch' struct, which will disable per-task consistency and patch all tasks immediately. This can be useful if the patch doesn't change any function or data semantics. Note that, even with this flag set, it's possible that some tasks may still be running with an old version of the function, until that function returns. There's also an 'immediate' flag in the 'klp_func' struct which allows you to specify that certain functions in the patch can be applied without per-task consistency. 
This might be useful if you want to patch a common function like schedule(), and the function change doesn't need consistency but the rest of the patch does. For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user must set patch->immediate which causes all tasks to be patched immediately. This option should be used with care, only when the patch doesn't change any function or data semantics. In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE may be allowed to use per-task consistency if we can come up with another way to patch kthreads. The /sys/kernel/livepatch/<patch>/transition file shows whether a patch is in transition. Only a single patch (the topmost patch on the stack) can be in transition at a given time. A patch can remain in transition indefinitely if any of the tasks are stuck in the initial patch state. A transition can be reversed and effectively canceled by writing the opposite value to the /sys/kernel/livepatch/<patch>/enabled file while the transition is in progress. Then all the tasks will attempt to converge back to the original patch state. [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz Signed-off-by: Josh Poimboeuf Acked-by: Miroslav Benes Acked-by: Ingo Molnar # for the scheduler changes Signed-off-by: Jiri Kosina --- kernel/sched/idle.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel/sched') diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index ac6d5176463d..2a25a9ec2c6e 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -10,6 +10,7 @@ #include #include #include +#include <linux/livepatch.h> #include @@ -265,6 +266,9 @@ static void do_idle(void) sched_ttwu_pending(); schedule_preempt_disabled(); + + if (unlikely(klp_patch_pending(current))) + klp_update_patch_state(current); } bool cpu_in_idle(unsigned long pc) -- cgit From 3e777f9909483b603946685d88acfae89f31b07b Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 28 Feb 2017 15:50:30 -0500 Subject: sched/rt: Add comments describing the RT IPI pull method While looking into optimizations for the RT scheduler IPI logic, I realized that the comments describing it are lacking. It deserves a lengthy description of its design. Signed-off-by: Steven Rostedt (VMware) Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Clark Williams Cc: Daniel Bristot de Oliveira Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20170228155030.30c69068@gandalf.local.home [ Small typographical edits. ] Signed-off-by: Ingo Molnar --- kernel/sched/rt.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 81 insertions(+) (limited to 'kernel/sched') diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 9f3e40226dec..979b7341008a 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1927,6 +1927,87 @@ static int find_next_push_cpu(struct rq *rq) #define RT_PUSH_IPI_EXECUTING 1 #define RT_PUSH_IPI_RESTART 2 +/* + * When a high priority task schedules out from a CPU and a lower priority + * task is scheduled in, a check is made to see if there are any RT tasks + * on other CPUs that are waiting to run because a higher priority RT task + * is currently running on its CPU. In this case, the CPU with multiple RT + * tasks queued on it (overloaded) needs to be notified that a CPU has opened + * up that may be able to run one of its non-running queued RT tasks.
+ * + * On large CPU boxes, there's the case that several CPUs could schedule + * a lower priority task at the same time, in which case each will look for + * any overloaded CPUs that it could pull a task from. To do this, the runqueue + * lock must be taken from that overloaded CPU. Having 10s of CPUs all fighting + * for a single overloaded CPU's runqueue lock can produce a large latency. + * (This has actually been observed on large boxes running cyclictest). + * Instead of taking the runqueue lock of the overloaded CPU, each of the + * CPUs that scheduled a lower priority task simply sends an IPI to the + * overloaded CPU. An IPI is much cheaper than taking a runqueue lock with + * lots of contention. The overloaded CPU will look to push its non-running + * RT task off, and if it does, it can then ignore the other IPIs coming + * in, and just pass those IPIs off to any other overloaded CPU. + * + * When a CPU schedules a lower priority task, it only sends an IPI to + * the "next" CPU that has overloaded RT tasks. This prevents IPI storms, + * as having 10 CPUs scheduling lower priority tasks and 10 CPUs with + * RT overloaded tasks would cause 100 IPIs to go out at once. + * + * The overloaded RT CPU, when receiving an IPI, will try to push off its + * overloaded RT tasks and then send an IPI to the next CPU that has + * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks + * have completed. Just because a CPU may have pushed off its own overloaded + * RT task does not mean it should stop sending the IPI around to other + * overloaded CPUs. There may be another RT task waiting to run on one of + * those CPUs that is of higher priority than the one that was just + * pushed. + * + * An optimization that could possibly be made is to make a CPU array similar + * to the cpupri array mask of all running RT tasks, but for the overloaded + * case, then the IPI could be sent to only the CPU with the highest priority + * RT task waiting, and that CPU could send off further IPIs to the CPU with + * the next highest waiting task. Since the overloaded case is much less likely + * to happen, the complexity of this implementation may not be worth it. + * Instead, just send an IPI around to all overloaded CPUs. + * + * The rq->rt.push_flags holds the status of the IPI that is going around. + * A run queue can only send out a single IPI at a time. The possible flags + * for rq->rt.push_flags are: + * + * (None or zero): No IPI is going around for the current rq + * RT_PUSH_IPI_EXECUTING: An IPI for the rq is being passed around + * RT_PUSH_IPI_RESTART: The priority of the running task for the rq + * has changed, and the IPI should restart + * circulating the overloaded CPUs again. + * + * rq->rt.push_cpu contains the CPU that is being sent the IPI. It is updated + * before sending to the next CPU. + * + * Instead of having all CPUs that schedule a lower priority task send + * an IPI to the same "first" CPU in the RT overload mask, they send it + * to the next overloaded CPU after their own CPU. This helps distribute + * the work when there's more than one overloaded CPU and multiple CPUs + * scheduling in lower priority tasks. + * + * When a rq schedules a lower priority task than what was currently + * running, the next CPU with overloaded RT tasks is examined first. + * That is, if CPU 1 and 5 are overloaded, and CPU 3 schedules a lower + * priority task, it will send an IPI first to CPU 5, then CPU 5 will + * send to CPU 1 if it is still overloaded.
CPU 1 will clear the + * rq->rt.push_flags if RT_PUSH_IPI_RESTART is not set. + * + * The first CPU to notice IPI_RESTART is set, will clear that flag and then + * send an IPI to the next overloaded CPU after the rq->cpu and not the next + * CPU after push_cpu. That is, if CPU 1, 4 and 5 are overloaded when CPU 3 + * schedules a lower priority task, and the IPI_RESTART gets set while the + * handling is being done on CPU 5, it will clear the flag and send it back to + * CPU 4 instead of CPU 1. + * + * Note, the above logic can be disabled by turning off the sched_feature + * RT_PUSH_IPI. Then the rq lock of the overloaded CPU will simply be + * taken by the CPU requesting a pull and the waiting RT task will be pulled + * by that CPU. This may be fine for machines with few CPUs. + */ static void tell_cpu_to_push(struct rq *rq) { int cpu; -- cgit From 26ae58d23b94a075ae724fd18783a3773131cfbc Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 3 Oct 2016 16:53:49 +0200 Subject: sched/core: Add WARNING for multiple update_rq_clock() calls Now that we have no missing calls, add a warning to find multiple calls. By having only a single update_rq_clock() call per rq-lock section, the section appears 'atomic' wrt time. Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 3 +++ kernel/sched/features.h | 7 +++++++ 2 files changed, 10 insertions(+) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3b31fc05a0f1..1bd15d0d0307 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -233,8 +233,11 @@ void update_rq_clock(struct rq *rq) return; #ifdef CONFIG_SCHED_DEBUG + if (sched_feat(WARN_DOUBLE_CLOCK)) + SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED); rq->clock_update_flags |= RQCF_UPDATED; #endif + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; if (delta < 0) return; diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 1b3c8189b286..11192e0cb122 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -56,6 +56,13 @@ SCHED_FEAT(TTWU_QUEUE, true) */ SCHED_FEAT(SIS_AVG_CPU, false) +/* + * Issue a WARN when we do multiple update_rq_clock() calls + * in a single rq->lock section. Default disabled because the + * annotations are not complete. + */ +SCHED_FEAT(WARN_DOUBLE_CLOCK, false) + #ifdef HAVE_RT_PUSH_IPI /* * In order to avoid a thundering herd attack of CPUs that are -- cgit From 8a8c69c32778865affcedc2111bb5d938b50516f Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 4 Oct 2016 16:04:35 +0200 Subject: sched/core: Add rq->lock wrappers The missing update_rq_clock() check can work with partial rq->lock wrappery, since a missing wrapper can cause the warning to not be emitted when it should have, but cannot cause the warning to trigger when it should not have. The duplicate update_rq_clock() check however can cause false warnings to trigger. Therefore add more comprehensive rq->lock wrappery. 
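The invariant the wrappers enforce can be modeled in isolation. The following is a standalone userspace sketch, not kernel code: the names mirror the patch, but the bodies are simplified stand-ins, and the real pin state lives in struct rq_flags rather than in the rq itself:

    #include <stdio.h>

    #define RQCF_UPDATED 0x04

    struct rq { unsigned int clock_update_flags; };

    /* raw_spin_lock() + rq_pin_lock(): a fresh section starts with no update seen */
    static void rq_lock(struct rq *rq)
    {
            rq->clock_update_flags = 0;
    }

    /* rq_unpin_lock() + raw_spin_unlock() */
    static void rq_unlock(struct rq *rq)
    {
            (void)rq;
    }

    static void update_rq_clock(struct rq *rq)
    {
            if (rq->clock_update_flags & RQCF_UPDATED)
                    fprintf(stderr, "WARN_DOUBLE_CLOCK: rq->clock_update_flags & RQCF_UPDATED\n");
            rq->clock_update_flags |= RQCF_UPDATED;
    }

    int main(void)
    {
            struct rq rq = { 0 };

            rq_lock(&rq);
            update_rq_clock(&rq);   /* first update in this rq->lock section: fine */
            update_rq_clock(&rq);   /* duplicate update in the same section: warns */
            rq_unlock(&rq);
            return 0;
    }

With complete wrappery, every section boundary is visible to the debug check, so the duplicate-update warning above cannot fire spuriously.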
Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 155 ++++++++++++++++++++++++--------------------------- kernel/sched/fair.c | 71 ++++++++++++----------- kernel/sched/sched.h | 57 +++++++++++++++++++ 3 files changed, 171 insertions(+), 112 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1bd15d0d0307..c5a514b1668d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -85,21 +85,6 @@ int sysctl_sched_rt_runtime = 950000; /* CPUs with isolated domains */ cpumask_var_t cpu_isolated_map; -/* - * this_rq_lock - lock this runqueue and disable interrupts. - */ -static struct rq *this_rq_lock(void) - __acquires(rq->lock) -{ - struct rq *rq; - - local_irq_disable(); - rq = this_rq(); - raw_spin_lock(&rq->lock); - - return rq; -} - /* * __task_rq_lock - lock the rq @p resides on. */ @@ -264,13 +249,14 @@ static void hrtick_clear(struct rq *rq) static enum hrtimer_restart hrtick(struct hrtimer *timer) { struct rq *rq = container_of(timer, struct rq, hrtick_timer); + struct rq_flags rf; WARN_ON_ONCE(cpu_of(rq) != smp_processor_id()); - raw_spin_lock(&rq->lock); + rq_lock(rq, &rf); update_rq_clock(rq); rq->curr->sched_class->task_tick(rq, rq->curr, 1); - raw_spin_unlock(&rq->lock); + rq_unlock(rq, &rf); return HRTIMER_NORESTART; } @@ -290,11 +276,12 @@ static void __hrtick_restart(struct rq *rq) static void __hrtick_start(void *arg) { struct rq *rq = arg; + struct rq_flags rf; - raw_spin_lock(&rq->lock); + rq_lock(rq, &rf); __hrtick_restart(rq); rq->hrtick_csd_pending = 0; - raw_spin_unlock(&rq->lock); + rq_unlock(rq, &rf); } /* @@ -949,18 +936,19 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) * * Returns (locked) new rq. Old rq's lock is released. */ -static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new_cpu) +static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf, + struct task_struct *p, int new_cpu) { lockdep_assert_held(&rq->lock); p->on_rq = TASK_ON_RQ_MIGRATING; dequeue_task(rq, p, 0); set_task_cpu(p, new_cpu); - raw_spin_unlock(&rq->lock); + rq_unlock(rq, rf); rq = cpu_rq(new_cpu); - raw_spin_lock(&rq->lock); + rq_lock(rq, rf); BUG_ON(task_cpu(p) != new_cpu); enqueue_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; @@ -983,7 +971,8 @@ struct migration_arg { * So we race with normal scheduler movements, but that's OK, as long * as the task is no longer on this CPU. 
*/ -static struct rq *__migrate_task(struct rq *rq, struct task_struct *p, int dest_cpu) +static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf, + struct task_struct *p, int dest_cpu) { if (unlikely(!cpu_active(dest_cpu))) return rq; @@ -992,7 +981,7 @@ static struct rq *__migrate_task(struct rq *rq, struct task_struct *p, int dest_ if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) return rq; - rq = move_queued_task(rq, p, dest_cpu); + rq = move_queued_task(rq, rf, p, dest_cpu); return rq; } @@ -1007,6 +996,7 @@ static int migration_cpu_stop(void *data) struct migration_arg *arg = data; struct task_struct *p = arg->task; struct rq *rq = this_rq(); + struct rq_flags rf; /* * The original target CPU might have gone down and we might @@ -1021,7 +1011,7 @@ static int migration_cpu_stop(void *data) sched_ttwu_pending(); raw_spin_lock(&p->pi_lock); - raw_spin_lock(&rq->lock); + rq_lock(rq, &rf); /* * If task_rq(p) != rq, it cannot be migrated here, because we're * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because @@ -1029,11 +1019,11 @@ static int migration_cpu_stop(void *data) */ if (task_rq(p) == rq) { if (task_on_rq_queued(p)) - rq = __migrate_task(rq, p, arg->dest_cpu); + rq = __migrate_task(rq, &rf, p, arg->dest_cpu); else p->wake_cpu = arg->dest_cpu; } - raw_spin_unlock(&rq->lock); + rq_unlock(rq, &rf); raw_spin_unlock(&p->pi_lock); local_irq_enable(); @@ -1153,9 +1143,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p, * OK, since we're going to drop the lock immediately * afterwards anyway. */ - rq_unpin_lock(rq, &rf); - rq = move_queued_task(rq, p, dest_cpu); - rq_repin_lock(rq, &rf); + rq = move_queued_task(rq, &rf, p, dest_cpu); } out: task_rq_unlock(rq, p, &rf); @@ -1220,16 +1208,24 @@ static void __migrate_swap_task(struct task_struct *p, int cpu) { if (task_on_rq_queued(p)) { struct rq *src_rq, *dst_rq; + struct rq_flags srf, drf; src_rq = task_rq(p); dst_rq = cpu_rq(cpu); + rq_pin_lock(src_rq, &srf); + rq_pin_lock(dst_rq, &drf); + p->on_rq = TASK_ON_RQ_MIGRATING; deactivate_task(src_rq, p, 0); set_task_cpu(p, cpu); activate_task(dst_rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; check_preempt_curr(dst_rq, p, 0); + + rq_unpin_lock(dst_rq, &drf); + rq_unpin_lock(src_rq, &srf); + } else { /* * Task isn't running anymore; make it appear like we migrated @@ -1729,14 +1725,12 @@ void sched_ttwu_pending(void) struct rq *rq = this_rq(); struct llist_node *llist = llist_del_all(&rq->wake_list); struct task_struct *p; - unsigned long flags; struct rq_flags rf; if (!llist) return; - raw_spin_lock_irqsave(&rq->lock, flags); - rq_pin_lock(rq, &rf); + rq_lock_irqsave(rq, &rf); while (llist) { int wake_flags = 0; @@ -1750,8 +1744,7 @@ void sched_ttwu_pending(void) ttwu_do_activate(rq, p, wake_flags, &rf); } - rq_unpin_lock(rq, &rf); - raw_spin_unlock_irqrestore(&rq->lock, flags); + rq_unlock_irqrestore(rq, &rf); } void scheduler_ipi(void) @@ -1809,7 +1802,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags) void wake_up_if_idle(int cpu) { struct rq *rq = cpu_rq(cpu); - unsigned long flags; + struct rq_flags rf; rcu_read_lock(); @@ -1819,11 +1812,11 @@ void wake_up_if_idle(int cpu) if (set_nr_if_polling(rq->idle)) { trace_sched_wake_idle_without_ipi(cpu); } else { - raw_spin_lock_irqsave(&rq->lock, flags); + rq_lock_irqsave(rq, &rf); if (is_idle_task(rq->curr)) smp_send_reschedule(cpu); /* Else CPU is not idle, do nothing here: */ - raw_spin_unlock_irqrestore(&rq->lock, flags); + rq_unlock_irqrestore(rq, &rf); } out: @@ -1849,11 
+1842,9 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags) } #endif - raw_spin_lock(&rq->lock); - rq_pin_lock(rq, &rf); + rq_lock(rq, &rf); ttwu_do_activate(rq, p, wake_flags, &rf); - rq_unpin_lock(rq, &rf); - raw_spin_unlock(&rq->lock); + rq_unlock(rq, &rf); } /* @@ -2100,11 +2091,9 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf) * disabled avoiding further scheduler activity on it and we've * not yet picked a replacement task. */ - rq_unpin_lock(rq, rf); - raw_spin_unlock(&rq->lock); + rq_unlock(rq, rf); raw_spin_lock(&p->pi_lock); - raw_spin_lock(&rq->lock); - rq_repin_lock(rq, rf); + rq_relock(rq, rf); } if (!(p->state & TASK_NORMAL)) @@ -2778,9 +2767,9 @@ static void __balance_callback(struct rq *rq) { struct callback_head *head, *next; void (*func)(struct rq *rq); - unsigned long flags; + struct rq_flags rf; - raw_spin_lock_irqsave(&rq->lock, flags); + rq_lock_irqsave(rq, &rf); head = rq->balance_callback; rq->balance_callback = NULL; while (head) { @@ -2791,7 +2780,7 @@ static void __balance_callback(struct rq *rq) func(rq); } - raw_spin_unlock_irqrestore(&rq->lock, flags); + rq_unlock_irqrestore(rq, &rf); } static inline void balance_callback(struct rq *rq) @@ -3096,15 +3085,18 @@ void scheduler_tick(void) int cpu = smp_processor_id(); struct rq *rq = cpu_rq(cpu); struct task_struct *curr = rq->curr; + struct rq_flags rf; sched_clock_tick(); - raw_spin_lock(&rq->lock); + rq_lock(rq, &rf); + update_rq_clock(rq); curr->sched_class->task_tick(rq, curr, 0); cpu_load_update_active(rq); calc_global_load_tick(rq); - raw_spin_unlock(&rq->lock); + + rq_unlock(rq, &rf); perf_event_task_tick(); @@ -3389,8 +3381,7 @@ static void __sched notrace __schedule(bool preempt) * done by the caller to avoid the race with signal_wake_up(). */ smp_mb__before_spinlock(); - raw_spin_lock(&rq->lock); - rq_pin_lock(rq, &rf); + rq_lock(rq, &rf); /* Promote REQ to ACT */ rq->clock_update_flags <<= 1; @@ -3442,8 +3433,7 @@ static void __sched notrace __schedule(bool preempt) rq = context_switch(rq, prev, next, &rf); } else { rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); - rq_unpin_lock(rq, &rf); - raw_spin_unlock_irq(&rq->lock); + rq_unlock_irq(rq, &rf); } balance_callback(rq); @@ -4926,7 +4916,12 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, */ SYSCALL_DEFINE0(sched_yield) { - struct rq *rq = this_rq_lock(); + struct rq_flags rf; + struct rq *rq; + + local_irq_disable(); + rq = this_rq(); + rq_lock(rq, &rf); schedstat_inc(rq->yld_count); current->sched_class->yield_task(rq); @@ -4935,9 +4930,8 @@ SYSCALL_DEFINE0(sched_yield) * Since we are going to call schedule() anyway, there's * no need to preempt or enable interrupts: */ - __release(rq->lock); - spin_release(&rq->lock.dep_map, 1, _THIS_IP_); - do_raw_spin_unlock(&rq->lock); + preempt_disable(); + rq_unlock(rq, &rf); sched_preempt_enable_no_resched(); schedule(); @@ -5582,11 +5576,11 @@ static struct task_struct fake_task = { * there's no concurrency possible, we hold the required locks anyway * because of lock validation efforts. 
*/ -static void migrate_tasks(struct rq *dead_rq) +static void migrate_tasks(struct rq *dead_rq, struct rq_flags *rf) { struct rq *rq = dead_rq; struct task_struct *next, *stop = rq->stop; - struct rq_flags rf; + struct rq_flags orf = *rf; int dest_cpu; /* @@ -5605,9 +5599,7 @@ static void migrate_tasks(struct rq *dead_rq) * class method both need to have an up-to-date * value of rq->clock[_task] */ - rq_pin_lock(rq, &rf); update_rq_clock(rq); - rq_unpin_lock(rq, &rf); for (;;) { /* @@ -5620,8 +5612,7 @@ static void migrate_tasks(struct rq *dead_rq) /* * pick_next_task() assumes pinned rq->lock: */ - rq_repin_lock(rq, &rf); - next = pick_next_task(rq, &fake_task, &rf); + next = pick_next_task(rq, &fake_task, rf); BUG_ON(!next); next->sched_class->put_prev_task(rq, next); @@ -5634,10 +5625,9 @@ static void migrate_tasks(struct rq *dead_rq) * because !cpu_active at this point, which means load-balance * will not interfere. Also, stop-machine. */ - rq_unpin_lock(rq, &rf); - raw_spin_unlock(&rq->lock); + rq_unlock(rq, rf); raw_spin_lock(&next->pi_lock); - raw_spin_lock(&rq->lock); + rq_relock(rq, rf); /* * Since we're inside stop-machine, _nothing_ should have @@ -5651,12 +5641,12 @@ static void migrate_tasks(struct rq *dead_rq) /* Find suitable destination for @next, with force if needed. */ dest_cpu = select_fallback_rq(dead_rq->cpu, next); - - rq = __migrate_task(rq, next, dest_cpu); + rq = __migrate_task(rq, rf, next, dest_cpu); if (rq != dead_rq) { - raw_spin_unlock(&rq->lock); + rq_unlock(rq, rf); rq = dead_rq; - raw_spin_lock(&rq->lock); + *rf = orf; + rq_relock(rq, rf); } raw_spin_unlock(&next->pi_lock); } @@ -5769,7 +5759,7 @@ static int cpuset_cpu_inactive(unsigned int cpu) int sched_cpu_activate(unsigned int cpu) { struct rq *rq = cpu_rq(cpu); - unsigned long flags; + struct rq_flags rf; set_cpu_active(cpu, true); @@ -5787,12 +5777,12 @@ int sched_cpu_activate(unsigned int cpu) * 2) At runtime, if cpuset_cpu_active() fails to rebuild the * domains. 
*/ - raw_spin_lock_irqsave(&rq->lock, flags); + rq_lock_irqsave(rq, &rf); if (rq->rd) { BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span)); set_rq_online(rq); } - raw_spin_unlock_irqrestore(&rq->lock, flags); + rq_unlock_irqrestore(rq, &rf); update_max_interval(); @@ -5850,18 +5840,20 @@ int sched_cpu_starting(unsigned int cpu) int sched_cpu_dying(unsigned int cpu) { struct rq *rq = cpu_rq(cpu); - unsigned long flags; + struct rq_flags rf; /* Handle pending wakeups and then migrate everything off */ sched_ttwu_pending(); - raw_spin_lock_irqsave(&rq->lock, flags); + + rq_lock_irqsave(rq, &rf); if (rq->rd) { BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span)); set_rq_offline(rq); } - migrate_tasks(rq); + migrate_tasks(rq, &rf); BUG_ON(rq->nr_running != 1); - raw_spin_unlock_irqrestore(&rq->lock, flags); + rq_unlock_irqrestore(rq, &rf); + calc_load_migrate(rq); update_max_interval(); nohz_balance_exit_idle(cpu); @@ -7011,14 +7003,15 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) for_each_online_cpu(i) { struct cfs_rq *cfs_rq = tg->cfs_rq[i]; struct rq *rq = cfs_rq->rq; + struct rq_flags rf; - raw_spin_lock_irq(&rq->lock); + rq_lock_irq(rq, &rf); cfs_rq->runtime_enabled = runtime_enabled; cfs_rq->runtime_remaining = 0; if (cfs_rq->throttled) unthrottle_cfs_rq(cfs_rq); - raw_spin_unlock_irq(&rq->lock); + rq_unlock_irq(rq, &rf); } if (runtime_was_enabled && !runtime_enabled) cfs_bandwidth_usage_dec(); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dea138964b91..72b081b9a249 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4271,8 +4271,9 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq, throttled_list) { struct rq *rq = rq_of(cfs_rq); + struct rq_flags rf; - raw_spin_lock(&rq->lock); + rq_lock(rq, &rf); if (!cfs_rq_throttled(cfs_rq)) goto next; @@ -4289,7 +4290,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, unthrottle_cfs_rq(cfs_rq); next: - raw_spin_unlock(&rq->lock); + rq_unlock(rq, &rf); if (!remaining) break; @@ -5097,15 +5098,16 @@ void cpu_load_update_nohz_stop(void) unsigned long curr_jiffies = READ_ONCE(jiffies); struct rq *this_rq = this_rq(); unsigned long load; + struct rq_flags rf; if (curr_jiffies == this_rq->last_load_update_tick) return; load = weighted_cpuload(cpu_of(this_rq)); - raw_spin_lock(&this_rq->lock); + rq_lock(this_rq, &rf); update_rq_clock(this_rq); cpu_load_update_nohz(this_rq, curr_jiffies, load); - raw_spin_unlock(&this_rq->lock); + rq_unlock(this_rq, &rf); } #else /* !CONFIG_NO_HZ_COMMON */ static inline void cpu_load_update_nohz(struct rq *this_rq, @@ -6913,9 +6915,11 @@ static void attach_task(struct rq *rq, struct task_struct *p) */ static void attach_one_task(struct rq *rq, struct task_struct *p) { - raw_spin_lock(&rq->lock); + struct rq_flags rf; + + rq_lock(rq, &rf); attach_task(rq, p); - raw_spin_unlock(&rq->lock); + rq_unlock(rq, &rf); } /* @@ -6926,8 +6930,9 @@ static void attach_tasks(struct lb_env *env) { struct list_head *tasks = &env->tasks; struct task_struct *p; + struct rq_flags rf; - raw_spin_lock(&env->dst_rq->lock); + rq_lock(env->dst_rq, &rf); while (!list_empty(tasks)) { p = list_first_entry(tasks, struct task_struct, se.group_node); @@ -6936,7 +6941,7 @@ static void attach_tasks(struct lb_env *env) attach_task(env->dst_rq, p); } - raw_spin_unlock(&env->dst_rq->lock); + rq_unlock(env->dst_rq, &rf); } #ifdef CONFIG_FAIR_GROUP_SCHED @@ -6944,9 +6949,9 @@ static void update_blocked_averages(int cpu) { 
struct rq *rq = cpu_rq(cpu); struct cfs_rq *cfs_rq; - unsigned long flags; + struct rq_flags rf; - raw_spin_lock_irqsave(&rq->lock, flags); + rq_lock_irqsave(rq, &rf); update_rq_clock(rq); /* @@ -6965,7 +6970,7 @@ static void update_blocked_averages(int cpu) if (cfs_rq->tg->se[cpu]) update_load_avg(cfs_rq->tg->se[cpu], 0); } - raw_spin_unlock_irqrestore(&rq->lock, flags); + rq_unlock_irqrestore(rq, &rf); } /* @@ -7019,12 +7024,12 @@ static inline void update_blocked_averages(int cpu) { struct rq *rq = cpu_rq(cpu); struct cfs_rq *cfs_rq = &rq->cfs; - unsigned long flags; + struct rq_flags rf; - raw_spin_lock_irqsave(&rq->lock, flags); + rq_lock_irqsave(rq, &rf); update_rq_clock(rq); update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true); - raw_spin_unlock_irqrestore(&rq->lock, flags); + rq_unlock_irqrestore(rq, &rf); } static unsigned long task_h_load(struct task_struct *p) @@ -8042,7 +8047,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, struct sched_domain *sd_parent = sd->parent; struct sched_group *group; struct rq *busiest; - unsigned long flags; + struct rq_flags rf; struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask); struct lb_env env = { @@ -8105,7 +8110,7 @@ redo: env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running); more_balance: - raw_spin_lock_irqsave(&busiest->lock, flags); + rq_lock_irqsave(busiest, &rf); update_rq_clock(busiest); /* @@ -8122,14 +8127,14 @@ more_balance: * See task_rq_lock() family for the details. */ - raw_spin_unlock(&busiest->lock); + rq_unlock(busiest, &rf); if (cur_ld_moved) { attach_tasks(&env); ld_moved += cur_ld_moved; } - local_irq_restore(flags); + local_irq_restore(rf.flags); if (env.flags & LBF_NEED_BREAK) { env.flags &= ~LBF_NEED_BREAK; @@ -8207,6 +8212,8 @@ more_balance: sd->nr_balance_failed++; if (need_active_balance(&env)) { + unsigned long flags; + raw_spin_lock_irqsave(&busiest->lock, flags); /* don't kick the active_load_balance_cpu_stop, @@ -8444,8 +8451,9 @@ static int active_load_balance_cpu_stop(void *data) struct rq *target_rq = cpu_rq(target_cpu); struct sched_domain *sd; struct task_struct *p = NULL; + struct rq_flags rf; - raw_spin_lock_irq(&busiest_rq->lock); + rq_lock_irq(busiest_rq, &rf); /* make sure the requested cpu hasn't gone down in the meantime */ if (unlikely(busiest_cpu != smp_processor_id() || @@ -8496,7 +8504,7 @@ static int active_load_balance_cpu_stop(void *data) rcu_read_unlock(); out_unlock: busiest_rq->active_balance = 0; - raw_spin_unlock(&busiest_rq->lock); + rq_unlock(busiest_rq, &rf); if (p) attach_one_task(target_rq, p); @@ -8794,10 +8802,13 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) * do the balance. 
*/ if (time_after_eq(jiffies, rq->next_balance)) { - raw_spin_lock_irq(&rq->lock); + struct rq_flags rf; + + rq_lock_irq(rq, &rf); update_rq_clock(rq); cpu_load_update_idle(rq); - raw_spin_unlock_irq(&rq->lock); + rq_unlock_irq(rq, &rf); + rebalance_domains(rq, CPU_IDLE); } @@ -8988,8 +8999,9 @@ static void task_fork_fair(struct task_struct *p) struct cfs_rq *cfs_rq; struct sched_entity *se = &p->se, *curr; struct rq *rq = this_rq(); + struct rq_flags rf; - raw_spin_lock(&rq->lock); + rq_lock(rq, &rf); update_rq_clock(rq); cfs_rq = task_cfs_rq(current); @@ -9010,7 +9022,7 @@ static void task_fork_fair(struct task_struct *p) } se->vruntime -= cfs_rq->min_vruntime; - raw_spin_unlock(&rq->lock); + rq_unlock(rq, &rf); } /* @@ -9372,7 +9384,6 @@ static DEFINE_MUTEX(shares_mutex); int sched_group_set_shares(struct task_group *tg, unsigned long shares) { int i; - unsigned long flags; /* * We can't change the weight of the root cgroup. @@ -9389,19 +9400,17 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares) tg->shares = shares; for_each_possible_cpu(i) { struct rq *rq = cpu_rq(i); - struct sched_entity *se; + struct sched_entity *se = tg->se[i]; + struct rq_flags rf; - se = tg->se[i]; /* Propagate contribution to hierarchy */ - raw_spin_lock_irqsave(&rq->lock, flags); - - /* Possible calls to update_curr() need rq clock */ + rq_lock_irqsave(rq, &rf); update_rq_clock(rq); for_each_sched_entity(se) { update_load_avg(se, UPDATE_TG); update_cfs_shares(se); } - raw_spin_unlock_irqrestore(&rq->lock, flags); + rq_unlock_irqrestore(rq, &rf); } done: diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 5cbf92214ad8..7d4f69329634 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1624,6 +1624,7 @@ static inline void sched_avg_update(struct rq *rq) { } struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf) __acquires(rq->lock); + struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) __acquires(p->pi_lock) __acquires(rq->lock); @@ -1645,6 +1646,62 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf) raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags); } +static inline void +rq_lock_irqsave(struct rq *rq, struct rq_flags *rf) + __acquires(rq->lock) +{ + raw_spin_lock_irqsave(&rq->lock, rf->flags); + rq_pin_lock(rq, rf); +} + +static inline void +rq_lock_irq(struct rq *rq, struct rq_flags *rf) + __acquires(rq->lock) +{ + raw_spin_lock_irq(&rq->lock); + rq_pin_lock(rq, rf); +} + +static inline void +rq_lock(struct rq *rq, struct rq_flags *rf) + __acquires(rq->lock) +{ + raw_spin_lock(&rq->lock); + rq_pin_lock(rq, rf); +} + +static inline void +rq_relock(struct rq *rq, struct rq_flags *rf) + __acquires(rq->lock) +{ + raw_spin_lock(&rq->lock); + rq_repin_lock(rq, rf); +} + +static inline void +rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf) + __releases(rq->lock) +{ + rq_unpin_lock(rq, rf); + raw_spin_unlock_irqrestore(&rq->lock, rf->flags); +} + +static inline void +rq_unlock_irq(struct rq *rq, struct rq_flags *rf) + __releases(rq->lock) +{ + rq_unpin_lock(rq, rf); + raw_spin_unlock_irq(&rq->lock); +} + +static inline void +rq_unlock(struct rq *rq, struct rq_flags *rf) + __releases(rq->lock) +{ + rq_unpin_lock(rq, rf); + raw_spin_unlock(&rq->lock); +} + #ifdef CONFIG_SMP #ifdef CONFIG_PREEMPT -- cgit From 0a67d1ee30ef1efe6a412b3590e08734902aed43 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 4 Oct 2016 16:29:45 +0200 Subject: sched/core: Add {EN,DE}QUEUE_NOCLOCK flags Currently {en,de}queue_task() 
do an unconditional update_rq_clock(). However since we want to avoid duplicate updates, so that each rq->lock section appears atomic in time, we need to be able to skip these clock updates. Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 10 ++++++++-- kernel/sched/sched.h | 8 +++++--- 2 files changed, 13 insertions(+), 5 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c5a514b1668d..ce363bdc7e6b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -752,17 +752,23 @@ static void set_load_weight(struct task_struct *p) static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) { - update_rq_clock(rq); + if (!(flags & ENQUEUE_NOCLOCK)) + update_rq_clock(rq); + if (!(flags & ENQUEUE_RESTORE)) sched_info_queued(rq, p); + p->sched_class->enqueue_task(rq, p, flags); } static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags) { - update_rq_clock(rq); + if (!(flags & DEQUEUE_NOCLOCK)) + update_rq_clock(rq); + if (!(flags & DEQUEUE_SAVE)) sched_info_dequeued(rq, p); + p->sched_class->dequeue_task(rq, p, flags); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 7d4f69329634..de4b934ba974 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1331,15 +1331,17 @@ extern const u32 sched_prio_to_wmult[40]; #define DEQUEUE_SLEEP 0x01 #define DEQUEUE_SAVE 0x02 /* matches ENQUEUE_RESTORE */ #define DEQUEUE_MOVE 0x04 /* matches ENQUEUE_MOVE */ +#define DEQUEUE_NOCLOCK 0x08 /* matches ENQUEUE_NOCLOCK */ #define ENQUEUE_WAKEUP 0x01 #define ENQUEUE_RESTORE 0x02 #define ENQUEUE_MOVE 0x04 +#define ENQUEUE_NOCLOCK 0x08 -#define ENQUEUE_HEAD 0x08 -#define ENQUEUE_REPLENISH 0x10 +#define ENQUEUE_HEAD 0x10 +#define ENQUEUE_REPLENISH 0x20 #ifdef CONFIG_SMP -#define ENQUEUE_MIGRATED 0x20 +#define ENQUEUE_MIGRATED 0x40 #else #define ENQUEUE_MIGRATED 0x00 #endif -- cgit From 7134b3e941613dcb959b4b178cc4a35e45cbbc0d Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 21 Feb 2017 14:23:38 +0100 Subject: sched/core: Add ENQUEUE_NOCLOCK to ENQUEUE_RESTORE In all cases, ENQUEUE_RESTORE should also have ENQUEUE_NOCLOCK because DEQUEUE_SAVE will have done an update_rq_clock(). 
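The flag pairing is easiest to see in the dequeue/modify/enqueue pattern that the following patches in this series convert. A minimal kernel-style sketch (change_task_attr() is a made-up name; the flags and helpers are the ones introduced here):

    static void change_task_attr(struct rq *rq, struct task_struct *p)
    {
            bool queued = task_on_rq_queued(p);

            if (queued)
                    /* the DEQUEUE_SAVE path still does update_rq_clock() */
                    dequeue_task(rq, p, DEQUEUE_SAVE);

            /* ... modify p's scheduling attributes under rq->lock ... */

            if (queued)
                    /* the clock is still fresh from the dequeue above */
                    enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
    }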
Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ce363bdc7e6b..247d0a0c319e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1070,7 +1070,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) p->sched_class->set_cpus_allowed(p, new_mask); if (queued) - enqueue_task(rq, p, ENQUEUE_RESTORE); + enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK); if (running) set_curr_task(rq, p); } @@ -3815,7 +3815,7 @@ void set_user_nice(struct task_struct *p, long nice) delta = p->prio - old_prio; if (queued) { - enqueue_task(rq, p, ENQUEUE_RESTORE); + enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK); /* * If the task increased its priority or is running and * lowered its priority, then reschedule its CPU: @@ -5517,7 +5517,7 @@ void sched_setnuma(struct task_struct *p, int nid) p->numa_preferred_nid = nid; if (queued) - enqueue_task(rq, p, ENQUEUE_RESTORE); + enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK); if (running) set_curr_task(rq, p); task_rq_unlock(rq, p, &rf); @@ -6431,7 +6431,7 @@ void sched_move_task(struct task_struct *tsk) sched_change_group(tsk, TASK_MOVE_GROUP); if (queued) - enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE); + enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE | ENQUEUE_NOCLOCK); if (running) set_curr_task(rq, tsk); -- cgit From 77558e4d01ac0c7fa8cb1af4a61c2ab508d79f30 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 21 Feb 2017 14:36:23 +0100 Subject: sched/core: Make sched_ttwu_pending() atomic in time Since all tasks on the wake_list are woken under a single rq->lock avoid calling update_rq_clock() for each task. Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 247d0a0c319e..dead90d680fd 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1685,7 +1685,7 @@ static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, struct rq_flags *rf) { - int en_flags = ENQUEUE_WAKEUP; + int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK; lockdep_assert_held(&rq->lock); @@ -1737,6 +1737,7 @@ void sched_ttwu_pending(void) return; rq_lock_irqsave(rq, &rf); + update_rq_clock(rq); while (llist) { int wake_flags = 0; @@ -1849,6 +1850,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags) #endif rq_lock(rq, &rf); + update_rq_clock(rq); ttwu_do_activate(rq, p, wake_flags, &rf); rq_unlock(rq, &rf); } -- cgit From bce4dc80c66ad355c74e876c82ce371020754627 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 21 Feb 2017 14:40:35 +0100 Subject: sched/core: Simplify update_rq_clock() in __schedule() Instead of relying on deactivate_task() to call update_rq_clock() and handling the case where it didn't happen (task_on_rq_queued), unconditionally do update_rq_clock() and skip any further updates. This also avoids a double update on deactivate_task() + ttwu_local(). 
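The "skip any further updates" part rests on the flag promotion at the top of __schedule(). As plain bit arithmetic (the flag values are the ones in kernel/sched/sched.h; the walkthrough itself is illustrative):

    #define RQCF_REQ_SKIP   0x01    /* caller requested skipping the clock update */
    #define RQCF_ACT_SKIP   0x02    /* the skip is active in the current section  */
    #define RQCF_UPDATED    0x04    /* debug: clock already updated this section  */

    unsigned int flags = RQCF_REQ_SKIP | RQCF_UPDATED;  /* e.g. 0x05 on entry */

    flags <<= 1;    /* "Promote REQ to ACT": 0x05 becomes 0x0a */

    /*
     * Now RQCF_ACT_SKIP is set, so the unconditional update_rq_clock()
     * that follows returns early when a skip was requested; RQCF_UPDATED
     * has shifted out of the tracked bits, so the section starts clean
     * and a genuine update cannot trip WARN_DOUBLE_CLOCK.
     */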
Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dead90d680fd..179a6c928bf1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2114,7 +2114,7 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf) delayacct_blkio_end(); atomic_dec(&rq->nr_iowait); } - ttwu_activate(rq, p, ENQUEUE_WAKEUP); + ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK); } ttwu_do_wakeup(rq, p, 0, rf); @@ -3393,13 +3393,14 @@ static void __sched notrace __schedule(bool preempt) /* Promote REQ to ACT */ rq->clock_update_flags <<= 1; + update_rq_clock(rq); switch_count = &prev->nivcsw; if (!preempt && prev->state) { if (unlikely(signal_pending_state(prev->state, prev))) { prev->state = TASK_RUNNING; } else { - deactivate_task(rq, prev, DEQUEUE_SLEEP); + deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); prev->on_rq = 0; if (prev->in_iowait) { @@ -3423,9 +3424,6 @@ static void __sched notrace __schedule(bool preempt) switch_count = &prev->nvcsw; } - if (task_on_rq_queued(prev)) - update_rq_clock(rq); - next = pick_next_task(rq, prev, &rf); clear_tsk_need_resched(prev); clear_preempt_need_resched(); -- cgit From 7a57f32a4d5c80c7790929dd7f4441bb6bff7480 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 21 Feb 2017 14:47:02 +0100 Subject: sched/core: Avoid obvious double update_rq_clock() Add DEQUEUE_NOCLOCK to all places where we just did an update_rq_clock() already. Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 179a6c928bf1..c6be770d6e68 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1062,7 +1062,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) * holding rq->lock. 
*/ lockdep_assert_held(&rq->lock); - dequeue_task(rq, p, DEQUEUE_SAVE); + dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK); } if (running) put_prev_task(rq, p); @@ -2555,7 +2555,7 @@ void wake_up_new_task(struct task_struct *p) update_rq_clock(rq); post_init_entity_util_avg(&p->se); - activate_task(rq, p, 0); + activate_task(rq, p, ENQUEUE_NOCLOCK); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); check_preempt_curr(rq, p, WF_FORK); @@ -3683,7 +3683,8 @@ EXPORT_SYMBOL(default_wake_function); */ void rt_mutex_setprio(struct task_struct *p, int prio) { - int oldprio, queued, running, queue_flag = DEQUEUE_SAVE | DEQUEUE_MOVE; + int oldprio, queued, running, queue_flag = + DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; const struct sched_class *prev_class; struct rq_flags rf; struct rq *rq; @@ -3804,7 +3805,7 @@ void set_user_nice(struct task_struct *p, long nice) queued = task_on_rq_queued(p); running = task_current(rq, p); if (queued) - dequeue_task(rq, p, DEQUEUE_SAVE); + dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK); if (running) put_prev_task(rq, p); @@ -4125,7 +4126,7 @@ static int __sched_setscheduler(struct task_struct *p, const struct sched_class *prev_class; struct rq_flags rf; int reset_on_fork; - int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE; + int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; struct rq *rq; /* May grab non-irq protected spin_locks: */ @@ -6413,7 +6414,8 @@ static void sched_change_group(struct task_struct *tsk, int type) */ void sched_move_task(struct task_struct *tsk) { - int queued, running; + int queued, running, queue_flags = + DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; struct rq_flags rf; struct rq *rq; @@ -6424,14 +6426,14 @@ void sched_move_task(struct task_struct *tsk) queued = task_on_rq_queued(tsk); if (queued) - dequeue_task(rq, tsk, DEQUEUE_SAVE | DEQUEUE_MOVE); + dequeue_task(rq, tsk, queue_flags); if (running) put_prev_task(rq, tsk); sched_change_group(tsk, TASK_MOVE_GROUP); if (queued) - enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE | ENQUEUE_NOCLOCK); + enqueue_task(rq, tsk, queue_flags); if (running) set_curr_task(rq, tsk); -- cgit From 5704ac0ae7f59581a264f45ddfc0ab4235aa052a Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 21 Feb 2017 17:15:21 +0100 Subject: sched/core: Fix double update_rq_clock) calls in attach_task()/detach_task() Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 72b081b9a249..2805bd7c8994 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6771,7 +6771,7 @@ static void detach_task(struct task_struct *p, struct lb_env *env) lockdep_assert_held(&env->src_rq->lock); p->on_rq = TASK_ON_RQ_MIGRATING; - deactivate_task(env->src_rq, p, 0); + deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK); set_task_cpu(p, env->dst_cpu); } @@ -6904,7 +6904,7 @@ static void attach_task(struct rq *rq, struct task_struct *p) lockdep_assert_held(&rq->lock); BUG_ON(task_rq(p) != rq); - activate_task(rq, p, 0); + activate_task(rq, p, ENQUEUE_NOCLOCK); p->on_rq = TASK_ON_RQ_QUEUED; check_preempt_curr(rq, p, 0); } @@ -6918,6 +6918,7 @@ static void attach_one_task(struct rq *rq, struct task_struct *p) struct rq_flags rf; rq_lock(rq, &rf); + update_rq_clock(rq); attach_task(rq, p); rq_unlock(rq, &rf); } @@ -6933,6 +6934,7 @@ static void 
attach_tasks(struct lb_env *env) struct rq_flags rf; rq_lock(env->dst_rq, &rf); + update_rq_clock(env->dst_rq); while (!list_empty(tasks)) { p = list_first_entry(tasks, struct task_struct, se.group_node); -- cgit From 15ff991e8047561bb4a4e800ec60f60939be5fd4 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 5 Oct 2016 17:59:32 +0200 Subject: sched/core: Avoid double update_rq_clock() in move_queued_task() Address this case: WARNING: CPU: 0 PID: 2070 at ../kernel/sched/core.c:109 update_rq_clock+0x74/0x80 rq->clock_update_flags & RQCF_UPDATED Call Trace: update_rq_clock() move_queued_task() __set_cpus_allowed_ptr() ... Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c6be770d6e68..c762f627b9f2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -948,7 +948,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf, lockdep_assert_held(&rq->lock); p->on_rq = TASK_ON_RQ_MIGRATING; - dequeue_task(rq, p, 0); + dequeue_task(rq, p, DEQUEUE_NOCLOCK); set_task_cpu(p, new_cpu); rq_unlock(rq, rf); @@ -987,6 +987,7 @@ static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf, if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) return rq; + update_rq_clock(rq); rq = move_queued_task(rq, rf, p, dest_cpu); return rq; -- cgit From d7921a5ddab8d30e06e321f37eec629f23797486 Mon Sep 17 00:00:00 2001 From: Wanpeng Li Date: Thu, 16 Mar 2017 19:45:19 -0700 Subject: sched/core: Fix rq lock pinning warning after calling balance callbacks This can be reproduced by running rt-migrate-test: WARNING: CPU: 2 PID: 2195 at kernel/locking/lockdep.c:3670 lock_unpin_lock() unpinning an unpinned lock ... Call Trace: dump_stack() __warn() warn_slowpath_fmt() lock_unpin_lock() __balance_callback() __schedule() schedule() futex_wait_queue_me() futex_wait() do_futex() SyS_futex() do_syscall_64() entry_SYSCALL64_slow_path() Revert the rq_lock_irqsave() usage here; the whole point of balance_callback() was to allow dropping rq->lock.
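Why pinning and lock-dropping callbacks cannot coexist can be shown with a deliberately simplified model (illustrative only; lockdep's real pin/unpin bookkeeping uses per-held-lock cookies and catches other imbalances as well):

    #include <stdio.h>

    struct lk { int held; int pinned; };

    static void lk_lock(struct lk *l)   { l->held = 1; l->pinned = 0; } /* fresh acquire: unpinned */
    static void lk_unlock(struct lk *l) { l->held = 0; }
    static void lk_pin(struct lk *l)    { l->pinned = 1; }
    static void lk_unpin(struct lk *l)
    {
            if (!l->pinned)
                    fprintf(stderr, "unpinning an unpinned lock\n");
            l->pinned = 0;
    }

    int main(void)
    {
            struct lk rq_lock = { 0, 0 };

            lk_lock(&rq_lock);
            lk_pin(&rq_lock);       /* rq_lock_irqsave() in __balance_callback() */
            lk_unlock(&rq_lock);    /* a balance callback drops rq->lock ...     */
            lk_lock(&rq_lock);      /* ... and retakes it: the pin did not survive */
            lk_unpin(&rq_lock);     /* rq_unlock_irqrestore() -> the reported WARN */
            lk_unlock(&rq_lock);
            return 0;
    }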
Reported-by: Fengguang Wu Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Fixes: 8a8c69c32778 ("sched/core: Add rq->lock wrappers") Link: http://lkml.kernel.org/r/1489718719-3951-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c762f627b9f2..ab9f6ac099a7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2776,9 +2776,9 @@ static void __balance_callback(struct rq *rq) { struct callback_head *head, *next; void (*func)(struct rq *rq); - struct rq_flags rf; + unsigned long flags; - rq_lock_irqsave(rq, &rf); + raw_spin_lock_irqsave(&rq->lock, flags); head = rq->balance_callback; rq->balance_callback = NULL; while (head) { @@ -2789,7 +2789,7 @@ static void __balance_callback(struct rq *rq) func(rq); } - rq_unlock_irqrestore(rq, &rf); + raw_spin_unlock_irqrestore(&rq->lock, flags); } static inline void balance_callback(struct rq *rq) -- cgit From bc4278987e3874da62edf585fe8b3bdd9b53f638 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Fri, 17 Mar 2017 14:47:22 +0100 Subject: sched/fair: Fix FTQ noise bench regression A regression of the FTQ noise has been reported by Ying Huang, on the following hardware: 8 threads Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 8G memory ... which was caused by this commit: commit 4e5160766fcc ("sched/fair: Propagate asynchrous detach") The only part of the patch that can increase the noise is the update of the blocked load of a group entity in update_blocked_averages(). We can optimize this call and skip the update of a group entity if its load and utilization are already zero and there is no pending propagation of load in the task group. This optimization partly restores the noise score. A more aggressive optimization has been tried but has shown a worse score. Reported-by: ying.huang@linux.intel.com Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: dietmar.eggemann@arm.com Cc: ying.huang@intel.com Fixes: 4e5160766fcc ("sched/fair: Propagate asynchrous detach") Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org [ Fixed typos, improved layout.
] Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++++++++--- 1 file changed, 36 insertions(+), 3 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2805bd7c8994..03adf9fb48b1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3173,6 +3173,36 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) return 1; } +/* + * Check if we need to update the load and the utilization of a blocked + * group_entity: + */ +static inline bool skip_blocked_update(struct sched_entity *se) +{ + struct cfs_rq *gcfs_rq = group_cfs_rq(se); + + /* + * If the sched_entity still has non-zero load or utilization, we have + * to decay it: + */ + if (se->avg.load_avg || se->avg.util_avg) + return false; + + /* + * If there is a pending propagation, we have to update the load and + * the utilization of the sched_entity: + */ + if (gcfs_rq->propagate_avg) + return false; + + /* + * Otherwise, the load and the utilization of the sched_entity are + * already zero and there is no pending propagation, so it will be a + * waste of time to try to decay it: + */ + return true; +} + #else /* CONFIG_FAIR_GROUP_SCHED */ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {} @@ -6961,6 +6991,8 @@ static void update_blocked_averages(int cpu) * list_add_leaf_cfs_rq() for details. */ for_each_leaf_cfs_rq(rq, cfs_rq) { + struct sched_entity *se; + /* throttled entities do not contribute to load */ if (throttled_hierarchy(cfs_rq)) continue; @@ -6968,9 +7000,10 @@ static void update_blocked_averages(int cpu) if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true)) update_tg_load_avg(cfs_rq, 0); - /* Propagate pending load changes to the parent */ - if (cfs_rq->tg->se[cpu]) - update_load_avg(cfs_rq->tg->se[cpu], 0); + /* Propagate pending load changes to the parent, if any: */ + se = cfs_rq->tg->se[cpu]; + if (se && !skip_blocked_update(se)) + update_load_avg(se, 0); } rq_unlock_irqrestore(rq, &rf); } -- cgit From 0ba42a599fbf59a55c1ffedb980be3726c734433 Mon Sep 17 00:00:00 2001 From: Masanari Iida Date: Tue, 7 Mar 2017 20:48:02 +0900 Subject: treewide: Fix typo in xml/driver-api/basics.xml This patch fixes spelling typos found in Documentation/output/xml/driver-api/basics.xml. Because the xml file is generated from comments in the source, the comments themselves had to be fixed. Signed-off-by: Masanari Iida Signed-off-by: Jiri Kosina --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel/sched') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dea138964b91..3ae75f5a69fc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7597,7 +7597,7 @@ next_group: /** * check_asym_packing - Check to see if the group is packed into the - * sched doman. + * sched domain. * * This is primarily intended to used at the sibling level. Some * cores like POWER7 prefer to use lower numbered SMT threads. In the -- cgit From 05b40e057734811ce452344fb3690d09965a7b6a Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Wed, 22 Mar 2017 23:27:50 +0530 Subject: sched/fair: Prefer sibling only if local group is under-utilized If the child domain prefers tasks to go to siblings, the local group could end up pulling tasks to itself even if the local group is almost as heavily loaded as the source group. Let's assume a 4-core, smt==2 machine running a 5-thread ebizzy workload.
Every time the local group has capacity and the source group has at least 2 threads, the local group tries to pull the task. This causes the threads to constantly move between different cores. This is even more profound if the cores have more threads, like in Power 8, smt 8 mode. Fix this by only allowing the local group to pull a task if the source group has more tasks than the local group. Here are the relevant perf stat numbers of a 22-core, smt 8 Power 8 machine. Without patch: Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs): 1,440 context-switches # 0.001 K/sec ( +- 1.26% ) 366 cpu-migrations # 0.000 K/sec ( +- 5.58% ) 3,933 page-faults # 0.002 K/sec ( +- 11.08% ) Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs): 6,287 context-switches # 0.001 K/sec ( +- 3.65% ) 3,776 cpu-migrations # 0.001 K/sec ( +- 4.84% ) 5,702 page-faults # 0.001 K/sec ( +- 9.36% ) Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs): 8,776 context-switches # 0.001 K/sec ( +- 0.73% ) 2,790 cpu-migrations # 0.000 K/sec ( +- 0.98% ) 10,540 page-faults # 0.001 K/sec ( +- 3.12% ) With patch: Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs): 1,133 context-switches # 0.001 K/sec ( +- 4.72% ) 123 cpu-migrations # 0.000 K/sec ( +- 3.42% ) 3,858 page-faults # 0.002 K/sec ( +- 8.52% ) Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs): 2,169 context-switches # 0.000 K/sec ( +- 6.19% ) 189 cpu-migrations # 0.000 K/sec ( +- 12.75% ) 5,917 page-faults # 0.001 K/sec ( +- 8.09% ) Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs): 5,333 context-switches # 0.001 K/sec ( +- 5.91% ) 506 cpu-migrations # 0.000 K/sec ( +- 3.35% ) 10,792 page-faults # 0.001 K/sec ( +- 7.75% ) These numbers show that in these workloads CPU migrations are reduced significantly. Signed-off-by: Srikar Dronamraju Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vincent Guittot Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 03adf9fb48b1..31453d57e8f5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7565,6 +7565,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd { struct sched_domain *child = env->sd->child; struct sched_group *sg = env->sd->groups; + struct sg_lb_stats *local = &sds->local_stat; struct sg_lb_stats tmp_sgs; int load_idx, prefer_sibling = 0; bool overload = false; @@ -7581,7 +7582,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg)); if (local_group) { sds->local = sg; - sgs = &sds->local_stat; + sgs = local; if (env->idle != CPU_NEWLY_IDLE || time_after_eq(jiffies, sg->sgc->next_update)) @@ -7605,8 +7606,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd * the tasks on the system).
*/ if (prefer_sibling && sds->local && - group_has_capacity(env, &sds->local_stat) && - (sgs->sum_nr_running > 1)) { + group_has_capacity(env, local) && + (sgs->sum_nr_running > local->sum_nr_running + 1)) { sgs->group_no_capacity = 1; sgs->group_type = group_classify(sg, sgs); } -- cgit From 0ccb977f4c80b921a8bf6a2c4b8ea0c1fed6553c Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 28 Mar 2017 11:08:20 +0200 Subject: sched/fair: Explicitly generate __update_load_avg() instances The __update_load_avg() function is an __always_inline because it's used with constant propagation to generate different variants of the code without having to duplicate it (which would be prone to bugs). Explicitly instantiate the 3 variants. Note that most of this is called from rather hot paths, so reducing branches is good. Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 38 insertions(+), 18 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 31453d57e8f5..2ac00cfbf29f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2849,7 +2849,7 @@ static u32 __compute_runnable_contrib(u64 n) * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}] */ static __always_inline int -__update_load_avg(u64 now, int cpu, struct sched_avg *sa, +___update_load_avg(u64 now, int cpu, struct sched_avg *sa, unsigned long weight, int running, struct cfs_rq *cfs_rq) { u64 delta, scaled_delta, periods; @@ -2953,6 +2953,28 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, return decayed; } +static int +__update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se) +{ + return ___update_load_avg(now, cpu, &se->avg, 0, 0, NULL); +} + +static int +__update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se) +{ + return ___update_load_avg(now, cpu, &se->avg, + se->on_rq * scale_load_down(se->load.weight), + cfs_rq->curr == se, NULL); +} + +static int +__update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq) +{ + return ___update_load_avg(now, cpu, &cfs_rq->avg, + scale_load_down(cfs_rq->load.weight), + cfs_rq->curr != NULL, cfs_rq); +} + /* * Signed add and clamp on underflow. * @@ -3014,6 +3036,9 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) void set_task_rq_fair(struct sched_entity *se, struct cfs_rq *prev, struct cfs_rq *next) { + u64 p_last_update_time; + u64 n_last_update_time; + if (!sched_feat(ATTACH_AGE_LOAD)) return; @@ -3024,11 +3049,11 @@ void set_task_rq_fair(struct sched_entity *se, * time. This will result in the wakee task is less decayed, but giving * the wakee more load sounds not bad.
*/ - if (se->avg.last_update_time && prev) { - u64 p_last_update_time; - u64 n_last_update_time; + if (!(se->avg.last_update_time && prev)) + return; #ifndef CONFIG_64BIT + { u64 p_last_update_time_copy; u64 n_last_update_time_copy; @@ -3043,14 +3068,13 @@ void set_task_rq_fair(struct sched_entity *se, } while (p_last_update_time != p_last_update_time_copy || n_last_update_time != n_last_update_time_copy); + } #else - p_last_update_time = prev->avg.last_update_time; - n_last_update_time = next->avg.last_update_time; + p_last_update_time = prev->avg.last_update_time; + n_last_update_time = next->avg.last_update_time; #endif - __update_load_avg(p_last_update_time, cpu_of(rq_of(prev)), - &se->avg, 0, 0, NULL); - se->avg.last_update_time = n_last_update_time; - } + __update_load_avg_blocked_se(p_last_update_time, cpu_of(rq_of(prev)), se); + se->avg.last_update_time = n_last_update_time; } /* Take into account change of utilization of a child task group */ @@ -3295,8 +3319,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) set_tg_cfs_propagate(cfs_rq); } - decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa, - scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq); + decayed = __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq); #ifndef CONFIG_64BIT smp_wmb(); @@ -3328,11 +3351,8 @@ static inline void update_load_avg(struct sched_entity *se, int flags) * Track task load average for carrying it to new CPU after migrated, and * track group sched_entity load average for task_h_load calc in migration */ - if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) { - __update_load_avg(now, cpu, &se->avg, - se->on_rq * scale_load_down(se->load.weight), - cfs_rq->curr == se, NULL); - } + if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) + __update_load_avg_se(now, cpu, cfs_rq, se); decayed = update_cfs_rq_load_avg(now, cfs_rq, true); decayed |= propagate_entity_load_avg(se); @@ -3437,7 +3457,7 @@ void sync_entity_load_avg(struct sched_entity *se) u64 last_update_time; last_update_time = cfs_rq_last_update_time(cfs_rq); - __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL); + __update_load_avg_blocked_se(last_update_time, cpu_of(rq_of(cfs_rq)), se); } /* -- cgit From a481db34b9beb7a9647c23f2320dd38a2b1d681f Mon Sep 17 00:00:00 2001 From: Yuyang Du Date: Mon, 13 Feb 2017 05:44:23 +0800 Subject: sched/fair: Optimize ___update_sched_avg() The main PELT function ___update_load_avg(), which implements the accumulation and progression of the geometric average series, is implemented along the following lines for the scenario where the time delta spans all 3 possible sections (see figure below):

 1. add the remainder of the last incomplete period
 2. decay old sum
 3. accumulate new sum in full periods since last_update_time
 4. accumulate the current incomplete period
 5. update averages

Or:

            d1          d2           d3
             ^           ^            ^
             |           |            |
           |<->|<----------------->|<--->|
   ... |---x---|------| ... |------|-----x (now)

  load_sum' = (load_sum + weight * scale * d1) * y^(p+1) +        (1,2)
              weight * scale * 1024 * \Sum_{n=1}^{p} y^n +        (3)
              weight * scale * d3 * y^0                           (4)

  load_avg' = load_sum' / LOAD_AVG_MAX                            (5)

Where: d1 - is the delta part completing the remainder of the last incomplete period, d2 - is the delta part spanning complete periods, and d3 - is the delta part starting the current incomplete period.
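To make the bookkeeping concrete, here is a small worked example (illustrative numbers, not from the original changelog): if period_contrib was 200 at the last update and a new delta of 3000us arrives, then d1 = 1024 - 200 = 824 completes the old period, the remaining 2176us span p = 2 full periods (d2 = 2048), and d3 = 3000 - 824 - 2048 = 128 starts the current incomplete period; the three parts add back up to delta.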
We can simplify the code in two steps; the first step is to separate the first term into new and old parts like:

  (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) + weight * scale * d1 * y^(p+1)

Once we've done that, it's easy to see that all new terms carry the common factors: weight * scale If we factor those out, we arrive at the form:

  load_sum' = load_sum * y^(p+1) +
              weight * scale * (d1 * y^(p+1) + 1024 * \Sum_{n=1}^{p} y^n + d3 * y^0)

Which results in a simpler, smaller and faster implementation. Signed-off-by: Yuyang Du Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: matt@codeblueprint.co.uk Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 212 +++++++++++++++++++++++++++++----------------------- 1 file changed, 118 insertions(+), 94 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2ac00cfbf29f..76f67b3e34d6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2767,7 +2767,7 @@ static const u32 __accumulated_sum_N32[] = { * Approximate: * val * y^n, where y^32 ~= 0.5 (~1 scheduling period) */ -static __always_inline u64 decay_load(u64 val, u64 n) +static u64 decay_load(u64 val, u64 n) { unsigned int local_n; @@ -2795,31 +2795,112 @@ static __always_inline u64 decay_load(u64 val, u64 n) return val; } -/* - * For updates fully spanning n periods, the contribution to runnable - * average will be: \Sum 1024*y^n - * - * We can compute this reasonably efficiently by combining: - * y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for n < PERIOD} - */ -static u32 __compute_runnable_contrib(u64 n) +static u32 __accumulate_sum(u64 periods, u32 period_contrib, u32 remainder) { - u32 contrib = 0; + u32 c1, c2, c3 = remainder; /* y^0 == 1 */ - if (likely(n <= LOAD_AVG_PERIOD)) - return runnable_avg_yN_sum[n]; - else if (unlikely(n >= LOAD_AVG_MAX_N)) + if (!periods) + return remainder - period_contrib; + + if (unlikely(periods >= LOAD_AVG_MAX_N)) return LOAD_AVG_MAX; - /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */ - contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD]; - n %= LOAD_AVG_PERIOD; - contrib = decay_load(contrib, n); - return contrib + runnable_avg_yN_sum[n]; + /* + * c1 = d1 y^(p+1) + */ + c1 = decay_load((u64)(1024 - period_contrib), periods); + + periods -= 1; + /* + * For updates fully spanning n periods, the contribution to runnable + * average will be: + * + * c2 = 1024 \Sum y^n + * + * We can compute this reasonably efficiently by combining: + * + * y^PERIOD = 1/2 with precomputed 1024 \Sum y^n {for: n < PERIOD} + */ + if (likely(periods <= LOAD_AVG_PERIOD)) { + c2 = runnable_avg_yN_sum[periods]; + } else { + c2 = __accumulated_sum_N32[periods/LOAD_AVG_PERIOD]; + periods %= LOAD_AVG_PERIOD; + c2 = decay_load(c2, periods); + c2 += runnable_avg_yN_sum[periods]; + } + + return c1 + c2 + c3; } #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT) +/* + * Accumulate the three separate parts of the sum; d1 the remainder + * of the last (incomplete) period, d2 the span of full periods and d3 + * the remainder of the (incomplete) current period. + * + * d1 d2 d3 + * ^ ^ ^ + * | | | + * |<->|<----------------->|<--->| + * ... |---x---|------| ...
|------|-----x (now) + * + * p + * u' = (u + d1) y^(p+1) + 1024 \Sum y^n + d3 y^0 + * n=1 + * + * = u y^(p+1) + (Step 1) + * + * p + * d1 y^(p+1) + 1024 \Sum y^n + d3 y^0 (Step 2) + * n=1 + */ +static __always_inline u32 +accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, + unsigned long weight, int running, struct cfs_rq *cfs_rq) +{ + unsigned long scale_freq, scale_cpu; + u64 periods; + u32 contrib; + + scale_freq = arch_scale_freq_capacity(NULL, cpu); + scale_cpu = arch_scale_cpu_capacity(NULL, cpu); + + delta += sa->period_contrib; + periods = delta / 1024; /* A period is 1024us (~1ms) */ + + /* + * Step 1: decay old *_sum if we crossed period boundaries. + */ + if (periods) { + sa->load_sum = decay_load(sa->load_sum, periods); + if (cfs_rq) { + cfs_rq->runnable_load_sum = + decay_load(cfs_rq->runnable_load_sum, periods); + } + sa->util_sum = decay_load((u64)(sa->util_sum), periods); + } + + /* + * Step 2 + */ + delta %= 1024; + contrib = __accumulate_sum(periods, sa->period_contrib, delta); + sa->period_contrib = delta; + + contrib = cap_scale(contrib, scale_freq); + if (weight) { + sa->load_sum += weight * contrib; + if (cfs_rq) + cfs_rq->runnable_load_sum += weight * contrib; + } + if (running) + sa->util_sum += contrib * scale_cpu; + + return periods; +} + /* * We can represent the historical contribution to runnable average as the * coefficients of a geometric series. To do this we sub-divide our runnable @@ -2852,10 +2933,7 @@ static __always_inline int ___update_load_avg(u64 now, int cpu, struct sched_avg *sa, unsigned long weight, int running, struct cfs_rq *cfs_rq) { - u64 delta, scaled_delta, periods; - u32 contrib; - unsigned int delta_w, scaled_delta_w, decayed = 0; - unsigned long scale_freq, scale_cpu; + u64 delta; delta = now - sa->last_update_time; /* @@ -2876,81 +2954,27 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa, return 0; sa->last_update_time = now; - scale_freq = arch_scale_freq_capacity(NULL, cpu); - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); - - /* delta_w is the amount already accumulated against our next period */ - delta_w = sa->period_contrib; - if (delta + delta_w >= 1024) { - decayed = 1; - - /* how much left for next period will start over, we don't know yet */ - sa->period_contrib = 0; - - /* - * Now that we know we're crossing a period boundary, figure - * out how much from delta we need to complete the current - * period and accrue it. 
- */ - delta_w = 1024 - delta_w; - scaled_delta_w = cap_scale(delta_w, scale_freq); - if (weight) { - sa->load_sum += weight * scaled_delta_w; - if (cfs_rq) { - cfs_rq->runnable_load_sum += - weight * scaled_delta_w; - } - } - if (running) - sa->util_sum += scaled_delta_w * scale_cpu; - - delta -= delta_w; - - /* Figure out how many additional periods this update spans */ - periods = delta / 1024; - delta %= 1024; - - sa->load_sum = decay_load(sa->load_sum, periods + 1); - if (cfs_rq) { - cfs_rq->runnable_load_sum = - decay_load(cfs_rq->runnable_load_sum, periods + 1); - } - sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1); - - /* Efficiently calculate \sum (1..n_period) 1024*y^i */ - contrib = __compute_runnable_contrib(periods); - contrib = cap_scale(contrib, scale_freq); - if (weight) { - sa->load_sum += weight * contrib; - if (cfs_rq) - cfs_rq->runnable_load_sum += weight * contrib; - } - if (running) - sa->util_sum += contrib * scale_cpu; - } - - /* Remainder of delta accrued against u_0` */ - scaled_delta = cap_scale(delta, scale_freq); - if (weight) { - sa->load_sum += weight * scaled_delta; - if (cfs_rq) - cfs_rq->runnable_load_sum += weight * scaled_delta; - } - if (running) - sa->util_sum += scaled_delta * scale_cpu; - - sa->period_contrib += delta; + /* + * Now we know we crossed measurement unit boundaries. The *_avg + * accrues by two steps: + * + * Step 1: accumulate *_sum since last_update_time. If we haven't + * crossed period boundaries, finish. + */ + if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq)) + return 0; - if (decayed) { - sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX); - if (cfs_rq) { - cfs_rq->runnable_load_avg = - div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX); - } - sa->util_avg = sa->util_sum / LOAD_AVG_MAX; + /* + * Step 2: update *_avg. + */ + sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX); + if (cfs_rq) { + cfs_rq->runnable_load_avg = + div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX); } + sa->util_avg = sa->util_sum / LOAD_AVG_MAX; - return decayed; + return 1; } static int -- cgit From e96a7705e7d3fef96aec9b590c63b2f6f7d2ba22 Mon Sep 17 00:00:00 2001 From: Xunlei Pang Date: Thu, 23 Mar 2017 15:56:08 +0100 Subject: sched/rtmutex/deadline: Fix a PI crash for deadline tasks A crash happened while I was playing with deadline PI rtmutex.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
 IP: [] rt_mutex_get_top_task+0x1f/0x30
 PGD 232a75067 PUD 230947067 PMD 0
 Oops: 0000 [#1] SMP
 CPU: 1 PID: 10994 Comm: a.out Not tainted
 Call Trace:
  [] enqueue_task+0x2c/0x80
  [] activate_task+0x23/0x30
  [] pull_dl_task+0x1d5/0x260
  [] pre_schedule_dl+0x16/0x20
  [] __schedule+0xd3/0x900
  [] schedule+0x29/0x70
  [] __rt_mutex_slowlock+0x4b/0xc0
  [] rt_mutex_slowlock+0xd1/0x190
  [] rt_mutex_timed_lock+0x53/0x60
  [] futex_lock_pi.isra.18+0x28c/0x390
  [] do_futex+0x190/0x5b0
  [] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi() are only protected by pi_lock when operating on pi waiters, while rt_mutex_get_top_task() will access them with the rq lock held but without holding pi_lock. To tackle this, we introduce a new "pi_top_task" pointer cached in task_struct, and add a new rt_mutex_update_top_task() to update its value; it can be called by rt_mutex_setprio(), which holds both the owner's pi_lock and the rq lock. Thus "pi_top_task" can be safely accessed by enqueue_task_dl() under rq lock.
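A minimal sketch of the shape such a helper can take (illustrative only, with the locking rule spelled out as an assumption; the actual helper lives in kernel/locking/rtmutex.c and may differ in detail):

	/* Must be called with both p->pi_lock and p's rq lock held. */
	static void rt_mutex_update_top_task(struct task_struct *p)
	{
		lockdep_assert_held(&p->pi_lock);

		/*
		 * Refresh the cached pointer so that readers holding only
		 * the rq lock (e.g. enqueue_task_dl()) see a stable value.
		 */
		p->pi_top_task = rt_mutex_get_top_task(p);
	}

Here rt_mutex_get_top_task() is the existing pi_lock-protected lookup that appears in the oops above.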
Originally-From: Peter Zijlstra Signed-off-by: Xunlei Pang Signed-off-by: Peter Zijlstra (Intel) Acked-by: Steven Rostedt Reviewed-by: Thomas Gleixner Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org Signed-off-by: Thomas Gleixner --- kernel/sched/core.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ab9f6ac099a7..e1f44ec701c8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3713,6 +3713,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio) goto out_unlock; } + rt_mutex_update_top_task(p); + trace_sched_pi_setprio(p, prio); oldprio = p->prio; -- cgit From acd58620e415aee4a43a808d7d2fd87259ee0001 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 23 Mar 2017 15:56:11 +0100 Subject: sched/rtmutex: Refactor rt_mutex_setprio() With the introduction of SCHED_DEADLINE the whole notion that priority is a single number is gone, therefore the @prio argument to rt_mutex_setprio() doesn't make sense anymore. So rework the code to pass a pi_task instead. Note this also fixes a problem with pi_top_task caching; previously we would not set the pointer (call rt_mutex_update_top_task) if the priority didn't change, which could lead to a stale pointer. As for the XXX, I think it's fine to use pi_task->prio, because if it differs from waiter->prio, a PI chain update is imminent. Signed-off-by: Peter Zijlstra (Intel) Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org Signed-off-by: Thomas Gleixner --- kernel/sched/core.c | 66 ++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 53 insertions(+), 13 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e1f44ec701c8..37b152a782ae 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3671,10 +3671,25 @@ EXPORT_SYMBOL(default_wake_function); #ifdef CONFIG_RT_MUTEXES +static inline int __rt_effective_prio(struct task_struct *pi_task, int prio) +{ + if (pi_task) + prio = min(prio, pi_task->prio); + + return prio; +} + +static inline int rt_effective_prio(struct task_struct *p, int prio) +{ + struct task_struct *pi_task = rt_mutex_get_top_task(p); + + return __rt_effective_prio(pi_task, prio); +} + /* * rt_mutex_setprio - set the current priority of a task - * @p: task - * @prio: prio value (kernel-internal form) + * @p: task to boost + * @pi_task: donor task * * This function changes the 'effective' priority of a task. It does * not touch ->normal_prio like __setscheduler(). @@ -3682,18 +3697,42 @@ EXPORT_SYMBOL(default_wake_function); * Used by the rt_mutex code to implement priority inheritance * logic. Call site only calls if the priority of the task changed.
*/ -void rt_mutex_setprio(struct task_struct *p, int prio) +void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) { - int oldprio, queued, running, queue_flag = + int prio, oldprio, queued, running, queue_flag = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; const struct sched_class *prev_class; struct rq_flags rf; struct rq *rq; - BUG_ON(prio > MAX_PRIO); + /* XXX used to be waiter->prio, not waiter->task->prio */ + prio = __rt_effective_prio(pi_task, p->normal_prio); + + /* + * If nothing changed; bail early. + */ + if (p->pi_top_task == pi_task && prio == p->prio && !dl_prio(prio)) + return; rq = __task_rq_lock(p, &rf); update_rq_clock(rq); + /* + * Set under pi_lock && rq->lock, such that the value can be used under + * either lock. + * + * Note that there is loads of tricky to make this pointer cache work + * right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together to + * ensure a task is de-boosted (pi_task is set to NULL) before the + * task is allowed to run again (and can exit). This ensures the pointer + * points to a blocked task -- which guaratees the task is present. + */ + p->pi_top_task = pi_task; + + /* + * For FIFO/RR we only need to set prio, if that matches we're done. + */ + if (prio == p->prio && !dl_prio(prio)) + goto out_unlock; /* * Idle task boosting is a nono in general. There is one @@ -3713,9 +3752,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio) goto out_unlock; } - rt_mutex_update_top_task(p); - - trace_sched_pi_setprio(p, prio); + trace_sched_pi_setprio(p, prio); /* broken */ oldprio = p->prio; if (oldprio == prio) @@ -3739,7 +3776,6 @@ void rt_mutex_setprio(struct task_struct *p, int prio) * running task */ if (dl_prio(prio)) { - struct task_struct *pi_task = rt_mutex_get_top_task(p); if (!dl_prio(p->normal_prio) || (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) { p->dl.dl_boosted = 1; @@ -3777,6 +3813,11 @@ out_unlock: balance_callback(rq); preempt_enable(); } +#else +static inline int rt_effective_prio(struct task_struct *p, int prio) +{ + return prio; +} #endif void set_user_nice(struct task_struct *p, long nice) @@ -4023,10 +4064,9 @@ static void __setscheduler(struct rq *rq, struct task_struct *p, * Keep a potential priority boosting if called from * sched_setscheduler(). */ + p->prio = normal_prio(p); if (keep_boost) - p->prio = rt_mutex_get_effective_prio(p, normal_prio(p)); - else - p->prio = normal_prio(p); + p->prio = rt_effective_prio(p, p->prio); if (dl_prio(p->prio)) p->sched_class = &dl_sched_class; @@ -4313,7 +4353,7 @@ change: * the runqueue. This will be done when the task deboost * itself. */ - new_effective_prio = rt_mutex_get_effective_prio(p, newprio); + new_effective_prio = rt_effective_prio(p, newprio); if (new_effective_prio == oldprio) queue_flags &= ~DEQUEUE_MOVE; } -- cgit From b91473ff6e979c0028f02f90e40c844959c736d8 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 23 Mar 2017 15:56:12 +0100 Subject: sched,tracing: Update trace_sched_pi_setprio() Pass the PI donor task, instead of a numerical priority. Numerical priorities are not sufficient to describe state ever since SCHED_DEADLINE. Annotate all sched tracepoints that are currently broken; fixing them will bork userspace. *hate*. 
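For illustration, the donor-task form lets the tracepoint derive a meaningful priority itself; a hedged sketch of the TP_fast_assign() side (field names assumed, the real TRACE_EVENT lives in include/trace/events/sched.h and may differ):

	TP_fast_assign(
		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
		__entry->pid     = tsk->pid;
		__entry->oldprio = tsk->prio;
		/* the donor task, if any, bounds the new effective priority */
		__entry->newprio = pi_task ?
				min(tsk->normal_prio, pi_task->prio) :
				tsk->normal_prio;
	),

This keeps the reported newprio meaningful even when the donor is a SCHED_DEADLINE task.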
Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Steven Rostedt Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.353599881@infradead.org Signed-off-by: Thomas Gleixner --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 37b152a782ae..0c443bbdaaca 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3752,7 +3752,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) goto out_unlock; } - trace_sched_pi_setprio(p, prio); /* broken */ + trace_sched_pi_setprio(p, pi_task); oldprio = p->prio; if (oldprio == prio) -- cgit From 30e03acda5fd9c77ec9bf8b3c5def9540c6b0486 Mon Sep 17 00:00:00 2001 From: Rakib Mullick Date: Sun, 9 Apr 2017 07:36:14 +0600 Subject: cpuset: Remove cpuset_update_active_cpus()'s parameter. In cpuset_update_active_cpus(), cpu_online isn't used anymore. Remove it. Signed-off-by: Rakib Mullick Acked-by: Zefan Li Signed-off-by: Tejun Heo --- kernel/sched/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 956383844116..e62802c044e1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5727,7 +5727,7 @@ static void cpuset_cpu_active(void) * cpuset configurations. */ } - cpuset_update_active_cpus(true); + cpuset_update_active_cpus(); } static int cpuset_cpu_inactive(unsigned int cpu) @@ -5750,7 +5750,7 @@ static int cpuset_cpu_inactive(unsigned int cpu) if (overflow) return -EBUSY; - cpuset_update_active_cpus(false); + cpuset_update_active_cpus(); } else { num_cpus_frozen++; partition_sched_domains(1, NULL, NULL); -- cgit From 05296e7535d67ba4926b543a09cf5d430a815cb6 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 31 Mar 2017 10:51:41 +0200 Subject: sched/fair: Fix corner case in __accumulate_sum() Paul noticed that in the (periods >= LOAD_AVG_MAX_N) case in __accumulate_sum(), the returned contribution value (LOAD_AVG_MAX) is incorrect. This is because at this point, the decay_load() on the old state -- the first step in accumulate_sum() -- will not have resulted in 0, and will therefore result in a sum larger than the maximum value of our series. Obviously broken. Note that:

  decay_load(LOAD_AVG_MAX, LOAD_AVG_MAX_N) = 47742 * (1/2)^(345/32) = ~27

Not to mention that any further contribution from the d3 segment (our new period) would also push it over the maximum. Solve this by noting that we can write our c2 term:

  c2 = 1024 \Sum_{n=1}^{p} y^n

In terms of our maximum value:

  max = 1024 \Sum_{n=0}^{inf} y^n = 1024 ( \Sum_{n=p+1}^{inf} y^n + \Sum_{n=1}^{p} y^n + y^0 )

Further note that:

  ( \Sum_{n=0}^{inf} y^n ) y^p = \Sum_{n=0}^{inf} y^(n+p) = \Sum_{n=p}^{inf} y^n

Combined that gives us:

  c2 = 1024 \Sum_{n=1}^{p} y^n
     = 1024 ( \Sum_{n=0}^{inf} y^n - \Sum_{n=p+1}^{inf} y^n - y^0 )
     = max - (max y^(p+1)) - 1024

Further simplify things by dealing with p=0 early on.
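As a quick sanity check of the closed form (arithmetic added for illustration, not from the original changelog): for p = 1 we expect c2 = 1024 y ~= 1002. With max = 1024 \Sum_{n=0}^{inf} y^n = 1024/(1 - y) = 47742, the formula gives max - (max y^2) - 1024 = 1024 (1 + y) - 1024 = 1024 y, which matches runnable_avg_yN_sum[1] = 1002 from the table this change removes.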
Reported-by: Paul Turner Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Yuyang Du Cc: linux-kernel@vger.kernel.org Fixes: a481db34b9be ("sched/fair: Optimize ___update_sched_avg()") Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 75 ++++++++++++++--------------------------------------- 1 file changed, 19 insertions(+), 56 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 76f67b3e34d6..1e5f58081762 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -727,7 +727,6 @@ static unsigned long task_h_load(struct task_struct *p); */ #define LOAD_AVG_PERIOD 32 #define LOAD_AVG_MAX 47742 /* maximum possible load avg */ -#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */ /* Give new sched_entity start runnable values to heavy its load in infant time */ void init_entity_runnable_average(struct sched_entity *se) @@ -2743,26 +2742,6 @@ static const u32 runnable_avg_yN_inv[] = { 0x85aac367, 0x82cd8698, }; -/* - * Precomputed \Sum y^k { 1<=k<=n }. These are floor(true_value) to prevent - * over-estimates when re-combining. - */ -static const u32 runnable_avg_yN_sum[] = { - 0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103, - 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082, - 17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371, -}; - -/* - * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to - * lower integers. See Documentation/scheduler/sched-avg.txt how these - * were generated: - */ -static const u32 __accumulated_sum_N32[] = { - 0, 23371, 35056, 40899, 43820, 45281, - 46011, 46376, 46559, 46650, 46696, 46719, -}; - /* * Approximate: * val * y^n, where y^32 ~= 0.5 (~1 scheduling period) @@ -2771,9 +2750,7 @@ static u64 decay_load(u64 val, u64 n) { unsigned int local_n; - if (!n) - return val; - else if (unlikely(n > LOAD_AVG_PERIOD * 63)) + if (unlikely(n > LOAD_AVG_PERIOD * 63)) return 0; /* after bounds checking we can collapse to 32-bit */ @@ -2795,40 +2772,25 @@ static u64 decay_load(u64 val, u64 n) return val; } -static u32 __accumulate_sum(u64 periods, u32 period_contrib, u32 remainder) +static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3) { - u32 c1, c2, c3 = remainder; /* y^0 == 1 */ - - if (!periods) - return remainder - period_contrib; - - if (unlikely(periods >= LOAD_AVG_MAX_N)) - return LOAD_AVG_MAX; + u32 c1, c2, c3 = d3; /* y^0 == 1 */ /* * c1 = d1 y^(p+1) */ - c1 = decay_load((u64)(1024 - period_contrib), periods); + c1 = decay_load((u64)d1, periods); - periods -= 1; /* - * For updates fully spanning n periods, the contribution to runnable - * average will be: + * p + * c2 = 1024 \Sum y^n + * n=1 * - * c2 = 1024 \Sum y^n - * - * We can compute this reasonably efficiently by combining: - * - * y^PERIOD = 1/2 with precomputed 1024 \Sum y^n {for: n < PERIOD} + * inf inf + * = 1024 ( \Sum y^n - \Sum y^n - y^0 ) + * n=0 n=p+1 */ - if (likely(periods <= LOAD_AVG_PERIOD)) { - c2 = runnable_avg_yN_sum[periods]; - } else { - c2 = __accumulated_sum_N32[periods/LOAD_AVG_PERIOD]; - periods %= LOAD_AVG_PERIOD; - c2 = decay_load(c2, periods); - c2 += runnable_avg_yN_sum[periods]; - } + c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024; return c1 + c2 + c3; } @@ -2861,8 +2823,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, unsigned long weight, int running, struct cfs_rq *cfs_rq) { unsigned long scale_freq, scale_cpu; + u32 contrib = 
(u32)delta; /* p == 0 -> delta < 1024 */ u64 periods; - u32 contrib; scale_freq = arch_scale_freq_capacity(NULL, cpu); scale_cpu = arch_scale_cpu_capacity(NULL, cpu); @@ -2880,13 +2842,14 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, decay_load(cfs_rq->runnable_load_sum, periods); } sa->util_sum = decay_load((u64)(sa->util_sum), periods); - } - /* - * Step 2 - */ - delta %= 1024; - contrib = __accumulate_sum(periods, sa->period_contrib, delta); + /* + * Step 2 + */ + delta %= 1024; + contrib = __accumulate_pelt_segments(periods, + 1024 - sa->period_contrib, delta); + } sa->period_contrib = delta; contrib = cap_scale(contrib, scale_freq); -- cgit From 3841cdc31099fe3b84c93903c63e3d60348c0ea1 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 10 Apr 2017 12:47:33 +0200 Subject: sched/fair: Fix comments Historically our periods (or p) argument in PELT denoted the number of full periods (what is now d2). However recent patches have changed this to the total decay (previously p+1), leading to a confusing discrepancy between comments and code. Try and clarify things by making periods (in code) and p (in comments) be the same thing (again). Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1e5f58081762..d43e9ac9c3c5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2777,18 +2777,18 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3) u32 c1, c2, c3 = d3; /* y^0 == 1 */ /* - * c1 = d1 y^(p+1) + * c1 = d1 y^p */ c1 = decay_load((u64)d1, periods); /* - * p + * p-1 * c2 = 1024 \Sum y^n * n=1 * * inf inf * = 1024 ( \Sum y^n - \Sum y^n - y^0 ) - * n=0 n=p+1 + * n=0 n=p */ c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024; @@ -2808,15 +2808,15 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3) * |<->|<----------------->|<--->| * ... |---x---|------| ... |------|-----x (now) * - * p - * u' = (u + d1) y^(p+1) + 1024 \Sum y^n + d3 y^0 - * n=1 + * p-1 + * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0 + * n=1 * - * = u y^(p+1) + (Step 1) + * = u y^p + (Step 1) * - * p - * d1 y^(p+1) + 1024 \Sum y^n + d3 y^0 (Step 2) - * n=1 + * p-1 + * d1 y^p + 1024 \Sum y^n + d3 y^0 (Step 2) + * n=1 */ static __always_inline u32 accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, -- cgit From bb0bd044e65c2bf0f26b29613fcc441dfdeedf14 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 10 Apr 2017 13:20:45 +0200 Subject: sched/fair: Increase PELT accuracy for small tasks We truncate (and lose) the lower 10 bits of runtime in ___update_load_avg(), which means there's a consistent bias to under-account tasks. This is especially significant for small tasks. Cure this by only forwarding last_update_time to the point we've actually accounted for, leaving the remainder for the next time.
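To see the size of the bias (illustrative numbers, not from the changelog): if updates arrive every 1500ns, delta >>= 10 accounts only 1024ns, and the old code then set last_update_time = now, silently dropping the remaining 476ns on every update, roughly a third of such a task's runtime. With last_update_time += delta << 10, those 476ns remain part of the next delta and are eventually accounted.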
Reported-by: Morten Rasmussen Signed-off-by: Peter Zijlstra (Intel) Acked-by: Morten Rasmussen Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel/sched') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d43e9ac9c3c5..1e3b99a9ab69 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2915,7 +2915,8 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa, delta >>= 10; if (!delta) return 0; - sa->last_update_time = now; + + sa->last_update_time += delta << 10; /* * Now we know we crossed measurement unit boundaries. The *_avg -- cgit From 283e2ed3990c36c00403b62b264ebfabaf931104 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 11 Apr 2017 11:08:42 +0200 Subject: sched/fair: Move the PELT constants into a generated header Now that we have a tool to generate the PELT constants in C form, use its output as a separate header. Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 21 +++------------------ kernel/sched/sched-pelt.h | 13 +++++++++++++ 2 files changed, 16 insertions(+), 18 deletions(-) create mode 100644 kernel/sched/sched-pelt.h (limited to 'kernel/sched') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1e3b99a9ab69..a903276fcb62 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -717,17 +717,12 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se) } #ifdef CONFIG_SMP + +#include "sched-pelt.h" + static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu); static unsigned long task_h_load(struct task_struct *p); -/* - * We choose a half-life close to 1 scheduling period. - * Note: The tables runnable_avg_yN_inv and runnable_avg_yN_sum are - * dependent on this value. - */ -#define LOAD_AVG_PERIOD 32 -#define LOAD_AVG_MAX 47742 /* maximum possible load avg */ - /* Give new sched_entity start runnable values to heavy its load in infant time */ void init_entity_runnable_average(struct sched_entity *se) { @@ -2732,16 +2727,6 @@ static inline void update_cfs_shares(struct sched_entity *se) #endif /* CONFIG_FAIR_GROUP_SCHED */ #ifdef CONFIG_SMP -/* Precomputed fixed inverse multiplies for multiplication by y^n */ -static const u32 runnable_avg_yN_inv[] = { - 0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6, - 0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85, - 0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581, - 0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9, - 0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80, - 0x85aac367, 0x82cd8698, -}; - /* * Approximate: * val * y^n, where y^32 ~= 0.5 (~1 scheduling period) diff --git a/kernel/sched/sched-pelt.h b/kernel/sched/sched-pelt.h new file mode 100644 index 000000000000..cd200d16529e --- /dev/null +++ b/kernel/sched/sched-pelt.h @@ -0,0 +1,13 @@ +/* Generated by Documentation/scheduler/sched-pelt; do not modify. 
*/ + +static const u32 runnable_avg_yN_inv[] = { + 0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6, + 0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85, + 0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581, + 0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9, + 0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80, + 0x85aac367, 0x82cd8698, +}; + +#define LOAD_AVG_PERIOD 32 +#define LOAD_AVG_MAX 47742 -- cgit From bcbfdd01dce5556a952fae84ef16fd0f12525e7b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 11 Apr 2017 15:50:41 -0700 Subject: rcu: Make non-preemptive schedule be Tasks RCU quiescent state Currently, a call to schedule() acts as a Tasks RCU quiescent state only if a context switch actually takes place. However, just the call to schedule() guarantees that the calling task has moved off of whatever tracing trampoline it might have been on previously. This commit therefore plumbs schedule()'s "preempt" parameter into rcu_note_context_switch(), which then records the Tasks RCU quiescent state, but only if this call to schedule() was -not- due to a preemption. To avoid adding overhead to the common-case context-switch path, this commit hides the rcu_note_context_switch() check under an existing non-common-case check. Suggested-by: Steven Rostedt Signed-off-by: Paul E. McKenney --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3b31fc05a0f1..2adf7b6c04e7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3378,7 +3378,7 @@ static void __sched notrace __schedule(bool preempt) hrtick_clear(rq); local_irq_disable(); - rcu_note_context_switch(); + rcu_note_context_switch(preempt); /* * Make sure that signal_pending_state()->signal_pending() below -- cgit From 25e2d8c1b9e327ed260edd13169cc22bc7a78bc6 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 25 Apr 2017 16:10:48 +0200 Subject: sched/cputime: Fix ksoftirqd cputime accounting regression irq_time_read() returns the irqtime minus the ksoftirqd time. This is necessary because irq_time_read() is used to subtract the IRQ time from the sum_exec_runtime of a task. If we were to include the softirq time of ksoftirqd, this task would subtract its own CPU time every time it updates ksoftirqd->sum_exec_runtime, which would therefore never progress. But this behaviour got broken by: a499a5a14db ("sched/cputime: Increment kcpustat directly on irqtime account") ... which now includes ksoftirqd softirq time in the time returned by irq_time_read(). This has resulted in wrong ksoftirqd cputime reported to userspace through /proc/stat and thus "top" not showing ksoftirqd when it should after intense networking load. ksoftirqd->stime happens to be correct but it gets scaled down by sum_exec_runtime through task_cputime_adjusted(). To fix this, just account the strict IRQ time in a separate counter and use it to report the IRQ time.
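For context, a simplified sketch of the consumer side, as found in update_rq_clock_task() in kernel/sched/core.c around this time (treat the exact surrounding code as an assumption): the snapshot returned by irq_time_read() is differenced against a per-rq watermark to carve IRQ time out of task runtime,

	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
	if (irq_delta > delta)
		irq_delta = delta;
	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;	/* tasks are not charged for IRQ time */

which is why including ksoftirqd's own softirq time in the returned total made ksoftirqd effectively cancel out its own runtime.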
Reported-and-tested-by: Jesper Dangaard Brouer Signed-off-by: Frederic Weisbecker Reviewed-by: Rik van Riel Acked-by: Jesper Dangaard Brouer Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Stanislaw Gruszka Cc: Thomas Gleixner Cc: Wanpeng Li Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar --- kernel/sched/cputime.c | 27 ++++++++++++++++----------- kernel/sched/sched.h | 9 +++++++-- 2 files changed, 23 insertions(+), 13 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index f3778e2b46c8..aea3135c5d90 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -34,6 +34,18 @@ void disable_sched_clock_irqtime(void) sched_clock_irqtime = 0; } +static void irqtime_account_delta(struct irqtime *irqtime, u64 delta, + enum cpu_usage_stat idx) +{ + u64 *cpustat = kcpustat_this_cpu->cpustat; + + u64_stats_update_begin(&irqtime->sync); + cpustat[idx] += delta; + irqtime->total += delta; + irqtime->tick_delta += delta; + u64_stats_update_end(&irqtime->sync); +} + /* * Called before incrementing preempt_count on {soft,}irq_enter * and before decrementing preempt_count on {soft,}irq_exit. @@ -41,7 +53,6 @@ void disable_sched_clock_irqtime(void) void irqtime_account_irq(struct task_struct *curr) { struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime); - u64 *cpustat = kcpustat_this_cpu->cpustat; s64 delta; int cpu; @@ -52,22 +63,16 @@ void irqtime_account_irq(struct task_struct *curr) delta = sched_clock_cpu(cpu) - irqtime->irq_start_time; irqtime->irq_start_time += delta; - u64_stats_update_begin(&irqtime->sync); /* * We do not account for softirq time from ksoftirqd here. * We want to continue accounting softirq time to ksoftirqd thread * in that case, so as not to confuse scheduler with a special task * that do not consume any time, but still wants to run. */ - if (hardirq_count()) { - cpustat[CPUTIME_IRQ] += delta; - irqtime->tick_delta += delta; - } else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) { - cpustat[CPUTIME_SOFTIRQ] += delta; - irqtime->tick_delta += delta; - } - - u64_stats_update_end(&irqtime->sync); + if (hardirq_count()) + irqtime_account_delta(irqtime, delta, CPUTIME_IRQ); + else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) + irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ); } EXPORT_SYMBOL_GPL(irqtime_account_irq); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 5cbf92214ad8..767aab3505a8 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1869,6 +1869,7 @@ static inline void nohz_balance_exit_idle(unsigned int cpu) { } #ifdef CONFIG_IRQ_TIME_ACCOUNTING struct irqtime { + u64 total; u64 tick_delta; u64 irq_start_time; struct u64_stats_sync sync; @@ -1876,16 +1877,20 @@ struct irqtime { DECLARE_PER_CPU(struct irqtime, cpu_irqtime); +/* + * Returns the irqtime minus the softirq time computed by ksoftirqd. + * Otherwise ksoftirqd's sum_exec_runtime is substracted its own runtime + * and never move forward. 
+ */ static inline u64 irq_time_read(int cpu) { struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu); - u64 *cpustat = kcpustat_cpu(cpu).cpustat; unsigned int seq; u64 total; do { seq = __u64_stats_fetch_begin(&irqtime->sync); - total = cpustat[CPUTIME_SOFTIRQ] + cpustat[CPUTIME_IRQ]; + total = irqtime->total; } while (__u64_stats_fetch_retry(&irqtime->sync, seq)); return total; -- cgit From d86ab9cff8b936aadde444d0e263a8db5ff0349b Mon Sep 17 00:00:00 2001 From: Juri Lelli Date: Wed, 3 May 2017 14:30:48 +0100 Subject: cpufreq: schedutil: use now as reference when aggregating shared policy requests Currently, sugov_next_freq_shared() uses last_freq_update_time as a reference to decide when to start considering CPU contributions as stale. However, since last_freq_update_time is set by the last CPU that issued a frequency transition, this might cause problems in certain cases. In practice, the detection of stale utilization values fails whenever the CPU with such values was the last to update the policy. For example (and please note again that the SCHED_CPUFREQ_RT flag is not the problem here, but only the detection of after how much time that flag has to be considered stale), suppose a policy with 2 CPUs:

 CPU0                              | CPU1
                                   |
                                   | RT task scheduled
                                   | SCHED_CPUFREQ_RT is set
                                   | CPU1->last_update = now
                                   | freq transition to max
                                   | last_freq_update_time = now
                                   |
        more than TICK_NSEC nsecs
                                   |
 a small CFS wakes up              |
 CPU0->last_update = now1          |
 delta_ns(CPU0) < TICK_NSEC*       |
 CPU0's util is considered         |
 delta_ns(CPU1) =                  |
   last_freq_update_time -         |
   CPU1->last_update = 0           |
   < TICK_NSEC                     |
 CPU1 is still considered          |
 CPU1->SCHED_CPUFREQ_RT is set     |
 we stay at max (until CPU1        |
 exits from idle)                  |

 * delta_ns is actually negative as now1 > last_freq_update_time

While last_freq_update_time is a sensible reference for rate limiting, it doesn't seem to be useful for working around stale CPU states. Fix the problem by always considering now (time) as the reference for deciding when CPUs have stale contributions. Signed-off-by: Juri Lelli Acked-by: Vincent Guittot Acked-by: Viresh Kumar Signed-off-by: Rafael J. Wysocki --- kernel/sched/cpufreq_schedutil.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'kernel/sched') diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 76877a62b5fa..622eed1b7658 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -245,11 +245,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time, sugov_update_commit(sg_policy, time, next_f); } -static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu) +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) { struct sugov_policy *sg_policy = sg_cpu->sg_policy; struct cpufreq_policy *policy = sg_policy->policy; - u64 last_freq_update_time = sg_policy->last_freq_update_time; unsigned long util = 0, max = 1; unsigned int j; @@ -265,7 +264,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu) * enough, don't take the CPU into account as it probably is * idle now (and clear iowait_boost for it).
*/ - delta_ns = last_freq_update_time - j_sg_cpu->last_update; + delta_ns = time - j_sg_cpu->last_update; if (delta_ns > TICK_NSEC) { j_sg_cpu->iowait_boost = 0; continue; } @@ -309,7 +308,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time, if (flags & SCHED_CPUFREQ_RT_DL) next_f = sg_policy->policy->cpuinfo.max_freq; else - next_f = sugov_next_freq_shared(sg_cpu); + next_f = sugov_next_freq_shared(sg_cpu, time); sugov_update_commit(sg_policy, time, next_f); } -- cgit From 8663effb24f9430394d3bf1ed2dac42a771421d1 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Fri, 14 Apr 2017 08:48:09 -0400 Subject: sched/core: Call __schedule() from do_idle() without enabling preemption I finally got around to creating trampolines for dynamically allocated ftrace_ops using synchronize_rcu_tasks(). For users of the ftrace function hook callbacks, like perf, that allocate the ftrace_ops descriptor via kmalloc() and friends, ftrace was not able to optimize the functions being traced to use a trampoline because they would also need to be allocated dynamically. The problem is that they cannot be freed when CONFIG_PREEMPT is set, as there's no way to tell if a task was preempted on the trampoline. That was before Paul McKenney implemented synchronize_rcu_tasks(), which makes sure all tasks (except idle) have scheduled out or have entered user space. While testing this, I triggered this bug:

 BUG: unable to handle kernel paging request at ffffffffa0230077
 ...
 RIP: 0010:0xffffffffa0230077
 ...
 Call Trace:
  schedule+0x5/0xe0
  schedule_preempt_disabled+0x18/0x30
  do_idle+0x172/0x220

What happened was that the idle task was preempted on the trampoline. As synchronize_rcu_tasks() ignores the idle thread, there's nothing that lets ftrace know that the idle task was preempted on a trampoline. The idle task shouldn't need to ever enable preemption. The idle task is simply a loop that calls schedule or places the cpu into idle mode. In fact, having preemption enabled is inefficient, because it can happen when idle is just about to call schedule anyway, which would cause schedule to be called twice. Once for when the interrupt came in and was returning back to normal context, and then again in the normal path that the idle loop is running in, which would be pointless, as it had already scheduled. The only reason schedule_preempt_disabled() enables preemption is to be able to call sched_submit_work(), which requires preemption enabled. As this is a nop when the task is in the RUNNING state, and idle is always in the running state, there's no reason that idle needs to enable preemption. But that means it cannot use schedule_preempt_disabled() as other callers of that function require calling sched_submit_work(). Adding a new function local to kernel/sched/ that allows idle to call the scheduler without enabling preemption fixes the synchronize_rcu_tasks() issue, as well as removing the pointless spurious schedule calls caused by interrupts happening in the brief window where preemption is enabled just before it calls schedule. Reviewed-by: Thomas Gleixner Signed-off-by: Steven Rostedt (VMware) Signed-off-by: Peter Zijlstra (Intel) Acked-by: Paul E.
McKenney Cc: Andrew Morton Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 25 +++++++++++++++++++++++++ kernel/sched/idle.c | 2 +- kernel/sched/sched.h | 2 ++ 3 files changed, 28 insertions(+), 1 deletion(-) (limited to 'kernel/sched') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 759f4bd52cd6..803c3bc274c4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3502,6 +3502,31 @@ asmlinkage __visible void __sched schedule(void) } EXPORT_SYMBOL(schedule); +/* + * synchronize_rcu_tasks() makes sure that no task is stuck in preempted + * state (have scheduled out non-voluntarily) by making sure that all + * tasks have either left the run queue or have gone into user space. + * As idle tasks do not do either, they must not ever be preempted + * (schedule out non-voluntarily). + * + * schedule_idle() is similar to schedule_preempt_disable() except that it + * never enables preemption because it does not call sched_submit_work(). + */ +void __sched schedule_idle(void) +{ + /* + * As this skips calling sched_submit_work(), which the idle task does + * regardless because that function is a nop when the task is in a + * TASK_RUNNING state, make sure this isn't used someplace that the + * current task can be in any other state. Note, idle is always in the + * TASK_RUNNING state. + */ + WARN_ON_ONCE(current->state); + do { + __schedule(false); + } while (need_resched()); +} + #ifdef CONFIG_CONTEXT_TRACKING asmlinkage __visible void __sched schedule_user(void) { diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 2a25a9ec2c6e..ef63adce0c9c 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -265,7 +265,7 @@ static void do_idle(void) smp_mb__after_atomic(); sched_ttwu_pending(); - schedule_preempt_disabled(); + schedule_idle(); if (unlikely(klp_patch_pending(current))) klp_update_patch_state(current); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 7808ab050599..6dda2aab731e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1467,6 +1467,8 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq) } #endif +extern void schedule_idle(void); + extern void sysrq_sched_debug_show(void); extern void sched_init_granularity(void); extern void update_max_interval(void); -- cgit