author     Zqiang <qiang.zhang1211@gmail.com>    2023-05-24 11:53:39 +0800
committer  Tejun Heo <tj@kernel.org>             2023-05-24 11:57:26 -1000
commit     c8f6219be2e58d7f676935ae90b64abef5d0966a
tree       413ea94ccb43432d69dab4ef0e66b9237b463481 /kernel/workqueue.c
parent     525ff9c2965770762b81d679820552a208070d59
workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()
Currently, pool->nr_running can be modified from the timer tick, which means
the timer tick can run nested inside a not-irq-protected section that is in
the process of modifying nr_running. Consider the following scenario:

CPU0
kworker/0:2 (events)
  worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
    ->pool->nr_running++;  (1)

  process_one_work()
  ->worker->current_func(work);
    ->schedule()
      ->wq_worker_sleeping()
        ->worker->sleeping = 1;
        ->pool->nr_running--;  (0)
    ....
    ->wq_worker_running()
        ....
        CPU0 by interrupt:
          wq_worker_tick()
          ->worker_set_flags(worker, WORKER_CPU_INTENSIVE);
            ->pool->nr_running--;  (-1)
            ->worker->flags |= WORKER_CPU_INTENSIVE;
        ....
        ->if (!(worker->flags & WORKER_NOT_RUNNING))
          ->pool->nr_running++;  (will not execute)
        ->worker->sleeping = 0;
        ....
        ->worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
          ->pool->nr_running++;  (0)
  ....
  worker_set_flags(worker, WORKER_PREP);
    ->pool->nr_running--;  (-1)
  ....
  worker_enter_idle()
    ->WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);

If nr_workers equals nr_idle while nr_running is still non-zero, the
WARN_ON_ONCE() fires:

[    2.460602] WARNING: CPU: 0 PID: 63 at kernel/workqueue.c:1999 worker_enter_idle+0xb2/0xc0
[    2.462163] Modules linked in:
[    2.463401] CPU: 0 PID: 63 Comm: kworker/0:2 Not tainted 6.4.0-rc2-next-20230519 #1
[    2.463771] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[    2.465127] Workqueue: 0x0 (events)
[    2.465678] RIP: 0010:worker_enter_idle+0xb2/0xc0
...
[    2.472614] Call Trace:
[    2.473152]  <TASK>
[    2.474182]  worker_thread+0x71/0x430
[    2.474992]  ? _raw_spin_unlock_irqrestore+0x28/0x50
[    2.475263]  kthread+0x103/0x120
[    2.475493]  ? __pfx_worker_thread+0x10/0x10
[    2.476355]  ? __pfx_kthread+0x10/0x10
[    2.476635]  ret_from_fork+0x2c/0x50
[    2.477051]  </TASK>

This commit therefore adds a check of worker->sleeping in wq_worker_tick():
if worker->sleeping is non-zero, return directly without marking the worker
CPU_INTENSIVE.

tj: Updated comment and description.

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Closes: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230519/testrun/17078554/suite/boot/test/clang-nightly-lkftconfig/log
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
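To see why the plain loads and stores on worker->sleeping need the
READ_ONCE()/WRITE_ONCE() annotations in the diff below, here is a minimal
userspace sketch of the same shape of race, with C11 relaxed atomics standing
in for the kernel macros. All names (fake_sleeping, fake_tick) are
hypothetical and for illustration only; this is not kernel code.

  #include <stdatomic.h>
  #include <stdio.h>

  /* Stand-ins for pool->nr_running and worker->sleeping. */
  static atomic_int nr_running;
  static atomic_int sleeping;

  /* Task context, shaped like wq_worker_sleeping(): announce the
   * voluntary switch-out first, then drop the nr_running contribution. */
  static void fake_sleeping(void)
  {
          if (atomic_load_explicit(&sleeping, memory_order_relaxed))
                  return;
          atomic_store_explicit(&sleeping, 1, memory_order_relaxed);
          atomic_fetch_sub(&nr_running, 1);
  }

  /* Timer-tick context, shaped like the fixed wq_worker_tick(): if the
   * worker already announced it is switching out, a CPU_INTENSIVE-style
   * decrement here would be the double decrement, so skip instead. */
  static void fake_tick(void)
  {
          if (atomic_load_explicit(&sleeping, memory_order_relaxed))
                  return;  /* the fix: the task is releasing the CPU anyway */
          atomic_fetch_sub(&nr_running, 1);
  }

  int main(void)
  {
          atomic_store(&nr_running, 1);
          fake_sleeping();  /* nr_running: 1 -> 0 */
          fake_tick();      /* without the check: 0 -> -1; with it: no-op */
          printf("nr_running = %d\n", atomic_load(&nr_running));  /* 0 */
          return 0;
  }

In the kernel the two contexts really do interleave on one CPU (task context
against its own timer interrupt), so the flag accesses must be single, untorn
loads and stores; relaxed atomics are the closest portable userspace analogue
of READ_ONCE()/WRITE_ONCE().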
Diffstat (limited to 'kernel/workqueue.c')
-rw-r--r--  kernel/workqueue.c  17
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ee16ddb0647c..3ad6806c7161 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1051,7 +1051,7 @@ void wq_worker_running(struct task_struct *task)
{
struct worker *worker = kthread_data(task);
- if (!worker->sleeping)
+ if (!READ_ONCE(worker->sleeping))
return;
/*
@@ -1071,7 +1071,7 @@ void wq_worker_running(struct task_struct *task)
*/
worker->current_at = worker->task->se.sum_exec_runtime;
- worker->sleeping = 0;
+ WRITE_ONCE(worker->sleeping, 0);
}
/**
@@ -1097,10 +1097,10 @@ void wq_worker_sleeping(struct task_struct *task)
pool = worker->pool;
/* Return if preempted before wq_worker_running() was reached */
- if (worker->sleeping)
+ if (READ_ONCE(worker->sleeping))
return;
- worker->sleeping = 1;
+ WRITE_ONCE(worker->sleeping, 1);
raw_spin_lock_irq(&pool->lock);
/*
@@ -1143,8 +1143,15 @@ void wq_worker_tick(struct task_struct *task)
* If the current worker is concurrency managed and hogged the CPU for
* longer than wq_cpu_intensive_thresh_us, it's automatically marked
* CPU_INTENSIVE to avoid stalling other concurrency-managed work items.
+ *
+ * Set @worker->sleeping means that @worker is in the process of
+ * switching out voluntarily and won't be contributing to
+ * @pool->nr_running until it wakes up. As wq_worker_sleeping() also
+ * decrements ->nr_running, setting CPU_INTENSIVE here can lead to
+ * double decrements. The task is releasing the CPU anyway. Let's skip.
+ * We probably want to make this prettier in the future.
*/
- if ((worker->flags & WORKER_NOT_RUNNING) ||
+ if ((worker->flags & WORKER_NOT_RUNNING) || READ_ONCE(worker->sleeping) ||
worker->task->se.sum_exec_runtime - worker->current_at <
wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
return;
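For reference, READ_ONCE() and WRITE_ONCE() boil down to volatile accesses.
A simplified model follows; the real macros live in
include/asm-generic/rwonce.h and add type checking (and, on Alpha, a read
barrier), so this sketch is illustrative, not the kernel's exact definition:

  /* Simplified model only. The volatile cast forces the compiler to emit
   * exactly one untorn load or store, so worker->sleeping can be read
   * from the tick interrupt while task context writes it, without the
   * compiler caching, re-reading, or tearing the access. */
  #define READ_ONCE(x)        (*(const volatile typeof(x) *)&(x))
  #define WRITE_ONCE(x, val)  (*(volatile typeof(x) *)&(x) = (val))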