path: root/kernel/workqueue.c
Age  Commit message  Author
2024-09-11  workqueue: Clear worker->pool in the worker thread context  (Lai Jiangshan)
Marc Hartmayer reported: [ 23.133876] Unable to handle kernel pointer dereference in virtual kernel address space [ 23.133950] Failing address: 0000000000000000 TEID: 0000000000000483 [ 23.133954] Fault in home space mode while using kernel ASCE. [ 23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d [ 23.134207] Oops: 0004 ilc:2 [#1] SMP (snip) [ 23.134516] Call Trace: [ 23.134520] [<0000024e326caf28>] worker_thread+0x48/0x430 [ 23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430) [ 23.134528] [<0000024e326d3a3e>] kthread+0x11e/0x130 [ 23.134533] [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60 [ 23.134536] [<0000024e333fb37a>] ret_from_fork+0xa/0x38 [ 23.134552] Last Breaking-Event-Address: [ 23.134553] [<0000024e333f4c04>] mutex_unlock+0x24/0x30 [ 23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops Debugging and analysis show that worker_thread() accesses the nullified worker->pool when the newly created worker is destroyed before being woken up, in which case worker_thread() can see the result of detach_worker() resetting worker->pool to NULL at the beginning. Move the code "worker->pool = NULL;" out of detach_worker() to fix the problem. worker->pool had been designed to be constant for regular workers and changeable for the rescuer. To share attaching/detaching code for regular and rescuer workers and to avoid worker->pool being accessed inadvertently when the worker has been detached, worker->pool is reset to NULL when detached, no matter whether the worker is a rescuer or not. To keep worker->pool reset after detaching, move the code "worker->pool = NULL;" into the worker thread context after the worker is detached. It runs either in the regular worker thread context after PF_WQ_WORKER is cleared or in the rescuer worker thread context with wq_pool_attach_mutex held, so it is safe to do so. Cc: Marc Hartmayer <mhartmay@linux.ibm.com> Link: https://lore.kernel.org/lkml/87wmjj971b.fsf@linux.ibm.com/ Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Fixes: f4b7b53c94af ("workqueue: Detach workers directly in idle_cull_fn()") Cc: stable@vger.kernel.org # v6.11+ Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05  workqueue: Correct declaration of cpu_pwq in struct workqueue_struct  (Uros Bizjak)
cpu_pwq is used in various percpu functions that expect a variable in the __percpu address space. Correct the declaration of cpu_pwq to struct pool_workqueue __rcu * __percpu *cpu_pwq to declare the variable as a __percpu pointer. The patch also fixes the following sparse errors: workqueue.c:380:37: warning: duplicate [noderef] workqueue.c:380:37: error: multiple address spaces given: __rcu & __percpu workqueue.c:2271:15: error: incompatible types in comparison expression (different address spaces): workqueue.c:2271:15: struct pool_workqueue [noderef] __rcu * workqueue.c:2271:15: struct pool_workqueue [noderef] __percpu * and uncovers a couple of existing "incorrect type in assignment" warnings (from the __rcu address space), which this patch does not address. Found by GCC's named address space checks. There were no changes in the resulting object files. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
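The corrected declaration is quoted in the message above; the fragment below is only an illustrative kernel-style sketch of how the two address-space qualifiers nest (the struct name and access helper are shown for context, not copied from the tree):

/*
 * Illustrative sketch only, not the full workqueue_struct: cpu_pwq is a
 * per-CPU (__percpu) pointer, and each per-CPU slot holds an RCU-protected
 * (__rcu) pointer to a pool_workqueue.
 */
struct wq_sketch {
	struct pool_workqueue __rcu * __percpu *cpu_pwq;
};

/* Typical access pattern, assuming the RCU read side is held: */
static struct pool_workqueue *wq_sketch_pwq(struct wq_sketch *wq, int cpu)
{
	return rcu_dereference(*per_cpu_ptr(wq->cpu_pwq, cpu));
}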
2024-08-05  workqueue: Fix spurious data race in __flush_work()  (Tejun Heo)
When flushing a work item for cancellation, __flush_work() knows that it exclusively owns the work item through its PENDING bit. 134874e2eee9 ("workqueue: Allow cancel_work_sync() and disable_work() from atomic contexts on BH work items") added a read of @work->data to determine whether to use busy wait for BH work items that are being canceled. While the read is safe when @from_cancel, @work->data was read before testing @from_cancel to simplify code structure: data = *work_data_bits(work); if (from_cancel && !WARN_ON_ONCE(data & WORK_STRUCT_PWQ) && (data & WORK_OFFQ_BH)) { While the read data was never used if !@from_cancel, this could trigger KCSAN data race detection spuriously: ================================================================== BUG: KCSAN: data-race in __flush_work / __flush_work write to 0xffff8881223aa3e8 of 8 bytes by task 3998 on cpu 0: instrument_write include/linux/instrumented.h:41 [inline] ___set_bit include/asm-generic/bitops/instrumented-non-atomic.h:28 [inline] insert_wq_barrier kernel/workqueue.c:3790 [inline] start_flush_work kernel/workqueue.c:4142 [inline] __flush_work+0x30b/0x570 kernel/workqueue.c:4178 flush_work kernel/workqueue.c:4229 [inline] ... read to 0xffff8881223aa3e8 of 8 bytes by task 50 on cpu 1: __flush_work+0x42a/0x570 kernel/workqueue.c:4188 flush_work kernel/workqueue.c:4229 [inline] flush_delayed_work+0x66/0x70 kernel/workqueue.c:4251 ... value changed: 0x0000000000400000 -> 0xffff88810006c00d Reorganize the code so that @from_cancel is tested before @work->data is accessed. The only problem is triggering KCSAN detection spuriously. This shouldn't need READ_ONCE() or other access qualifiers. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: syzbot+b3e4f2f51ed645fd5df2@syzkaller.appspotmail.com Fixes: 134874e2eee9 ("workqueue: Allow cancel_work_sync() and disable_work() from atomic contexts on BH work items") Link: http://lkml.kernel.org/r/000000000000ae429e061eea2157@google.com Cc: Jens Axboe <axboe@kernel.dk>
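A simplified sketch of the reordering described above (not the exact upstream hunk): @work->data is only read once the cancel path is confirmed, so the speculative read that KCSAN flagged disappears.

	if (from_cancel) {
		unsigned long data = *work_data_bits(work);

		/* Busy-wait only for an off-queue BH work item. */
		if (!WARN_ON_ONCE(data & WORK_STRUCT_PWQ) &&
		    (data & WORK_OFFQ_BH)) {
			/* ... busy-wait for the in-flight BH instance ... */
		}
	}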
2024-08-05workqueue: Remove incorrect "WARN_ON_ONCE(!list_empty(&worker->entry));" ↵Lai Jiangshan
from dying worker The commit 68f83057b913 ("workqueue: Reap workers via kthread_stop() and remove detach_completion") changes the procedure of destroying workers; the dying workers are kept in the cull_list in wake_dying_workers() with the pool lock held and removed from the cull_list by the newly added reap_dying_workers() without the pool lock. This can cause a warning if the dying worker is wokenup earlier than reaped as reported by Marc: 2024/07/23 18:01:21 [M83LP63]: [ 157.267727] ------------[ cut here ]------------ 2024/07/23 18:01:21 [M83LP63]: [ 157.267735] WARNING: CPU: 21 PID: 725 at kernel/workqueue.c:3340 worker_thread+0x54e/0x558 2024/07/23 18:01:21 [M83LP63]: [ 157.267746] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables sunrpc dm_service_time s390_trng vfio_ccw mdev vfio_iommu_type1 vfio sch_fq_codel 2024/07/23 18:01:21 [M83LP63]: loop dm_multipath configfs nfnetlink lcs ctcm fsm zfcp scsi_transport_fc ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common scm_block eadm_sch scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey zcrypt rng_core autofs4 2024/07/23 18:01:21 [M83LP63]: [ 157.267792] CPU: 21 PID: 725 Comm: kworker/dying Not tainted 6.10.0-rc2-00239-g68f83057b913 #95 2024/07/23 18:01:21 [M83LP63]: [ 157.267796] Hardware name: IBM 3906 M04 704 (LPAR) 2024/07/23 18:01:21 [M83LP63]: [ 157.267802] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 2024/07/23 18:01:21 [M83LP63]: [ 157.267797] Krnl PSW : 0704d00180000000 000003d600fcd9fa (worker_thread+0x552/0x558) 2024/07/23 18:01:21 [M83LP63]: [ 157.267806] Krnl GPRS: 6479696e6700776f 000002c901b62780 000003d602493ec8 000002c914954600 2024/07/23 18:01:21 [M83LP63]: [ 157.267809] 0000000000000000 0000000000000008 000002c901a85400 000002c90719e840 2024/07/23 18:01:21 [M83LP63]: [ 157.267811] 000002c90719e880 000002c901a85420 000002c91127adf0 000002c901a85400 2024/07/23 18:01:21 [M83LP63]: [ 157.267813] 000002c914954600 0000000000000000 000003d600fcd772 000003560452bd98 2024/07/23 18:01:21 [M83LP63]: [ 157.267822] Krnl Code: 000003d600fcd9ec: c0e500674262 brasl %r14,000003d601cb5eb0 2024/07/23 18:01:21 [M83LP63]: [ 157.267822] 000003d600fcd9f2: a7f4ffc8 brc 15,000003d600fcd982 2024/07/23 18:01:21 [M83LP63]: [ 157.267822] #000003d600fcd9f6: af000000 mc 0,0 2024/07/23 18:01:21 [M83LP63]: [ 157.267822] >000003d600fcd9fa: a7f4fec2 brc 15,000003d600fcd77e 2024/07/23 18:01:21 [M83LP63]: [ 157.267822] 000003d600fcd9fe: 0707 bcr 0,%r7 2024/07/23 18:01:21 [M83LP63]: [ 157.267822] 000003d600fcda00: c00400682e10 brcl 0,000003d601cd3620 2024/07/23 18:01:21 [M83LP63]: [ 157.267822] 000003d600fcda06: eb7ff0500024 stmg %r7,%r15,80(%r15) 2024/07/23 18:01:21 [M83LP63]: [ 157.267822] 000003d600fcda0c: b90400ef lgr %r14,%r15 2024/07/23 18:01:21 [M83LP63]: [ 157.267853] Call Trace: 2024/07/23 18:01:21 [M83LP63]: [ 157.267855] [<000003d600fcd9fa>] worker_thread+0x552/0x558 2024/07/23 18:01:21 [M83LP63]: [ 157.267859] ([<000003d600fcd772>] worker_thread+0x2ca/0x558) 2024/07/23 18:01:21 [M83LP63]: [ 157.267862] [<000003d600fd6c80>] kthread+0x120/0x128 2024/07/23 18:01:21 [M83LP63]: [ 157.267865] [<000003d600f5305c>] __ret_from_fork+0x3c/0x58 2024/07/23 18:01:21 [M83LP63]: [ 157.267868] [<000003d601cc746a>] ret_from_fork+0xa/0x30 2024/07/23 18:01:21 [M83LP63]: [ 157.267873] Last 
Breaking-Event-Address: 2024/07/23 18:01:21 [M83LP63]: [ 157.267874] [<000003d600fcd778>] worker_thread+0x2d0/0x558 Since the procedure of destroying workers is changed, the WARN_ON_ONCE() becomes incorrect and should be removed. Cc: Marc Hartmayer <mhartmay@linux.ibm.com> Link: https://lore.kernel.org/lkml/87le1sjd2e.fsf@linux.ibm.com/ Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Fixes: 68f83057b913 ("workqueue: Reap workers via kthread_stop() and remove detach_completion") Cc: stable@vger.kernel.org # v6.11+ Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05  workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()  (Will Deacon)
UBSAN reports the following 'subtraction overflow' error when booting in a virtual machine on Android: | Internal error: UBSAN: integer subtraction overflow: 00000000f2005515 [#1] PREEMPT SMP | Modules linked in: | CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.10.0-00006-g3cbe9e5abd46-dirty #4 | Hardware name: linux,dummy-virt (DT) | pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : cancel_delayed_work+0x34/0x44 | lr : cancel_delayed_work+0x2c/0x44 | sp : ffff80008002ba60 | x29: ffff80008002ba60 x28: 0000000000000000 x27: 0000000000000000 | x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 | x23: 0000000000000000 x22: 0000000000000000 x21: ffff1f65014cd3c0 | x20: ffffc0e84c9d0da0 x19: ffffc0e84cab3558 x18: ffff800080009058 | x17: 00000000247ee1f8 x16: 00000000247ee1f8 x15: 00000000bdcb279d | x14: 0000000000000001 x13: 0000000000000075 x12: 00000a0000000000 | x11: ffff1f6501499018 x10: 00984901651fffff x9 : ffff5e7cc35af000 | x8 : 0000000000000001 x7 : 3d4d455453595342 x6 : 000000004e514553 | x5 : ffff1f6501499265 x4 : ffff1f650ff60b10 x3 : 0000000000000620 | x2 : ffff80008002ba78 x1 : 0000000000000000 x0 : 0000000000000000 | Call trace: | cancel_delayed_work+0x34/0x44 | deferred_probe_extend_timeout+0x20/0x70 | driver_register+0xa8/0x110 | __platform_driver_register+0x28/0x3c | syscon_init+0x24/0x38 | do_one_initcall+0xe4/0x338 | do_initcall_level+0xac/0x178 | do_initcalls+0x5c/0xa0 | do_basic_setup+0x20/0x30 | kernel_init_freeable+0x8c/0xf8 | kernel_init+0x28/0x1b4 | ret_from_fork+0x10/0x20 | Code: f9000fbf 97fffa2f 39400268 37100048 (d42aa2a0) | ---[ end trace 0000000000000000 ]--- | Kernel panic - not syncing: UBSAN: integer subtraction overflow: Fatal exception This is due to shift_and_mask() using a signed immediate to construct the mask and being called with a shift of 31 (WORK_OFFQ_POOL_SHIFT) so that it ends up decrementing from INT_MIN. Use an unsigned constant '1U' to generate the mask in shift_and_mask(). Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Fixes: 1211f3b21c2a ("workqueue: Preserve OFFQ bits in cancel[_sync] paths") Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
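A stand-alone C illustration of the overflow pattern (this is not the kernel's shift_and_mask() itself): with a signed constant, building a 31-bit mask evaluates (1 << 31) - 1, i.e. a subtraction from INT_MIN, which is what UBSAN reports; an unsigned 1U keeps the arithmetic defined.

#include <stdio.h>

static unsigned long mask_of(unsigned int bits)
{
	/* return (1 << bits) - 1;   buggy: signed overflow when bits == 31 */
	return (1U << bits) - 1;  /* fixed: unsigned arithmetic is well defined here */
}

int main(void)
{
	printf("0x%lx\n", mask_of(31));	/* prints 0x7fffffff */
	return 0;
}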
2024-07-15  workqueue: Remove unneeded lockdep_assert_cpus_held()  (Lai Jiangshan)
The commit 19af45757383 ("workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()") removes the unneeded cpus_read_lock() after the pwq creations and installations have been reworked to be based on wq_online_cpumask rather than cpu_online_mask, making cpus_read_lock() unnecessary during wqattrs changes. But it doesn't remove the lockdep_assert_cpus_held() checks during wqattrs changes, which leads to complaints from lockdep reported by the kernel test robot: [ 15.726567][ T131] ------------[ cut here ]------------ [ 15.728117][ T131] WARNING: CPU: 1 PID: 131 at kernel/cpu.c:525 lockdep_assert_cpus_held (kernel/cpu.c:525) [ 15.731191][ T131] Modules linked in: floppy(+) parport_pc(+) parport qemu_fw_cfg rtc_cmos [ 15.733423][ T131] CPU: 1 PID: 131 Comm: systemd-udevd Tainted: G T 6.10.0-rc2-00254-g19af45757383 #1 df6f039f42e8818bf9a534449362ebad1aad32e2 [ 15.737011][ T131] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 15.739760][ T131] EIP: lockdep_assert_cpus_held (kernel/cpu.c:525) [ 15.741326][ T131] Code: 97 c2 03 72 20 83 3d f4 73 97 c2 00 74 17 55 89 e5 b8 fc bd 4d c2 ba ff ff ff ff e8 e4 57 d1 00 85 c0 74 06 5d 31 c0 31 d2 c3 <0f> 0b eb f6 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 b8 Fix it by removing the unneeded lockdep_assert_cpus_held(). Also remove the unneeded cpus_read_lock() from wq_affn_dfl_set(). tj: Dropped the removal of cpus_read_lock/unlock() in wq_affn_dfl_set() to keep this patch fix-only. Cc: kernel test robot <oliver.sang@intel.com> Fixes: 19af45757383 ("workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202407141846.665c0446-lkp@intel.com Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-15  Merge tag 'wq-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  (Linus Torvalds)
Pull workqueue updates from Tejun Heo: - Lai fixed a bug where CPU hotplug and workqueue attribute changes race, leaving some workqueues not fully updated. This involved refactoring and changing how online CPUs are tracked. The resulting code is cleaner. - Workqueue watchdog touch operation was causing too much cacheline contention on very large machines. Nicholas improved scalability by avoiding unnecessary global updates. - Code cleanups and minor rescuer behavior improvement. - The last commit 58629d4871e8 ("workqueue: Always queue work items to the newest PWQ for order workqueues") is a cherry-picked straggler commit from for-6.10-fixes, a fix for a bug which may not actually trigger. * tag 'wq-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (24 commits) workqueue: Always queue work items to the newest PWQ for order workqueues workqueue: Rename wq_update_pod() to unbound_wq_update_pwq() workqueue: Remove the arguments @hotplug_cpu and @online from wq_update_pod() workqueue: Remove the argument @cpu_going_down from wq_calc_pod_cpumask() workqueue: Remove the unneeded cpumask empty check in wq_calc_pod_cpumask() workqueue: Remove cpus_read_lock() from apply_wqattrs_lock() workqueue: Simplify wq_calc_pod_cpumask() with wq_online_cpumask workqueue: Add wq_online_cpumask workqueue: Init rescuer's affinities as the wq's effective cpumask workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S. workqueue: Move kthread_flush_worker() out of alloc_and_link_pwqs() workqueue: Make rescuer initialization as the last step of the creation of a new wq workqueue: Register sysfs after the whole creation of the new wq workqueue: Simplify goto statement workqueue: Update cpumasks after only applying it successfully workqueue: Improve scalability of workqueue watchdog touch workqueue: wq_watchdog_touch is always called with valid CPU workqueue: Remove useless pool->dying_workers workqueue: Detach workers directly in idle_cull_fn() workqueue: Don't bind the rescuer in the last working cpu ...
2024-07-14  workqueue: Always queue work items to the newest PWQ for order workqueues  (Lai Jiangshan)
To ensure non-reentrancy, __queue_work() attempts to enqueue a work item to the pool of the currently executing worker. This is not only unnecessary for an ordered workqueue, where order inherently suggests non-reentrancy, but it could also disrupt the sequence if the item is not enqueued on the newest PWQ. Just queue it to the newest PWQ and let order management guarantee non-reentrancy. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues") Cc: stable@vger.kernel.org # v6.9+ Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit 74347be3edfd11277799242766edf844c43dd5d3)
2024-07-11  workqueue: Rename wq_update_pod() to unbound_wq_update_pwq()  (Lai Jiangshan)
What wq_update_pod() does is just to update the pwq of the specific cpu. Rename it and update the comments. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Remove the arguments @hotplug_cpu and @online from wq_update_pod()  (Lai Jiangshan)
The arguments @hotplug_cpu and @online are not used in wq_update_pod() since the functions called by wq_update_pod() don't need them. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Remove the argument @cpu_going_down from wq_calc_pod_cpumask()  (Lai Jiangshan)
wq_calc_pod_cpumask() uses wq_online_cpumask, which excludes the cpu going down, so the argument cpu_going_down is unused and can be removed. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Remove the unneeded cpumask empty check in wq_calc_pod_cpumask()  (Lai Jiangshan)
The cpumask empty check in wq_calc_pod_cpumask() has long been useless. It serves purely as documentation, stating that the cpumask cannot be empty after the function returns. Now the code above it is even more explicit that the cpumask is not empty, so the documentation-only empty check can be removed. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()  (Lai Jiangshan)
1726a1713590 ("workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.") led to the following possible deadlock: WARNING: possible recursive locking detected 6.10.0-rc5-00004-g1d4c6111406c #1 Not tainted -------------------------------------------- swapper/0/1 is trying to acquire lock: c27760f4 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue (kernel/workqueue.c:5152 kernel/workqueue.c:5730) but task is already holding lock: c27760f4 (cpu_hotplug_lock){++++}-{0:0}, at: padata_alloc (kernel/padata.c:1007) ... stack backtrace: ... cpus_read_lock (include/linux/percpu-rwsem.h:53 kernel/cpu.c:488) alloc_workqueue (kernel/workqueue.c:5152 kernel/workqueue.c:5730) padata_alloc (kernel/padata.c:1007 (discriminator 1)) pcrypt_init_padata (crypto/pcrypt.c:327 (discriminator 1)) pcrypt_init (crypto/pcrypt.c:353) do_one_initcall (init/main.c:1267) do_initcalls (init/main.c:1328 (discriminator 1) init/main.c:1345 (discriminator 1)) kernel_init_freeable (init/main.c:1364) kernel_init (init/main.c:1469) ret_from_fork (arch/x86/kernel/process.c:153) ret_from_fork_asm (arch/x86/entry/entry_32.S:737) entry_INT80_32 (arch/x86/entry/entry_32.S:944) This is caused by pcrypt allocating a workqueue while holding cpus_read_lock(), so the workqueue code can't do it again, as that can lead to deadlocks if down_write starts after the first down_read. The pwq creations and installations have been reworked to be based on wq_online_cpumask rather than cpu_online_mask, making cpus_read_lock() unneeded during wqattrs changes. Fix the deadlock by removing cpus_read_lock() from apply_wqattrs_lock(). tj: Updated changelog. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Fixes: 1726a1713590 ("workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.") Link: http://lkml.kernel.org/r/202407081521.83b627c1-lkp@intel.com Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Simplify wq_calc_pod_cpumask() with wq_online_cpumask  (Lai Jiangshan)
Avoid relying on cpu_online_mask for wqattrs changes so that cpus_read_lock() can be removed from apply_wqattrs_lock(). With wq_online_cpumask, attrs->__pod_cpumask doesn't need to be reused as temporary storage to calculate whether the pod has any online CPUs that @attrs wants, since @cpu_going_down is not in wq_online_cpumask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Add wq_online_cpumask  (Lai Jiangshan)
The new wq_online_cpumask mirrors cpu_online_mask except during hotplugging; specifically, it differs between the hotplugging stages of workqueue_offline_cpu() and workqueue_online_cpu(), during which the transitioning CPU is not represented in the mask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05  workqueue: Init rescuer's affinities as the wq's effective cpumask  (Lai Jiangshan)
Make it consistent with apply_wqattrs_commit(). Link: https://lore.kernel.org/lkml/20240203154334.791910-5-longman@redhat.com/ Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05  workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.  (Lai Jiangshan)
The PWQ allocation and WQ enlistment are not within the same lock-held critical section; therefore, their states can become out of sync when the user modifies the unbound mask or if CPU hotplug events occur in the interim since those operations only update the WQs that are already in the list. Make the PWQ allocation and WQ enlistment atomic. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05  workqueue: Move kthread_flush_worker() out of alloc_and_link_pwqs()  (Lai Jiangshan)
kthread_flush_worker() can't be called with wq_pool_mutex held. Prepare for moving wq_pool_mutex and cpu hotplug lock out of alloc_and_link_pwqs(). Cc: Zqiang <qiang.zhang1211@gmail.com> Link: https://lore.kernel.org/lkml/20230920060704.24981-1-qiang.zhang1211@gmail.com/ Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05  workqueue: Make rescuer initialization as the last step of the creation of a new wq  (Lai Jiangshan)
For early wq allocation, rescuer initialization is the last step of the creation of a new wq. Make the behavior the same for all allocations. Prepare for initializing the rescuer's affinities with the default pwq's affinities. Prepare for moving the whole workqueue initializing procedure into wq_pool_mutex and cpu hotplug locks. Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05  workqueue: Register sysfs after the whole creation of the new wq  (Lai Jiangshan)
workqueue creation includes adding it to the workqueue list. Prepare for moving the whole workqueue initializing procedure into wq_pool_mutex and cpu hotplug locks. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-02  workqueue: Simplify goto statement  (Lai Jiangshan)
Use a simple if-statement to replace the cumbersome goto-statement in workqueue_set_unbound_cpumask(). Cc: Waiman Long <longman@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-02  workqueue: Update cpumasks after only applying it successfully  (Lai Jiangshan)
Make workqueue_unbound_exclude_cpumask() and workqueue_set_unbound_cpumask() only update wq_isolated_cpumask and wq_requested_unbound_cpumask when workqueue_apply_unbound_cpumask() returns successfully. Fixes: fe28f631fa94("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask") Cc: Waiman Long <longman@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-25  workqueue: Improve scalability of workqueue watchdog touch  (Nicholas Piggin)
On a ~2000 CPU powerpc system, hard lockups have been observed in the workqueue code when stop_machine runs (in this case due to CPU hotplug). This is due to lots of CPUs spinning in multi_cpu_stop, calling touch_nmi_watchdog() which ends up calling wq_watchdog_touch(). wq_watchdog_touch() writes to the global variable wq_watchdog_touched, and that can find itself in the same cacheline as other important workqueue data, which slows down operations to the point of lockups. In the case of the following abridged trace, worker_pool_idr was in the hot line, causing the lockups to always appear at idr_find. watchdog: CPU 1125 self-detected hard LOCKUP @ idr_find Call Trace: get_work_pool __queue_work call_timer_fn run_timer_softirq __do_softirq do_softirq_own_stack irq_exit timer_interrupt decrementer_common_virt * interrupt: 900 (timer) at multi_cpu_stop multi_cpu_stop cpu_stopper_thread smpboot_thread_fn kthread Fix this by having wq_watchdog_touch() only write to the line if the last time a touch was recorded exceeds 1/4 of the watchdog threshold. Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
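A sketch of the throttling idea, under the assumption that it keys off the existing wq_watchdog_touched timestamp (simplified; the real wq_watchdog_touch() also deals with per-CPU state):

	unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
	unsigned long touched = READ_ONCE(wq_watchdog_touched);

	/* Only dirty the shared cacheline when the stamp is stale enough. */
	if (time_after(jiffies, touched + thresh / 4))
		WRITE_ONCE(wq_watchdog_touched, jiffies);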
2024-06-25  workqueue: wq_watchdog_touch is always called with valid CPU  (Nicholas Piggin)
Warn in the case it is called with cpu == -1. This does not appear to happen anywhere. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-21  workqueue: Remove useless pool->dying_workers  (Lai Jiangshan)
A dying worker is first moved from pool->workers to pool->dying_workers in set_worker_dying() and removed from pool->dying_workers in detach_dying_workers(). The whole procedure happens within the same locking context of wq_pool_attach_mutex. So pool->dying_workers is useless; just remove it, keep the dying worker in pool->workers after set_worker_dying(), and remove it in detach_dying_workers() with wq_pool_attach_mutex held. Cc: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-21  workqueue: Detach workers directly in idle_cull_fn()  (Lai Jiangshan)
The code to kick off the destruction of workers is now in a process context (idle_cull_fn()), and the detaching of a worker is not required to be inside the worker thread now, so just do the detaching directly in idle_cull_fn(). wake_dying_workers() is renamed to detach_dying_workers() and the unneeded wakeup in wake_dying_workers() is also removed. Cc: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-21  workqueue: Don't bind the rescuer in the last working cpu  (Lai Jiangshan)
This way, when the rescuer is woken up next time, it will not interrupt the last working CPU, which might be busy with other crucial work that has nothing to do with the rescuer's incoming work. Cc: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-21  workqueue: Reap workers via kthread_stop() and remove detach_completion  (Lai Jiangshan)
The code to kick off the destruction of workers is now in a process context (idle_cull_fn()), so kthread_stop() can be used in the process context to replace the work of pool->detach_completion. The wakeup in wake_dying_workers() is unneeded after this change, but it is harmless; just keep it here until the next patch renames wake_dying_workers(), rather than renaming it again and again. Cc: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-19  workqueue: Avoid nr_active manipulation in grabbing inactive items  (Lai Jiangshan)
Currently, try_to_grab_pending() activates the inactive item and subsequently treats it as though it were a standard activated item. This approach prevents duplicating handling logic for both active and inactive items, yet the premature activation of an inactive item triggers trace_workqueue_activate_work(), yielding an unintended user-space-visible side effect. And the unnecessary increment of nr_active, which is no longer a simple counter, followed by a counteracting decrement, is inefficient and complicates the code. Just remove the nr_active manipulation code in grabbing inactive items. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-10  workqueue: replace call_rcu by kfree_rcu for simple kmem_cache_free callback  (Julia Lawall)
Since SLOB was removed, it is not necessary to use call_rcu when the callback only performs kmem_cache_free. Use kfree_rcu() directly. The changes were done using the following Coccinelle semantic patch. This semantic patch is designed to ignore cases where the callback function is used in another way. // <smpl> @r@ expression e; local idexpression e2; identifier cb,f; position p; @@ ( call_rcu(...,e2) | call_rcu(&e->f,cb@p) ) @r1@ type T; identifier x,r.cb; @@ cb(...) { ( kmem_cache_free(...); | T x = ...; kmem_cache_free(...,x); | T x; x = ...; kmem_cache_free(...,x); ) } @s depends on r1@ position p != r.p; identifier r.cb; @@ cb@p @script:ocaml@ cb << r.cb; p << s.p; @@ Printf.eprintf "Other use of %s at %s:%d\n" cb (List.hd p).file (List.hd p).line @depends on r1 && !s@ expression e; identifier r.cb,f; position r.p; @@ - call_rcu(&e->f,cb@p) + kfree_rcu(e,f) @r1a depends on !s@ type T; identifier x,r.cb; @@ - cb(...) { ( - kmem_cache_free(...); | - T x = ...; - kmem_cache_free(...,x); | - T x; - x = ...; - kmem_cache_free(...,x); ) - } // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
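A hypothetical before/after of the transformation (the struct, field, and cache names are made up for illustration):

struct foo {
	int val;
	struct rcu_head rcu;
};

/* Before: an RCU callback whose only job is to free the object. */
static void foo_free_rcu(struct rcu_head *head)
{
	kmem_cache_free(foo_cache, container_of(head, struct foo, rcu));
}
/* ...  call_rcu(&f->rcu, foo_free_rcu);  */

/* After: the callback goes away entirely. */
/* ...  kfree_rcu(f, rcu);  */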
2024-06-07  workqueue: Clean code in alloc_and_link_pwqs()  (Wenchao Hao)
wq->flags does not change, so it's not necessary to check whether WQ_BH is set inside the for_each_possible_cpu() loop; move the definition and assignment of pools out of the loop to simplify the code. Signed-off-by: Wenchao Hao <haowenchao22@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-20  workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string  (Tejun Heo)
Currently, worker ID formatting is open coded in create_worker(), init_rescuer() and worker_thread() (for the %WORKER_DIE case). The formatted ID is saved into task->comm and wq_worker_comm() uses it as the base name to append extra information to when generating the name to be shown to userspace. However, TASK_COMM_LEN is only 16, leading to badly truncated names for rescuers. For example, the rescuer for the inet_frag_wq workqueue becomes: $ ps -ef | grep '[k]worker/R-inet' root 483 2 0 Apr26 ? 00:00:00 [kworker/R-inet_] Even for non-rescue workers, it's easy to run over 15 characters on moderately large machines. Fix it by consolidating worker ID formatting into a new helper format_worker_id() and calling it from wq_worker_comm() to obtain the untruncated worker ID string. $ ps -ef | grep '[k]worker/R-inet' root 60 2 0 12:10 ? 00:00:00 [kworker/R-inet_frag_wq] Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Jan Engelhardt <jengelh@inai.de> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-15  Merge branch 'for-6.10' into test-merge-for-6.10  (Tejun Heo)
2024-05-13Merge tag 'sched-core-2024-05-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Add cpufreq pressure feedback for the scheduler - Rework misfit load-balancing wrt affinity restrictions - Clean up and simplify the code around ::overutilized and ::overload access. - Simplify sched_balance_newidle() - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES handling that changed the output. - Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch() - Reorganize, clean up and unify most of the higher level scheduler balancing function names around the sched_balance_*() prefix - Simplify the balancing flag code (sched_balance_running) - Miscellaneous cleanups & fixes * tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) sched/pelt: Remove shift of thermal clock sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure() thermal/cpufreq: Remove arch_update_thermal_pressure() sched/cpufreq: Take cpufreq feedback into account cpufreq: Add a cpufreq pressure feedback for the scheduler sched/fair: Fix update of rd->sg_overutilized sched/vtime: Do not include <asm/vtime.h> header s390/irq,nmi: Include <asm/vtime.h> header directly s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover sched/vtime: Get rid of generic vtime_task_switch() implementation sched/vtime: Remove confusing arch_vtime_task_switch() declaration sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized() sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded() sched/fair: Rename root_domain::overload to ::overloaded sched/fair: Use helper functions to access root_domain::overload sched/fair: Check root_domain::overload value before update sched/fair: Combine EAS check with root_domain::overutilized access sched/fair: Simplify the continue_balancing logic in sched_balance_newidle() ...
2024-04-24  workqueue: Fix divide error in wq_update_node_max_active()  (Lai Jiangshan)
Yue Sun and xingwei lee reported a divide error bug in wq_update_node_max_active(): divide error: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.9.0-rc5 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:wq_update_node_max_active+0x369/0x6b0 kernel/workqueue.c:1605 Code: 24 bf 00 00 00 80 44 89 fe e8 83 27 33 00 41 83 fc ff 75 0d 41 81 ff 00 00 00 80 0f 84 68 01 00 00 e8 fb 22 33 00 44 89 f8 99 <41> f7 fc 89 c5 89 c7 44 89 ee e8 a8 24 33 00 89 ef 8b 5c 24 04 89 RSP: 0018:ffffc9000018fbb0 EFLAGS: 00010293 RAX: 00000000000000ff RBX: 0000000000000001 RCX: ffff888100ada500 RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000080000000 RBP: 0000000000000001 R08: ffffffff815b1fcd R09: 1ffff1100364ad72 R10: dffffc0000000000 R11: ffffed100364ad73 R12: 0000000000000000 R13: 0000000000000100 R14: 0000000000000000 R15: 00000000000000ff FS: 0000000000000000(0000) GS:ffff888135c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb8c06ca6f8 CR3: 000000010d6c6000 CR4: 0000000000750ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> workqueue_offline_cpu+0x56f/0x600 kernel/workqueue.c:6525 cpuhp_invoke_callback+0x4e1/0x870 kernel/cpu.c:194 cpuhp_thread_fun+0x411/0x7d0 kernel/cpu.c:1092 smpboot_thread_fn+0x544/0xa10 kernel/smpboot.c:164 kthread+0x2ed/0x390 kernel/kthread.c:388 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- After analysis, it happens when all of the CPUs in a workqueue's affinity get offine. The problem can be easily reproduced by: # echo 8 > /sys/devices/virtual/workqueue/<any-wq-name>/cpumask # echo 0 > /sys/devices/system/cpu/cpu3/online Use the default max_actives for nodes when all of the CPUs in the workqueue's affinity get offline to fix the problem. Reported-by: Yue Sun <samsun1006219@gmail.com> Reported-by: xingwei lee <xrivendell7@gmail.com> Link: https://lore.kernel.org/lkml/CAEkJfYPGS1_4JqvpSo0=FM0S1ytB8CEbyreLTtWpR900dUZymw@mail.gmail.com/ Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues") Cc: stable@vger.kernel.org Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-23  workqueue: The default node_nr_active should have its max set to max_active  (Tejun Heo)
The default nna (node_nr_active) is used when the pool isn't tied to a specific NUMA node. This can happen in the following cases: 1. On NUMA, if per-node pwq init fails and the fallback pwq is used. 2. On NUMA, if a pool is configured to span multiple nodes. 3. On single node setups. 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues") set the default nna->max to min_active because only #1 was being considered. For #2 and #3, using min_active means that the max concurrency in normal operation is pushed down to min_active, which is currently 8, which can obviously lead to performance issues. The exact value nna->max is set to doesn't really matter. #2 can only happen if the workqueue is intentionally configured to ignore NUMA boundaries and there's no good way to distribute max_active in this case. #3 is the default behavior on single node machines. Let's set the default nna->max to max_active. This fixes the artificially lowered concurrency problem on single node machines and shouldn't hurt anything for other cases. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues") Link: https://lore.kernel.org/dm-devel/20240410084531.2134621-1-shinichiro.kawasaki@wdc.com/ Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-23  workqueue: Fix selection of wake_cpu in kick_pool()  (Sven Schnelle)
With cpu_possible_mask=0-63 and cpu_online_mask=0-7 the following kernel oops was observed: smp: Bringing up secondary CPUs ... smp: Brought up 1 node, 8 CPUs Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 0000000000000000 TEID: 0000000000000803 [..] Call Trace: arch_vcpu_is_preempted+0x12/0x80 select_idle_sibling+0x42/0x560 select_task_rq_fair+0x29a/0x3b0 try_to_wake_up+0x38e/0x6e0 kick_pool+0xa4/0x198 __queue_work.part.0+0x2bc/0x3a8 call_timer_fn+0x36/0x160 __run_timers+0x1e2/0x328 __run_timer_base+0x5a/0x88 run_timer_softirq+0x40/0x78 __do_softirq+0x118/0x388 irq_exit_rcu+0xc0/0xd8 do_ext_irq+0xae/0x168 ext_int_handler+0xbe/0xf0 psw_idle_exit+0x0/0xc default_idle_call+0x3c/0x110 do_idle+0xd4/0x158 cpu_startup_entry+0x40/0x48 rest_init+0xc6/0xc8 start_kernel+0x3c4/0x5e0 startup_continue+0x3c/0x50 The crash is caused by calling arch_vcpu_is_preempted() for an offline CPU. To avoid this, select the cpu with cpumask_any_and_distribute() to mask __pod_cpumask with cpu_online_mask. In case no cpu is left in the pool, skip the assignment. tj: This doesn't fully fix the bug as CPUs can still go down between picking the target CPU and the wake call. Fixing that likely requires adding cpu_online() test to either the sched or s390 arch code. However, regardless of how that is fixed, workqueue shouldn't be picking a CPU which isn't online as that would result in unpredictable and worse behavior. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Fixes: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues") Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Tejun Heo <tj@kernel.org>
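A simplified sketch of the selection logic described above (reduced to its essentials, not the full kick_pool()):

	unsigned int cpu;

	/* Pick a CPU that is both in the pod's cpumask and currently online. */
	cpu = cpumask_any_and_distribute(pool->attrs->__pod_cpumask,
					 cpu_online_mask);
	if (cpu < nr_cpu_ids)
		p->wake_cpu = cpu;
	/* else: no online CPU left in the pod, leave wake_cpu untouched */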
2024-04-08  workqueue: Add destroy_work_on_stack() in workqueue_softirq_dead()  (Zqiang)
This commit adds the missing destroy_work_on_stack() operation for dead_work.work. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
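Together with the INIT_WORK_ONSTACK commit further down, this follows the generic on-stack work pattern sketched here (function and variable names are illustrative, not the actual workqueue_softirq_dead() body):

static void drain_fn(struct work_struct *work)
{
	/* ... runs in the BH worker ... */
}

static void drain_dead_softirq(struct workqueue_struct *bh_wq)
{
	struct work_struct dead_work;

	INIT_WORK_ONSTACK(&dead_work, drain_fn);   /* debugobjects-aware init */
	queue_work(bh_wq, &dead_work);
	flush_work(&dead_work);                    /* must finish before the frame goes away */
	destroy_work_on_stack(&dead_work);         /* the step this commit adds */
}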
2024-03-25  workqueue: Cleanup subsys attribute registration  (Dan Williams)
While reviewing users of subsys_virtual_register() I noticed that wq_sysfs_init() ignores the @groups argument. This looks like a historical artifact as the original wq_subsys only had one attribute to register. On the way to building up an @groups argument to pass to subsys_virtual_register() a few more cleanups fell out: * Use DEVICE_ATTR_RO() and DEVICE_ATTR_RW() for cpumask_{isolated,requested} and cpumask respectively. Rename the @show and @store methods accordingly. * Co-locate the attribute definition with the methods. This required moving wq_unbound_cpumask_show down next to wq_unbound_cpumask_store (renamed to cpumask_show() and cpumask_store()) * Use ATTRIBUTE_GROUPS() to skip some boilerplate declarations Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
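A generic sketch of the attribute pattern the cleanup converges on (names are illustrative, not the exact workqueue sysfs code):

static ssize_t cpumask_show(struct device *dev, struct device_attribute *attr,
			    char *buf)
{
	return sysfs_emit(buf, "%s\n", "...");
}

static ssize_t cpumask_store(struct device *dev, struct device_attribute *attr,
			     const char *buf, size_t count)
{
	return count;
}
static DEVICE_ATTR_RW(cpumask);	/* picks up cpumask_show()/cpumask_store() by name */

static struct attribute *wq_sysfs_attrs[] = {
	&dev_attr_cpumask.attr,
	NULL,
};
ATTRIBUTE_GROUPS(wq_sysfs);	/* generates the group/groups boilerplate */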
2024-03-25  workqueue: Use list_last_entry() to get the last idle worker  (Lai Jiangshan)
It is clearer than the open-coded version. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
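An illustrative before/after (the field names follow the idle list in workqueue.c, but the hunk itself is a sketch):

	/* open coded */
	worker = list_entry(pool->idle_list.prev, struct worker, entry);

	/* with the helper */
	worker = list_last_entry(&pool->idle_list, struct worker, entry);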
2024-03-25  workqueue: Move attrs->cpumask out of worker_pool's properties when attrs->affn_strict  (Lai Jiangshan)
This allows more pools to be shared when attrs->affn_strict is set. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-03-25  workqueue: Use INIT_WORK_ONSTACK in workqueue_softirq_dead()  (Lai Jiangshan)
dead_work is a stack variable. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-03-25  workqueue: Allow cancel_work_sync() and disable_work() from atomic contexts on BH work items  (Tejun Heo)
Now that work_grab_pending() can always grab the PENDING bit without sleeping, the only thing that prevents allowing cancel_work_sync() of a BH work item from an atomic context is the flushing of the in-flight instance. When we're flushing a BH work item for cancel_work_sync(), we know that the work item is not queued and must be executing in a BH context, which means that it's safe to busy-wait for its completion from a non-hardirq atomic context. This patch updates __flush_work() so that it busy-waits when flushing a BH work item for cancel_work_sync(). might_sleep() is pushed from start_flush_work() to its callers - when operating on a BH work item, __cancel_work_sync() now enforces !in_hardirq() instead of might_sleep(). This allows cancel_work_sync() and disable_work() to be called from non-hardirq atomic contexts on BH work items. v3: In __flush_work(), test WORK_OFFQ_BH to tell whether a work item being canceled can be busy-waited instead of making start_flush_work() return the pool. (Lai) v2: Lai pointed out that __flush_work() was accessing pool->flags outside the RCU critical section protecting the pool pointer. Fix it by testing and remembering the result inside the RCU critical section. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25  workqueue: Remember whether a work item was on a BH workqueue  (Tejun Heo)
Add an off-queue flag, WORK_OFFQ_BH, that indicates whether the last workqueue the work item was on was a BH one. This will be used to test whether a work item is BH in cancel_sync path to implement atomic cancel_sync'ing for BH work items. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25  workqueue: Remove WORK_OFFQ_CANCELING  (Tejun Heo)
cancel[_delayed]_work_sync() guarantees that it can shut down self-requeueing work items. To achieve that, it grabs and then holds the WORK_STRUCT_PENDING bit set while flushing the currently executing instance. As the PENDING bit is set, all queueing attempts including the self-requeueing ones fail and once the currently executing instance is flushed, the work item should be idle as long as someone else isn't actively queueing it. This means that the cancel_work_sync path may hold the PENDING bit set while flushing the target work item. This isn't a problem for the queueing path - it can just fail which is the desired effect. It doesn't affect flush. It doesn't matter to cancel_work either as it can just report that the work item has successfully canceled. However, if there's another cancel_work_sync attempt on the work item, it can't simply fail or report success and that would breach the guarantee that it should provide. cancel_work_sync has to wait for and grab that PENDING bit and go through the motions. WORK_OFFQ_CANCELING and wq_cancel_waitq are what implement this cancel_work_sync to cancel_work_sync wait mechanism. When a work item is being canceled, WORK_OFFQ_CANCELING is also set on it and other cancel_work_sync attempts wait on the bit to be cleared using the wait queue. While this works, it's an isolated wart which doesn't jive with the rest of the flush and cancel mechanisms and forces enable_work() and disable_work() to require a sleepable context, which hampers their usability. Now that a work item can be disabled, we can use that to block queueing while cancel_work_sync is in progress. Instead of holding the PENDING bit, it can temporarily disable the work item, flush and then re-enable it, as that'd achieve the same end result of blocking queueing while canceling and thus enable canceling of self-requeueing work items. - WORK_OFFQ_CANCELING and the surrounding mechanisms are removed. - work_grab_pending() is now simpler, no longer has to wait for a blocking operation and thus can be called from any context. - With work_grab_pending() simplified, no need to use try_to_grab_pending() directly. All users are converted to use work_grab_pending(). - __cancel_work_sync() is updated to __cancel_work() with WORK_CANCEL_DISABLE to cancel and plug racing queueing attempts. It then flushes and re-enables the work item if necessary. - These changes allow disable_work() and enable_work() to be called from any context. v2: Lai pointed out that mod_delayed_work_on() needs to check the disable count before queueing the delayed work item. Added clear_pending_if_disabled() call. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25  workqueue: Implement disable/enable for (delayed) work items  (Tejun Heo)
While (delayed) work items could be flushed and canceled, there was no way to prevent them from being queued in the future. While this didn't lead to functional deficiencies, it sometimes required a bit more effort from the workqueue users to e.g. sequence shutdown steps with more care. Workqueue is currently in the process of replacing tasklet, which does support disabling and enabling. The feature is used relatively widely to, for example, temporarily suppress the main path while a control plane operation (reset or config change) is in progress. To enable easy conversion of tasklet users, and as it seems like an inherently useful feature, this patch implements disabling and enabling of work items. - A work item carries a 16bit disable count in work->data while not queued. The access to the count is synchronized by the PENDING bit like all other parts of work->data. - If the count is non-zero, the work item cannot be queued. Any attempt to queue the work item fails and returns %false. - disable_work[_sync](), enable_work(), disable_delayed_work[_sync]() and enable_delayed_work() are added. v3: enable_work() was using local_irq_enable() instead of local_irq_restore() to undo IRQ-disable by work_grab_pending(). This is awkward now and will become incorrect as enable_work() will later be used from IRQ context too. (Lai) v2: Lai noticed that queue_work_node() wasn't checking the disable count. Fixed. queue_rcu_work() is updated to trigger a warning if the inner work item is disabled. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
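A hedged usage sketch of the new interface from a driver's point of view (the struct and function names are made up; only disable_work_sync(), enable_work() and queue_work() are taken from the interfaces listed above):

struct my_dev {
	struct work_struct io_work;
};

static void my_dev_reconfigure(struct my_dev *d)
{
	/* Flush the current instance and block any further queueing. */
	disable_work_sync(&d->io_work);

	/* ... reconfigure state that io_work also touches ... */

	/* Drop the disable count; queueing is possible again. */
	enable_work(&d->io_work);
	queue_work(system_wq, &d->io_work);
}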
2024-03-25  workqueue: Preserve OFFQ bits in cancel[_sync] paths  (Tejun Heo)
The cancel[_sync] paths acquire and release WORK_STRUCT_PENDING, and manipulate WORK_OFFQ_CANCELING. However, they assume that all the OFFQ bit values except for the pool ID are statically known and don't preserve them, which is not wrong in the current code as the pool ID and CANCELING are the only information carried. However, the planned disable/enable support will add more fields and need them to be preserved. This patch updates work data handling so that only the bits which need updating are updated. - struct work_offq_data is added along with work_offqd_unpack() and work_offqd_pack_flags() to help manipulating multiple fields contained in work->data. Note that the helpers look a bit silly right now as there isn't that much to pack. The next patch will add more. - mark_work_canceling() which is used only by __cancel_work_sync() is replaced by open-coded usage of work_offq_data and set_work_pool_and_keep_pending() in __cancel_work_sync(). - __cancel_work[_sync]() uses offq_data helpers to preserve other OFFQ bits when clearing WORK_STRUCT_PENDING and WORK_OFFQ_CANCELING at the end. - This removes all users of get_work_pool_id() which is dropped. Note that get_work_pool_id() could handle both WORK_STRUCT_PWQ and !WORK_STRUCT_PWQ cases; however, it was only being called after try_to_grab_pending() succeeded, in which case WORK_STRUCT_PWQ is never set and thus it's safe to use work_offqd_unpack() instead. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25  Merge tag 'v6.9-rc1' into sched/core, to pick up fixes and to refresh the branch  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-03-21Merge tag 'driver-core-6.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the "big" set of driver core and kernfs changes for 6.9-rc1. Nothing all that crazy here, just some good updates that include: - automatic attribute group hiding from Dan Williams (he fixed up my horrible attempt at doing this.) - kobject lock contention fixes from Eric Dumazet - driver core cleanups from Andy - kernfs rcu work from Tejun - fw_devlink changes to resolve some reported issues - other minor changes, all details in the shortlog All of these have been in linux-next for a long time with no reported issues" * tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (28 commits) device: core: Log warning for devices pending deferred probe on timeout driver: core: Use dev_* instead of pr_* so device metadata is added driver: core: Log probe failure as error and with device metadata of: property: fw_devlink: Add support for "post-init-providers" property driver core: Add FWLINK_FLAG_IGNORE to completely ignore a fwnode link driver core: Adds flags param to fwnode_link_add() debugfs: fix wait/cancellation handling during remove device property: Don't use "proxy" headers device property: Move enum dev_dma_attr to fwnode.h driver core: Move fw_devlink stuff to where it belongs driver core: Drop unneeded 'extern' keyword in fwnode.h firmware_loader: Suppress warning on FW_OPT_NO_WARN flag sysfs:Addresses documentation in sysfs_merge_group and sysfs_unmerge_group. firmware_loader: introduce __free() cleanup hanler platform-msi: Remove usage of the deprecated ida_simple_xx() API sysfs: Introduce DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE() sysfs: Document new "group visible" helpers sysfs: Fix crash on empty group attributes array sysfs: Introduce a mechanism to hide static attribute_groups sysfs: Introduce a mechanism to hide static attribute_groups ...
2024-03-12  sched/balancing: Rename scheduler_tick() => sched_tick()  (Ingo Molnar)
- Standardize on prefixing scheduler-internal functions defined in <linux/sched.h> with sched_*() prefix. scheduler_tick() was the only function using the scheduler_ prefix. Harmonize it. - The other reason to rename it is the NOHZ scheduler tick handling functions are already named sched_tick_*(). Make the 'git grep sched_tick' more meaningful. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20240308111819.1101550-3-mingo@kernel.org