path: root/kernel/workqueue.c
2023-08-07workqueue: Modularize wq_pod_type initializationTejun Heo
While wq_pod_type[] can now group CPUs in any arbitrary way, the WQ_AFFN_NUMA init is hard-coded into workqueue_init_topology(). This patch modularizes the init path by introducing init_pod_type(), which takes as an argument a callback that determines whether two CPUs should share a pod.

init_pod_type() first scans the CPU combinations, testing for sharing to assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once ->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with cpus_share_numa(), which tests whether two CPUs belong to the same NUMA node.

This patch may change the pod ID assigned to each NUMA node but that shouldn't cause any behavior changes as the NUMA node to use for allocations is tracked separately in pod_type->pod_node[]. This makes adding new affinity types pretty easy.

Signed-off-by: Tejun Heo <tj@kernel.org>
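A minimal sketch of the pod-grouping pass described above, assuming the wq_pod_type fields named in these messages (nr_pods, cpu_pod[]); it is illustrative rather than the exact mainline code:

    /* callback: do two CPUs belong in the same pod? */
    static bool cpus_share_numa(int cpu0, int cpu1)
    {
            return cpu_to_node(cpu0) == cpu_to_node(cpu1);
    }

    static void init_pod_type(struct wq_pod_type *pt,
                              bool (*cpus_share_pod)(int, int))
    {
            int cur, pre;

            pt->nr_pods = 0;
            /* assign consecutive pod IDs by scanning CPU pairs */
            for_each_possible_cpu(cur) {
                    for_each_possible_cpu(pre) {
                            if (pre >= cur) {
                                    /* no earlier CPU shares a pod: start a new pod */
                                    pt->cpu_pod[cur] = pt->nr_pods++;
                                    break;
                            }
                            if (cpus_share_pod(cur, pre)) {
                                    pt->cpu_pod[cur] = pt->cpu_pod[pre];
                                    break;
                            }
                    }
            }
            /* ->pod_cpus[] and ->pod_node[] are then filled from ->cpu_pod[] */
    }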
2023-08-07workqueue: Generalize unbound CPU podsTejun Heo
While renamed to pod, the code still assumes that the pods are defined by NUMA boundaries. Let's generalize it:

* workqueue_attrs->affn_scope is added. Each enum represents the type of boundaries that define the pods. There are currently two scopes - WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before - one pod per NUMA node. The latter defines one global pod across the whole system.

* struct wq_pod_type is added which describes how pods are configured for each affinity scope. For each pod, it lists the member CPUs and the preferred NUMA node for memory allocations. The reverse mapping from CPU to pod is also available.

* wq_pod_enabled is dropped. Pod is now always enabled. The previously disabled behavior is now implemented through WQ_AFFN_SYSTEM.

* get_unbound_pool() wants to determine the NUMA node to allocate memory from for the new pool. The variables are renamed from node to pod but the logic still assumes they're one and the same. Clearly distinguish them - walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's NUMA node.

* wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA node. Take @cpu instead and determine the cpumask to use from the pod_type matching @attrs.

* apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of NULL so that it can indicate -EINVAL on invalid affinity scopes.

This patch allows CPUs to be grouped into pods however desired per type. While this patch causes some internal behavior changes, nothing material should change for workqueue users.

v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is WQ_AFFN_NR_TYPES which indicates that the function is called with a worker_pool's attrs instead of a workqueue's.

Signed-off-by: Tejun Heo <tj@kernel.org>
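The rough shape implied by the description above; field names are inferred from these messages rather than copied from the tree:

    enum wq_affn_scope {
            WQ_AFFN_NUMA,           /* one pod per NUMA node */
            WQ_AFFN_SYSTEM,         /* one pod spanning the whole system */
            WQ_AFFN_NR_TYPES,
    };

    /* how CPUs are grouped for one affinity scope */
    struct wq_pod_type {
            int             nr_pods;        /* number of pods */
            cpumask_var_t   *pod_cpus;      /* pod -> member CPUs */
            int             *pod_node;      /* pod -> preferred NUMA node */
            int             *cpu_pod;       /* reverse map: cpu -> pod */
    };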
2023-08-07workqueue: Factor out clearing of workqueue-only attrs fieldsTejun Heo
workqueue_attrs can be used for both workqueues and worker_pools. However, some fields, currently only ->ordered, only apply to workqueues and should be cleared to the default / invalid values. Currently, an unbound workqueue explicitly clears attrs->ordered in get_unbound_pool() after copying the source workqueue attrs, while per-cpu workqueues rely on the fact that zeroing on allocation gives us the desired default value for pool->attrs->ordered. This is fragile. Let's add wqattrs_clear_for_pool() which clears attrs->ordered and is called from both init_worker_pool() and get_unbound_pool(). This will ease adding more workqueue-only attrs fields. In get_unbound_pool(), pool->node initialization is moved upwards for readability. This shouldn't cause any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()Tejun Heo
For an unbound pool, multiple cpumasks are involved.

U: The user-specified cpumask (may be filtered with cpu_possible_mask).

A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering leaves no CPU, wq_unbound_cpumask is used.

P: Per-pod subsets of #A.

wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and wq->cpu_pwq[CPU]->pool->attrs->cpumask #P.

wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To calculate the new #P for each workqueue, it needs to call wq_calc_pod_cpumask() with @attrs that contains #A. Currently, wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with wq->dfl_pwq->pool->attrs. This is rather fragile because we're calling wq_calc_pod_cpumask() with the @attrs of a worker_pool rather than the workqueue's actual attrs when what we want to calculate is the workqueue's cpumask on the pod. While this works fine currently, future changes will add fields which are used differently between workqueues and worker_pools and this subtlety will bite us.

This patch factors out the #U -> #A calculation from apply_wqattrs_prepare() into wqattrs_actualize_cpumask() and updates wq_update_pod() to copy wq->unbound_attrs and use the new helper to obtain #A freshly instead of abusing wq->dfl_pwq->pool->attrs.

This shouldn't cause any behavior changes in the current code.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com
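A hedged sketch of the #U -> #A step; the helper name comes from the message, while the exact signature is an assumption:

    static void wqattrs_actualize_cpumask(struct workqueue_attrs *attrs,
                                          const cpumask_t *unbound_cpumask)
    {
            /* filter the user-specified cpumask with wq_unbound_cpumask ... */
            cpumask_and(attrs->cpumask, attrs->cpumask, unbound_cpumask);

            /* ... and fall back to wq_unbound_cpumask if nothing is left */
            if (unlikely(cpumask_empty(attrs->cpumask)))
                    cpumask_copy(attrs->cpumask, unbound_cpumask);
    }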
2023-08-07workqueue: Initialize unbound CPU pods later in the bootTejun Heo
During boot, to initialize unbound CPU pods, wq_pod_init() was called from workqueue_init(). This is early enough for NUMA nodes to be set up but before SMP is brought up and CPU topology information is populated. Workqueue is in the process of improving CPU locality for unbound workqueues and will need access to topology information during pod init. This adds a new init function workqueue_init_topology() which is called after CPU topology information is available and replaces wq_pod_init(). As unbound CPU pods are now initialized after workqueues are activated, we need to revisit the workqueues to apply the pod configuration. Workqueues which are created before workqueue_init_topology() are set up so that they always use the default worker pool. After pods are set up in workqueue_init_topology(), wq_update_pod() is called on all existing workqueues to update the pool associations accordingly. Note that wq_update_pod_attrs_buf allocation is moved to workqueue_init_early(). This isn't necessary right now but enables further generalization of pod handling in the future. This patch changes the initialization sequence but the end result should be the same. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Move wq_pod_init() below workqueue_init()Tejun Heo
wq_pod_init() is called from workqueue_init() and responsible for initializing unbound CPU pods according to NUMA node. Workqueue is in the process of improving affinity awareness and wants to use other topology information to initialize unbound CPU pods; however, unlike NUMA nodes, other topology information isn't yet available in workqueue_init(). The next patch will introduce a later stage init function for workqueue which will be responsible for initializing unbound CPU pods. Relocate wq_pod_init() below workqueue_init() where the new init function is going to be located so that the diff can show the content differences. Just a relocation. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Rename NUMA related names to use pod insteadTejun Heo
Workqueue is in the process of improving CPU affinity awareness. It will become more flexible and won't be tied to NUMA node boundaries. This patch renames all NUMA related names in workqueue.c to use "pod" instead.

While "pod" isn't a very common term, it's short and captures the grouping of CPUs well enough. These names are only going to be used within the workqueue implementation proper, so the specific naming doesn't matter that much.

* wq_numa_possible_cpumask -> wq_pod_cpus
* wq_numa_enabled -> wq_pod_enabled
* wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf
* workqueue_select_cpu_near -> select_numa_node_cpu

  This rename is different from the others. The function is only used by queue_work_node() and specifically tries to find a CPU in the specified NUMA node. As workqueue affinity will become more flexible and untied from NUMA, this function's name should specifically describe that it's for NUMA.

* wq_calc_node_cpumask -> wq_calc_pod_cpumask
* wq_update_unbound_numa -> wq_update_pod
* wq_numa_init -> wq_pod_init
* node -> pod in local variables

Only renames. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Rename workqueue_attrs->no_numa to ->orderedTejun Heo
With the recent removal of NUMA related module param and sysfs knob, workqueue_attrs->no_numa is now only used to implement ordered workqueues. Let's rename the field so that it's less confusing especially with the planned CPU affinity awareness improvements. Just a rename. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Make unbound workqueues to use per-cpu pool_workqueuesTejun Heo
A pwq (pool_workqueue) represents an association between a workqueue and a worker_pool. When a work item is queued, the workqueue selects the pwq to use, which in turn determines the pool, and queues the work item to the pool through the pwq. pwq is also what implements the maximum concurrency limit - @max_active.

As a per-cpu workqueue should be associated with a different worker_pool on each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq. However, unbound workqueues were sharing a pwq within each NUMA node by default. The sharing has several downsides:

* Because @max_active is per-pwq, the meaning of @max_active changes depending on the machine configuration and whether workqueue NUMA locality support is enabled.

* Makes per-cpu and unbound code deviate.

* Gets in the way of making workqueue CPU locality awareness more flexible.

This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu workqueues do by making the following changes:

* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound workqueues.

* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs the specified pwq to the target CPU's wq->cpu_pwq.

* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq. This makes the return value of wq_calc_node_cpumask() unnecessary. It now returns void.

* @max_active now means the same thing for both per-cpu and unbound workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer used in the workqueue implementation and will be removed later.

* All unbound pwq operations which used to be per-numa-node are now per-cpu.

For most unbound workqueue users, this shouldn't cause noticeable changes. Work item issue and completion will be slightly faster, flush_workqueue() would become a bit more expensive, and the total concurrency limit would likely become higher. All @max_active==1 use cases are currently being audited for conversion into alloc_ordered_workqueue() and they shouldn't be affected once the audit and conversion is complete.

One area where the behavior change may be more noticeable is workqueue_congested() as the reported congestion state is now per CPU instead of per NUMA node. There are only two users of this interface - drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are cc'd. Input on the behavior change would be very much appreciated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-07workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplugTejun Heo
When a CPU went online or offline, wq_update_unbound_numa() was called only on the CPU which was going up or down. This worked fine because all CPUs on the same NUMA node share the same pool_workqueue slot - one CPU updating it updates it for everyone in the node.

However, future changes will make each CPU use a separate pool_workqueue even when they're sharing the same worker_pool, which requires updating pool_workqueue's for all CPUs which may be sharing the same pool_workqueue on hotplug.

To accommodate the planned changes, this patch updates workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for all CPUs sharing the same NUMA node as the CPU going up or down. In the current code, the second and later calls would be noops and there shouldn't be any behavior changes.

* As wq_update_unbound_numa() is now called on multiple CPUs for each hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument is added. The former indicates the CPU being hot[un]plugged and the latter the CPU whose pool_workqueue is being updated.

* In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency with the new @hotplug_cpu.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Make per-cpu pool_workqueues allocated and released like unbound onesTejun Heo
Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly through a per-cpu allocation and thus, unlike unbound workqueues, not reference counted. This difference in lifetime management between the two types is a bit confusing.

Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which isn't suitable for the planned CPU locality related improvements. The plan is to unify pwq handling across per-cpu and unbound workqueues so that they're always accessed through wq->cpu_pwq.

In preparation, this patch makes per-cpu pwq's be allocated, reference counted and released the same way as unbound pwq's. wq->cpu_pwq now holds pointers to pwq's instead of containing them directly.

pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now also used for per-cpu work items.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Use a kthread_worker to release pool_workqueuesTejun Heo
The pool_workqueue release path is currently bounced to system_wq; however, this is a bit tricky because this bouncing occurs while holding a pool lock and thus has the risk of causing an A-A deadlock. This is currently addressed by the fact that only unbound workqueues use this bouncing path and system_wq is a per-cpu workqueue.

While this works, it's brittle and requires a work-around like setting the lockdep subclass for the lock of unbound pools. Besides, future changes will use the bouncing path for per-cpu workqueues too, making the current approach unusable.

Let's just use a dedicated kthread_worker to untangle the dependency. This is just one more kthread for all workqueues and makes the pwq release logic simpler and more robust.

Signed-off-by: Tejun Heo <tj@kernel.org>
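A sketch of the kthread_worker API this leans on; the worker name and the assumption that the pwq's release_work is a kthread_work are illustrative, not taken from the tree:

    /* created once at init (name is illustrative) */
    static struct kthread_worker *pwq_release_worker;

    static int __init pwq_release_init(void)
    {
            pwq_release_worker = kthread_create_worker(0, "pool_workqueue_release");
            return IS_ERR(pwq_release_worker) ? PTR_ERR(pwq_release_worker) : 0;
    }

    /* on the final put, bounce the release to the dedicated kthread
     * instead of system_wq (assumes pwq->release_work is a kthread_work) */
    static void schedule_pwq_release(struct pool_workqueue *pwq)
    {
            kthread_queue_work(pwq_release_worker, &pwq->release_work);
    }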
2023-08-07workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numaTejun Heo
Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA specific knobs won't make sense anymore. Remove them. Also, the pool_ids knob was used for debugging and not really meaningful given that there is no visibility into the pools associated with those IDs. Remove it too. A future patch will improve overall visibility. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Relocate worker and work management functionsTejun Heo
Collect first_idle_worker(), worker_enter/leave_idle(), find_worker_executing_work(), move_linked_works() and wake_up_worker() into one place. These functions will later be used to implement higher level worker management logic. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Rename wq->cpu_pwqs to wq->cpu_pwqTejun Heo
wq->cpu_pwqs is a percpu variable carrying one pointer to a pool_workqueue. The field name being plural is unusual and confusing. Rename it to singular.

This patch doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Not all work insertion needs to wake up a workerTejun Heo
insert_work() always tried to wake up a worker; however, the only time it needs to try to wake up a worker is when a new active work item is queued. When a work item goes on the inactive list or a flush work item is being queued, there's no reason to try to wake up a worker.

This patch moves the worker wakeup logic out of insert_work() and places it in the active new work item queueing path in __queue_work().

While at it:

* __queue_work() is dereferencing pwq->pool repeatedly. Add a local variable, pool.

* Every caller of insert_work() calls debug_work_activate(). Consolidate the invocations into insert_work().

* In __queue_work(), the pool->watchdog_ts update is relocated slightly. This is to better accommodate future changes.

This makes wakeups more precise and will help the planned change to assign work items to workers before waking them up. No behavior changes intended.

v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
2023-08-07workqueue: Cleanups around process_scheduled_works()Tejun Heo
* Drop the trivial optimization in worker_thread() where it bypasses calling process_scheduled_works() if the first work item isn't linked. This is a mostly pointless micro optimization and gets in the way of improving the work processing path.

* Consolidate pool->watchdog_ts updates in the two callers into process_scheduled_works().

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Drop the special locking rule for worker->flags and worker_pool->flagsTejun Heo
worker->flags used to be accessed from scheduler hooks without grabbing pool->lock for concurrency management. This is no longer true since 6d25be5782e4 ("sched/core, workqueues: Distangle worker accounting from rq lock"). Also, it's unclear why worker_pool->flags was using the "X" rule. All relevant users are accessing it under the pool lock.

Let's drop the special "X" rule and use the "L" rule for these flag fields instead. While at it, replace the CONTEXT comment with lockdep_assert_held().

This allows worker_set/clr_flags() to be used from a context which isn't the worker itself. This will be used later to implement assigning work items to workers before waking them up so that workqueue can have better control over which worker executes which work item on which CPU.

The only actual changes are sanity checks. There shouldn't be any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Merge branch 'for-6.5-fixes' into for-6.6Tejun Heo
The unbound workqueue execution locality improvement patchset is about to be applied, which will cause merge conflicts with changes in for-6.5-fixes. Let's avoid future merge conflicts by pulling in for-6.5-fixes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: use LIST_HEAD to initialize cull_listYang Yingliang
Use LIST_HEAD() to initialize cull_list instead of open-coding it. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
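The pattern in question, shown in isolation (function names are just for illustration):

    static void example_before(void)
    {
            struct list_head cull_list;

            INIT_LIST_HEAD(&cull_list);     /* declare, then initialize separately */
            /* ... use cull_list ... */
    }

    static void example_after(void)
    {
            LIST_HEAD(cull_list);           /* declares and initializes in one step */
            /* ... use cull_list ... */
    }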
2023-07-25workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000Tejun Heo
wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items. Once detected, they're excluded from concurrency management to prevent them from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is enabled, repeat offenders are also reported so that the code can be updated.

The default threshold is 10ms which is long enough to do a fair bit of work on modern CPUs while short enough to be usually not noticeable. This unfortunately leads to a lot of, arguably spurious, detections on very slow CPUs. Using the same threshold across CPUs whose performance levels may be apart by multiple orders of magnitude doesn't make a whole lot of sense.

This patch scales wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS is below 4000. This is obviously very inaccurate but it doesn't have to be accurate to be useful. The mechanism is still useful when the threshold is fully scaled up and the benefits of reports are usually shared with everyone regardless of who's reporting, so as long as there are a sufficient number of fast machines reporting, we don't lose much.

Some (or is it all?) ARM CPUs systematically report significantly lower BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs are, it's at least a missed opportunity and it probably would be a good idea to teach workqueue about it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
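A hedged sketch of the scaling idea using the numbers in the message; the exact mainline formula, and whether a user-set value is left untouched, may differ:

    static void __init wq_cpu_intensive_thresh_init(void)
    {
            unsigned long thresh = 10 * USEC_PER_MSEC;      /* 10ms default */
            /* BogoMIPS is roughly loops_per_jiffy / 500000 * HZ */
            unsigned long bogo = max_t(unsigned long,
                                       loops_per_jiffy / 500000 * HZ, 1);

            if (bogo < 4000)
                    /* scale up inversely with BogoMIPS, capped at 1 second */
                    thresh = min_t(unsigned long,
                                   thresh * 4000 / bogo, USEC_PER_SEC);

            wq_cpu_intensive_thresh_us = thresh;
    }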
2023-07-10workqueue: add cmdline parameter `workqueue.unbound_cpus` to further constrain wq_unbound_cpumask at boot timetiozhang
The motivation for doing this is to improve boot times for devices when we want to prevent our workqueue works from running on some specific CPUs, e.g., CPUs that are busy with interrupts.

Signed-off-by: tiozhang <tiozhang@didiglobal.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10workqueue: Warn attempt to flush system-wide workqueues.Tetsuo Handa
Based on commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro"), all in-tree users stopped flushing system-wide workqueues. Therefore, start emitting a runtime message so that all out-of-tree users will understand that they need to update their code.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-27Merge tag 'wq-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo:

 - Concurrency-managed per-cpu work items that hog CPUs and delay the execution of other work items are now automatically detected and excluded from concurrency management. Reporting on such work items can also be enabled through a config option.

 - Added tools/workqueue/wq_monitor.py which improves visibility into workqueue usages and behaviors.

 - Arnd's minimal fix for gcc-13 enum warning on 32bit compiles, superseded by commit afa4bb778e48 in mainline.

* tag 'wq-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0
  workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()
  workqueue: fix enum type for gcc-13
  workqueue: Track and monitor per-workqueue CPU time usage
  workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
  workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
  workqueue: Improve locking rule description for worker fields
  workqueue: Move worker_set/clr_flags() upwards
  workqueue: Re-order struct worker fields
  workqueue: Add pwq->stats[] and a monitoring script
  Further upgrade queue_work_on() comment
2023-06-23workqueue: clean up WORK_* constant types, clarify maskingLinus Torvalds
Dave Airlie reports that gcc-13.1.1 has started complaining about some of the workqueue code in 32-bit arm builds:

  kernel/workqueue.c: In function ‘get_work_pwq’:
  kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
    713 |         return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
        |                        ^

  [ ... a couple of other cases ... ]

and while it's not immediately clear exactly why gcc started complaining about it now, I suspect some C23-induced enum type handling fixup in gcc-13 is the cause.

Whatever the reason for starting to complain, the code and data types are indeed disgusting enough that the complaint is warranted.

The wq code ends up creating various "helper constants" (like that WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of confused. The mask needs to be 'unsigned long', not some unspecified enum type.

To make matters worse, the actual "mask and cast to a pointer" is repeated a couple of times, and the cast isn't even always done to the right pointer, but - as the error case above - to a 'void *' with then the compiler finishing the job.

That's not how we roll in the kernel.

So create the masks using the proper types rather than some ambiguous enumeration, and use a nice helper that actually does the type conversion in one well-defined place.

Incidentally, this magically makes clang generate better code. That, admittedly, is really just a sign of clang having been seriously confused before, and cleaning up the typing unconfuses the compiler too.

Reported-by: Dave Airlie <airlied@gmail.com>
Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
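The shape of such a fix, sketched with the mask name from the error message above; the exact mainline identifiers and helper name may differ:

    /* build the mask with a real integer type, not an enum */
    #define WORK_STRUCT_WQ_DATA_MASK  (~(unsigned long)WORK_STRUCT_FLAG_MASK)

    /* one well-defined place for the mask-and-cast */
    static inline struct pool_workqueue *work_struct_pwq(unsigned long data)
    {
            return (struct pool_workqueue *)(data & WORK_STRUCT_WQ_DATA_MASK);
    }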
2023-05-25workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0Zqiang
If workqueue.cpu_intensive_thresh_us is set to 0, the detection mechanism for CPU-hogging per-cpu work items will keep triggering spuriously:

  workqueue: process_srcu hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: gc_worker hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: gc_worker hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
  workqueue: wait_rcu_exp_gp hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: kfree_rcu_monitor hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: kfree_rcu_monitor hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
  workqueue: reg_todo hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND

This commit therefore disables the CPU-hog detection mechanism when workqueue.cpu_intensive_thresh_us is set to 0.

tj: Patch description updated and the condition check on cpu_intensive_thresh_us separated into a separate if statement for readability.

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-24workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()Zqiang
Currently, pool->nr_running can be modified from the timer tick, which means the timer tick can run nested inside a not-irq-protected section that's in the process of modifying nr_running. Consider the following scenario:

CPU0 kworker/0:2 (events)

  worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
  ->pool->nr_running++;  (1)

  process_one_work()
  ->worker->current_func(work);
  ->schedule()
  ->wq_worker_sleeping()
  ->worker->sleeping = 1;
  ->pool->nr_running--;  (0)
  ....
  ->wq_worker_running()
  ....

CPU0 by interrupt: wq_worker_tick()
  ->worker_set_flags(worker, WORKER_CPU_INTENSIVE);
  ->pool->nr_running--;  (-1)
  ->worker->flags |= WORKER_CPU_INTENSIVE;
  ....
  ->if (!(worker->flags & WORKER_NOT_RUNNING))
  ->pool->nr_running++;  (will not execute)
  ->worker->sleeping = 0;
  ....
  ->worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
  ->pool->nr_running++;  (0)
  ....
  worker_set_flags(worker, WORKER_PREP);
  ->pool->nr_running--;  (-1)
  ....
  worker_enter_idle()
  ->WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);

If nr_workers is equal to nr_idle, the non-zero nr_running triggers the WARN_ON_ONCE():

[    2.460602] WARNING: CPU: 0 PID: 63 at kernel/workqueue.c:1999 worker_enter_idle+0xb2/0xc0
[    2.462163] Modules linked in:
[    2.463401] CPU: 0 PID: 63 Comm: kworker/0:2 Not tainted 6.4.0-rc2-next-20230519 #1
[    2.463771] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[    2.465127] Workqueue:  0x0 (events)
[    2.465678] RIP: 0010:worker_enter_idle+0xb2/0xc0
...
[    2.472614] Call Trace:
[    2.473152]  <TASK>
[    2.474182]  worker_thread+0x71/0x430
[    2.474992]  ? _raw_spin_unlock_irqrestore+0x28/0x50
[    2.475263]  kthread+0x103/0x120
[    2.475493]  ? __pfx_worker_thread+0x10/0x10
[    2.476355]  ? __pfx_kthread+0x10/0x10
[    2.476635]  ret_from_fork+0x2c/0x50
[    2.477051]  </TASK>

This commit therefore adds a check of worker->sleeping in wq_worker_tick(): if worker->sleeping is not zero, return directly.

tj: Updated comment and description.

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Closes: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230519/testrun/17078554/suite/boot/test/clang-nightly-lkftconfig/log
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-17workqueue: Track and monitor per-workqueue CPU time usageTejun Heo
Now that wq_worker_tick() is there, we can easily track the rough CPU time consumption of each workqueue by charging the whole tick whenever a tick hits an active workqueue. While not super accurate, it provides reasonable visibility into the workqueues that consume a lot of CPU cycles. wq_monitor.py is updated to report the per-workqueue CPU times. v2: wq_monitor.py was using "cputime" as the key when outputting in json format. Use "cpu_time" instead for consistency with other fields. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-17workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanismTejun Heo
Workqueue now automatically marks per-cpu work items that hog the CPU for too long as CPU_INTENSIVE, which excludes them from concurrency management and prevents them from stalling other concurrency-managed work items. If a work function keeps running over the threshold, it likely needs to be switched to use an unbound workqueue.

This patch adds a debug mechanism which tracks the work functions which trigger the automatic CPU_INTENSIVE mechanism and reports them using pr_warn() with exponential backoff.

v3: Documentation update.

v2: Drop bouncing to kthread_worker for printing messages. It was to avoid introducing a circular locking dependency through printk but not effective as it still had a pool lock -> wci_lock -> printk -> pool lock loop. Let's just print directly using printk_deferred().

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
2023-05-17workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVETejun Heo
If a per-cpu work item hogs the CPU, it can prevent other work items from starting through concurrency management. A per-cpu workqueue which intends to host such CPU-hogging work items can choose to not participate in concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be error-prone and difficult to debug when missed.

This patch adds an automatic CPU usage based detection. If a concurrency-managed work item consumes more CPU time than the threshold (10ms by default) continuously without intervening sleeps, wq_worker_tick(), which is called from scheduler_tick(), will detect the condition and automatically mark it CPU_INTENSIVE.

The mechanism isn't foolproof:

* Detection depends on the tick hitting the work item. Getting preempted at the right timings may allow a violating work item to evade detection at least temporarily.

* nohz_full CPUs may not be running ticks and thus can fail detection.

* Even when detection is working, the 10ms detection delays can add up if many CPU-hogging work items are queued at the same time.

However, in the vast majority of cases, this should be able to detect violations reliably and provide reasonable protection with a small increase in code complexity.

If some work items trigger this condition repeatedly, the bigger problem likely is the CPU being saturated with such per-cpu work items and the solution would be making them UNBOUND. The next patch will add a debug mechanism to help spot such cases.

v4: Documentation for workqueue.cpu_intensive_thresh_us added to kernel-parameters.txt.

v3: Switch to use wq_worker_tick() instead of hooking into preemptions as suggested by Peter.

v2: Lai pointed out that wq_worker_stopping() also needs to be called from preemption and rtlock paths and an earlier patch was updated accordingly. This patch adds a comment describing the risk of infinite recursions and how they're avoided.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
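A hedged sketch of the tick-time check; it is simplified, and details such as the current_at field, the stats bookkeeping, and the exact wakeup path are assumptions rather than the mainline code:

    void wq_worker_tick(struct task_struct *task)
    {
            struct worker *worker = kthread_data(task);
            struct worker_pool *pool = worker->pool;

            if (!worker->current_work)
                    return;

            /* CPU time the current work item has consumed so far */
            if (task->se.sum_exec_runtime - worker->current_at <
                wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
                    return;

            raw_spin_lock(&pool->lock);
            /* stop counting this worker toward concurrency management */
            worker_set_flags(worker, WORKER_CPU_INTENSIVE);
            /* and kick another worker if work items are waiting */
            if (need_more_worker(pool))
                    wake_up_worker(pool);
            raw_spin_unlock(&pool->lock);
    }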
2023-05-17workqueue: Improve locking rule description for worker fieldsTejun Heo
* Some worker fields are modified only by the worker itself while holding pool->lock, thus making them safe to read from self, IRQ context if the CPU is running the worker, or while holding pool->lock. Add the 'K' locking rule for them.

* worker->sleeping is currently marked "None" which isn't very descriptive. It's used only by the worker itself. Add the 'S' locking rule for it.

A future patch will depend on the 'K' rule to access worker->current_* from the scheduler ticks.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-17workqueue: Move worker_set/clr_flags() upwardsTejun Heo
They are going to be used in wq_worker_stopping(). Move them upwards. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
2023-05-17workqueue: Add pwq->stats[] and a monitoring scriptTejun Heo
Currently, the only way to peer into workqueue operations is through tracing. While possible, it isn't easy or convenient to monitor per-workqueue behaviors over time this way. Let's add pwq->stats[] that track relevant events and a drgn monitoring script - tools/workqueue/wq_monitor.py. It's arguable whether this needs to be configurable. However, it currently only has several counters and the runtime overhead shouldn't be noticeable given that they're on pwq's which are per-cpu on per-cpu workqueues and per-numa-node on unbound ones. Let's keep it simple for the time being. v2: Patch reordered to earlier with fewer fields. Field will be added back gradually. Help message improved. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
2023-05-09Further upgrade queue_work_on() commentPaul E. McKenney
The current queue_work_on() docbook comment says that the caller must ensure that the specified CPU can't go away, and further says that the penalty for failing to nail down the specified CPU is that the workqueue handler might find itself executing on some other CPU. This is true as far as it goes, but fails to note what happens if the specified CPU never was online. Therefore, further expand this comment to say that specifying a CPU that was never online will result in a splat. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
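What the expanded comment warns about, in caller terms: keep the target CPU from going away and only target CPUs that are actually online. The wrapper below is purely illustrative:

    static void queue_on_cpu_example(struct workqueue_struct *wq,
                                     struct work_struct *work, int cpu)
    {
            cpus_read_lock();               /* keep @cpu from going away */
            if (cpu_online(cpu))
                    queue_work_on(cpu, wq, work);
            else
                    queue_work(wq, work);   /* fall back to any CPU */
            cpus_read_unlock();
    }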
2023-04-29Merge tag 'wq-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo:
 "Mostly changes from Petr to improve warning and error reporting.

  Workqueue now reports more of the relevant failures with better context which should help debugging"

* tag 'wq-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Introduce show_freezable_workqueues
  workqueue: Print backtraces from CPUs with hung CPU bound workqueues
  workqueue: Warn when a rescuer could not be created
  workqueue: Interrupted create_worker() is not a repeated event
  workqueue: Warn when a new worker could not be created
  workqueue: Fix hung time report of worker pools
  workqueue: Simplify a pr_warn() call in wq_select_unbound_cpu()
  MAINTAINERS: Add workqueue_internal.h to the WORKQUEUE entry
2023-03-23workqueue: Introduce show_freezable_workqueuesJungseung Lee
Currently, show_all_workqueues() is called when freezing the workqueues fails; it shows the status of all workqueues and of all worker pools. In this case we may only need to dump the state of the workqueues that are freezable and busy.

This patch defines show_freezable_workqueues, which uses show_one_workqueue, a granular function that shows the state of individual workqueues, so that only the state of freezable workqueues is dumped at that time.

tj: Minor message adjustment.

Signed-off-by: Jungseung Lee <js07.lee@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17workqueue: Print backtraces from CPUs with hung CPU bound workqueuesPetr Mladek
The workqueue watchdog reports a lockup when there was not any progress in the worker pool for a long time. Progress means that a pending work item starts being processed.

Worker pools for unbound workqueues always wake up an idle worker and try to process the work immediately. The last idle worker has to create a new worker first. The stall might happen only when a new worker could not be created, in which case an error should get printed. Another problem might be too high load. In this case, workers are victims of a global system problem.

Worker pools for CPU-bound workqueues are designed for lightweight work items that do not need much CPU time. They are processed one by one on a single worker. A new worker is used only when a work item is sleeping. This creates one additional scenario: the stall might happen when the CPU-bound workqueue is used for CPU-intensive work.

More precisely, the stall is detected when a CPU-bound worker is in the TASK_RUNNING state for too long. In this case, it might be useful to see the backtrace from the problematic worker.

The information on how long a worker is in the running state is not available. But the CPU-bound worker pools do not have many workers in the running state by definition. And only a few pools are typically blocked.

It should be acceptable to print backtraces from all workers in TASK_RUNNING state in the stalled worker pools. The number of false positives should be very low.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17workqueue: Warn when a rescuer could not be createdPetr Mladek
Rescuers are created when a workqueue with WQ_MEM_RECLAIM is allocated. It typically happens during the system boot.

systemd switches the root filesystem from initrd to the booted system during boot. It kills processes that block the switch for too long. One of the processes might be modprobe trying to create a workqueue.

These problems are hard to reproduce. Also, alloc_workqueue() does not pass on the error code. Make the debugging easier by printing an error, similar to create_worker().

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17workqueue: Interrupted create_worker() is not a repeated eventPetr Mladek
kthread_create_on_node() might get interrupted. It is rare but realistic. For example, when an unbound workqueue is allocated in a module_init() callback, it is done in the context of the "modprobe" process. And, for example, systemd might kill pending processes when switching root from initrd to the booted system.

The interrupt is a one-off event and the race might be hard to reproduce. It is always worth printing.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17workqueue: Warn when a new worker could not be createdPetr Mladek
The workqueue watchdog reports a lockup when there was not any progress in the worker pool for a long time. Progress means that a pending work item starts being processed. The progress is guaranteed by using idle workers or creating new workers for pending work items.

There are several reasons why a new worker could not be created:

+ there is not enough memory
+ there is no free pool ID (IDR API)
+ the system reached the PID limit
+ the process creating the new worker was interrupted
+ the last idle worker (manager) has not been scheduled for a long time and was not able to even start creating the kthread

None of these failures is reported at the moment. The only clue is that show_one_worker_pool() prints that there is a manager. It is the last idle worker that is responsible for creating a new one. But it is not clear whether create_worker() is failing and why.

Make the debugging easier by printing errors in create_worker(). The error code is important, especially from kthread_create_on_node(). It helps to distinguish the various reasons, for example, reaching the memory limit (-ENOMEM), other system limits (-EAGAIN), or the process being interrupted (-EINTR).

Use pr_once() to avoid repeating the same error every CREATE_COOLDOWN for each stuck worker pool.

A ratelimited printk() might be better. It would help to know whether the problem remains and would make it clearer whether the create_worker() errors and workqueue stalls are related. Also, old messages might get lost when the internal log buffer is full. The problem is that printk() might touch the watchdog. For example, see touch_nmi_watchdog() in serial8250_console_write(). It would require synchronizing the beginning and length of the ratelimit interval with the workqueue watchdog. Otherwise, the error messages might break the watchdog. This does not look worth the complexity.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17workqueue: Fix hung time report of worker poolsPetr Mladek
The workqueue watchdog prints a warning when there is no progress in a worker pool, where progress means that the pool started processing a pending work item. Note that it is perfectly fine when processing a particular work item takes much longer. The progress should be guaranteed by waking up or creating idle workers.

show_one_worker_pool() prints the state of a non-idle worker pool. It shows a delay since the last pool->watchdog_ts. The timestamp is updated when the first pending work is queued in __queue_work(). It is also updated when a work item is dequeued for processing in worker_thread() and rescuer_thread().

The delay is misleading when there is no pending work item. In this case it shows how long the last work item has been in progress. Show zero instead. There is no stall if there is no pending work.

Fixes: 82607adcf9cdf40fb7b ("workqueue: implement lockup detector")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17workqueue: Simplify a pr_warn() call in wq_select_unbound_cpu()Ammar Faizi
Use pr_warn_once() to achieve the same thing. It's simpler. Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17workqueue: move to use bus_get_dev_root()Greg Kroah-Hartman
Direct access to the struct bus_type dev_root pointer is going away soon so replace that with a call to bus_get_dev_root() instead, which is what it is there for. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230313182918.1312597-8-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-13workqueue: Fold rebind_worker() within rebind_workers()Valentin Schneider
!CONFIG_SMP builds complain about rebind_worker() being unused. Its only user, rebind_workers(), is indeed only defined for CONFIG_SMP, so just fold the two lines back up there.

Link: http://lore.kernel.org/r/20230113143102.2e94d74f@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-12workqueue: Unbind kworkers before sending them to exit()Valentin Schneider
It has been reported that isolated CPUs can suffer from interference due to per-CPU kworkers waking up just to die.

A surge of workqueue activity during initial setup of a latency-sensitive application (refresh_vm_stats() being one of the culprits) can cause extra per-CPU kworkers to be spawned. Then, said latency-sensitive task can be running merrily on an isolated CPU only to be interrupted sometime later by a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last kworker activity).

Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking them with WORKER_DIE.

Changing the affinity does require a sleepable context; leverage the newly introduced pool->idle_cull_work to get that.

Remove dying workers from pool->workers and keep track of them in a separate list. This intentionally prevents for_each_loop_worker() from iterating over workers that are marked for death.

Rename destroy_worker() to set_worker_dying() to better reflect its effects and relationship with wake_dying_workers().

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-12workqueue: Don't hold any lock while rcuwait'ing for !POOL_MANAGER_ACTIVEValentin Schneider
put_unbound_pool() currently passes wq_manager_inactive() as the exit condition to rcuwait_wait_event(), which grabs pool->lock to check for

  pool->flags & POOL_MANAGER_ACTIVE

A later patch will require destroy_worker() to be invoked with wq_pool_attach_mutex held, which needs to be acquired before pool->lock. A mutex cannot be acquired within rcuwait_wait_event(), as it could clobber the task state set by rcuwait_wait_event().

Instead, restructure the waiting logic to acquire any necessary lock outside of rcuwait_wait_event().

Since further work cannot be inserted into unbound pwqs that have reached ->refcnt==0, this is bound to make forward progress as eventually the worklist will be drained and need_more_worker(pool) will remain false, preventing any worker from stealing the manager position from us.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-12workqueue: Convert the idle_timer to a timer + work_structValentin Schneider
A later patch will require a sleepable context in the idle worker timeout function. Converting worker_pool.idle_timer to a delayed_work gives us just that; however, this would imply turning all idle_timer expiries into scheduler events (waking up a worker to handle the dwork).

Instead, implement a "custom dwork" where the timer callback does some extra checks before queuing the associated work.

No change in functionality intended.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
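The "custom dwork" pattern described above, in general form; the names (idle_cull_fn, idle_cull_needed, and so on) are placeholders for illustration, not the exact workqueue.c code:

    static bool idle_cull_needed(void);     /* placeholder for the "extra checks" */

    /* sleepable work: can grab mutexes, change affinities, etc. */
    static void idle_cull_fn(struct work_struct *work)
    {
            /* ... the actual culling ... */
    }
    static DECLARE_WORK(idle_cull_work, idle_cull_fn);

    /* the timer callback stays cheap: check, then maybe queue the work */
    static void idle_timer_fn(struct timer_list *t)
    {
            if (idle_cull_needed())
                    queue_work(system_wq, &idle_cull_work);
            else
                    mod_timer(t, jiffies + IDLE_WORKER_TIMEOUT);
    }
    static DEFINE_TIMER(idle_timer, idle_timer_fn);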
2023-01-12workqueue: Factorize unbind/rebind_workers() logicValentin Schneider
Later patches will reuse this code, so move it into reusable functions.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-12workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutexLai Jiangshan
When unbind_workers() reads wq_unbound_cpumask to set the affinity of freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.

Make wq_unbound_cpumask protected with wq_pool_attach_mutex and also remove the need for the temporary saved_cpumask.

Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Reported-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-06workqueue: Make show_pwq() use run-length encodingPaul E. McKenney
The show_pwq() function dumps out a pool_workqueue structure's activity, including the pending work-queue handlers:

 Showing busy workqueues and worker pools:
 workqueue events: flags=0x0
   pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
     in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
     pending: test_work_func, test_work_func, test_work_func1, test_work_func1, test_work_func1, test_work_func1, test_work_func1

When large systems are facing certain types of hang conditions, it is not unusual for this "pending" list to contain runs of hundreds of identical function names. This "wall of text" is difficult to read, and worse yet, it can be interleaved with other output such as stack traces.

Therefore, make show_pwq() use run-length encoding so that the above printout instead looks like this:

 Showing busy workqueues and worker pools:
 workqueue events: flags=0x0
   pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
     in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
     pending: 2*test_work_func, 5*test_work_func1

When no comma would be printed, including the WORK_STRUCT_LINKED case, a new run is started unconditionally.

This output is more readable, places less stress on the hardware, firmware, and software on the console-log path, and reduces interference with other output.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
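A minimal sketch of the run-length idea in isolation (not the exact show_pwq() code, which also special-cases linked work items):

    /* print one run, e.g. "2*test_work_func" or just "test_work_func" */
    static void pr_run(const char *name, int rep, bool comma)
    {
            if (rep > 1)
                    pr_cont("%s%d*%s", comma ? ", " : "", rep, name);
            else
                    pr_cont("%s%s", comma ? ", " : "", name);
    }

    /* collapse consecutive identical names into "N*name" runs */
    static void print_rle(const char *funcs[], int n)
    {
            int i, rep = 1;
            bool comma = false;

            for (i = 1; i <= n; i++) {
                    if (i < n && !strcmp(funcs[i], funcs[i - 1])) {
                            rep++;
                            continue;
                    }
                    pr_run(funcs[i - 1], rep, comma);
                    comma = true;
                    rep = 1;
            }
            pr_cont("\n");
    }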