summaryrefslogtreecommitdiff
path: root/kernel/workqueue.c
AgeCommit message (Collapse)Author
2012-09-10workqueue: restore POOL_MANAGING_WORKERSLai Jiangshan
This patch restores POOL_MANAGING_WORKERS which was replaced by pool->manager_mutex by 6037315269 "workqueue: use mutex for global_cwq manager exclusion". There's a subtle idle worker depletion bug across CPU hotplug events and we need to distinguish an actual manager and CPU hotplug preventing management. POOL_MANAGING_WORKERS will be used for the former and manager_mutex the later. This patch just lays POOL_MANAGING_WORKERS on top of the existing manager_mutex and doesn't introduce any synchronization changes. The next patch will update it. Note that this patch fixes a non-critical anomaly where too_many_workers() may return %true spuriously while CPU hotplug is in progress. While the issue could schedule idle timer spuriously, it didn't trigger any actual misbehavior. tj: Rewrote patch description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-09-05workqueue: fix possible deadlock in idle worker rebindingTejun Heo
Currently, rebind_workers() and idle_worker_rebind() are two-way interlocked. rebind_workers() waits for idle workers to finish rebinding and rebound idle workers wait for rebind_workers() to finish rebinding busy workers before proceeding. Unfortunately, this isn't enough. The second wait from idle workers is implemented as follows. wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND)); rebind_workers() clears WORKER_REBIND, wakes up the idle workers and then returns. If CPU hotplug cycle happens again before one of the idle workers finishes the above wait_event(), rebind_workers() will repeat the first part of the handshake - set WORKER_REBIND again and wait for the idle worker to finish rebinding - and this leads to deadlock because the idle worker would be waiting for WORKER_REBIND to clear. This is fixed by adding another interlocking step at the end - rebind_workers() now waits for all the idle workers to finish the above WORKER_REBIND wait before returning. This ensures that all rebinding steps are complete on all idle workers before the next hotplug cycle can happen. This problem was diagnosed by Lai Jiangshan who also posted a patch to fix the issue, upon which this patch is based. This is the minimal fix and further patches are scheduled for the next merge window to simplify the CPU hotplug path. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>
2012-09-05workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the ↵Tejun Heo
function This doesn't make any functional difference and is purely to help the next patch to be simpler. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
2012-09-04workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomicLai Jiangshan
The compiler may compile the following code into TWO write/modify instructions. worker->flags &= ~WORKER_UNBOUND; worker->flags |= WORKER_REBIND; so the other CPU may temporarily see worker->flags which doesn't have either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup prematurely. Fix it by using single explicit assignment via ACCESS_ONCE(). Because idle workers have another WORKER_NOT_RUNNING flag, this bug doesn't exist for them; however, update it to use the same pattern for consistency. tj: Applied the change to idle workers too and updated comments and patch description a bit. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2012-08-21workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()Tejun Heo
cancel_delayed_work() can't be called from IRQ handlers due to its use of del_timer_sync() and can't cancel work items which are already transferred from timer to worklist. Also, unlike other flush and cancel functions, a canceled delayed_work would still point to the last associated cpu_workqueue. If the workqueue is destroyed afterwards and the work item is re-used on a different workqueue, the queueing code can oops trying to dereference already freed cpu_workqueue. This patch reimplements cancel_delayed_work() using try_to_grab_pending() and set_work_cpu_and_clear_pending(). This allows the function to be called from IRQ handlers and makes its behavior consistent with other flush / cancel functions. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2012-08-21workqueue: use irqsafe timer for delayed_workTejun Heo
Up to now, for delayed_works, try_to_grab_pending() couldn't be used from IRQ handlers because IRQs may happen while delayed_work_timer_fn() is in progress leading to indefinite -EAGAIN. This patch makes delayed_work use the new TIMER_IRQSAFE flag for delayed_work->timer. This makes try_to_grab_pending() and thus mod_delayed_work_on() safe to call from IRQ handlers. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20workqueue: gut system_nrt[_freezable]_wq()Tejun Heo
Now that all workqueues are non-reentrant, system[_freezable]_wq() are equivalent to system_nrt[_freezable]_wq(). Replace the latter with wrappers around system[_freezable]_wq(). The wrapping goes through inline functions so that __deprecated can be added easily. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20workqueue: gut flush[_delayed]_work_sync()Tejun Heo
Now that all workqueues are non-reentrant, flush[_delayed]_work_sync() are equivalent to flush[_delayed]_work(). Drop the separate implementation and make them thin wrappers around flush[_delayed]_work(). * start_flush_work() no longer takes @wait_executing as the only left user - flush_work() - always sets it to %true. * __cancel_work_timer() uses flush_work() instead of wait_on_work(). Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20workqueue: make all workqueues non-reentrantTejun Heo
By default, each per-cpu part of a bound workqueue operates separately and a work item may be executing concurrently on different CPUs. The behavior avoids some cross-cpu traffic but leads to subtle weirdities and not-so-subtle contortions in the API. * There's no sane usefulness in allowing a single work item to be executed concurrently on multiple CPUs. People just get the behavior unintentionally and get surprised after learning about it. Most either explicitly synchronize or use non-reentrant/ordered workqueue but this is error-prone. * flush_work() can't wait for multiple instances of the same work item on different CPUs. If a work item is executing on cpu0 and then queued on cpu1, flush_work() can only wait for the one on cpu1. Unfortunately, work items can easily cross CPU boundaries unintentionally when the queueing thread gets migrated. This means that if multiple queuers compete, flush_work() can't even guarantee that the instance queued right before it is finished before returning. * flush_work_sync() was added to work around some of the deficiencies of flush_work(). In addition to the usual flushing, it ensures that all currently executing instances are finished before returning. This operation is expensive as it has to walk all CPUs and at the same time fails to address competing queuer case. Incorrectly using flush_work() when flush_work_sync() is necessary is an easy error to make and can lead to bugs which are difficult to reproduce. * Similar problems exist for flush_delayed_work[_sync](). Other than the cross-cpu access concern, there's no benefit in allowing parallel execution and it's plain silly to have this level of contortion for workqueue which is widely used from core code to extremely obscure drivers. This patch makes all workqueues non-reentrant. If a work item is executing on a different CPU when queueing is requested, it is always queued to that CPU. This guarantees that any given work item can be executing on one CPU at maximum and if a work item is queued and executing, both are on the same CPU. The only behavior change which may affect workqueue users negatively is that non-reentrancy overrides the affinity specified by queue_work_on(). On a reentrant workqueue, the affinity specified by queue_work_on() is always followed. Now, if the work item is executing on one of the CPUs, the work item will be queued there regardless of the requested affinity. I've reviewed all workqueue users which request explicit affinity, and, fortunately, none seems to be crazy enough to exploit parallel execution of the same work item. This adds an additional busy_hash lookup if the work item was previously queued on a different CPU. This shouldn't be noticeable under any sane workload. Work item queueing isn't a very high-frequency operation and they don't jump across CPUs all the time. In a micro benchmark to exaggerate this difference - measuring the time it takes for two work items to repeatedly jump between two CPUs a number (10M) of times with busy_hash table densely populated, the difference was around 3%. While the overhead is measureable, it is only visible in pathological cases and the difference isn't huge. This change brings much needed sanity to workqueue and makes its behavior consistent with timer. I think this is the right tradeoff to make. This enables significant simplification of workqueue API. Simplification patches will follow. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20workqueue: fix checkpatch issuesValentin Ilie
Fixed some checkpatch warnings. tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit. Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com>
2012-08-16workqueue: use system_highpri_wq for unbind_workJoonsoo Kim
To speed cpu down processing up, use system_highpri_wq. As scheduling priority of workers on it is higher than system_wq and it is not contended by other normal works on this cpu, work on it is processed faster than system_wq. tj: CPU up/downs care quite a bit about latency these days. This shouldn't hurt anything and makes sense. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16workqueue: use system_highpri_wq for highpri workers in rebind_workers()Joonsoo Kim
In rebind_workers(), we do inserting a work to rebind to cpu for busy workers. Currently, in this case, we use only system_wq. This makes a possible error situation as there is mismatch between cwq->pool and worker->pool. To prevent this, we should use system_highpri_wq for highpri worker to match theses. This implements it. tj: Rephrased comment a bit. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16workqueue: introduce system_highpri_wqJoonsoo Kim
Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker or highpri worker. But, we don't consider this difference in rebind_workers(), we use just system_wq for highpri worker. It makes mismatch between cwq->pool and worker->pool. It doesn't make error in current implementation, but possible in the future. Now, we introduce system_highpri_wq to use proper cwq for highpri workers in rebind_workers(). Following patch fix this issue properly. tj: Even apart from rebinding, having system_highpri_wq generally makes sense. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16workqueue: change value of lcpu in __queue_delayed_work_on()Joonsoo Kim
We assign cpu id into work struct's data field in __queue_delayed_work_on(). In current implementation, when work is come in first time, current running cpu id is assigned. If we do __queue_delayed_work_on() with CPU A on CPU B, __queue_work() invoked in delayed_work_timer_fn() go into the following sub-optimal path in case of WQ_NON_REENTRANT. gcwq = get_gcwq(cpu); if (wq->flags & WQ_NON_REENTRANT && (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) { Change lcpu to @cpu and rechange lcpu to local cpu if lcpu is WORK_CPU_UNBOUND. It is sufficient to prevent to go into sub-optimal path. tj: Slightly rephrased the comment. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16workqueue: correct req_cpu in trace_workqueue_queue_work()Joonsoo Kim
When we do tracing workqueue_queue_work(), it records requested cpu. But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND, requested cpu is changed as local cpu. In case of @wq->flag & WQ_UNBOUND, above change is not occured, therefore it is reasonable to correct it. Use temporary local variable for storing requested cpu. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16workqueue: use enum value to set array size of pools in gcwqJoonsoo Kim
Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent size of pools, definition of worker_pool in gcwq doesn't use it. Using it makes code robust and prevent future mistakes. So change code to use this enum value. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-13workqueue: add missing wmb() in clear_work_data()Tejun Heo
Any operation which clears PENDING should be preceded by a wmb to guarantee that the next PENDING owner sees all the changes made before PENDING release. There are only two places where PENDING is cleared - set_work_cpu_and_clear_pending() and clear_work_data(). The caller of the former already does smp_wmb() but the latter doesn't have any. Move the wmb above set_work_cpu_and_clear_pending() into it and add one to clear_work_data(). There hasn't been any report related to this issue, and, given how clear_work_data() is used, it is extremely unlikely to have caused any actual problems on any architecture. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com>
2012-08-13workqueue: fix CPU binding of flush_delayed_work[_sync]()Tejun Heo
delayed_work encodes the workqueue to use and the last CPU in delayed_work->work.data while it's on timer. The target CPU is implicitly recorded as the CPU the timer is queued on and delayed_work_timer_fn() queues delayed_work->work to the CPU it is running on. Unfortunately, this leaves flush_delayed_work[_sync]() no way to find out which CPU the delayed_work was queued for when they try to re-queue after killing the timer. Currently, it chooses the local CPU flush is running on. This can unexpectedly move a delayed_work queued on a specific CPU to another CPU and lead to subtle errors. There isn't much point in trying to save several bytes in struct delayed_work, which is already close to a hundred bytes on 64bit with all debug options turned off. This patch adds delayed_work->cpu to remember the CPU it's queued for. Note that if the timer is migrated during CPU down, the work item could be queued to the downed global_cwq after this change. As a detached global_cwq behaves like an unbound one, this doesn't change much for the delayed_work. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2012-08-03workqueue: implement mod_delayed_work[_on]()Tejun Heo
Workqueue was lacking a mechanism to modify the timeout of an already pending delayed_work. delayed_work users have been working around this using several methods - using an explicit timer + work item, messing directly with delayed_work->timer, and canceling before re-queueing, all of which are error-prone and/or ugly. This patch implements mod_delayed_work[_on]() which behaves similarly to mod_timer() - if the delayed_work is idle, it's queued with the given delay; otherwise, its timeout is modified to the new value. Zero @delay guarantees immediate execution. v2: Updated to reflect try_to_grab_pending() changes. Now safe to be called from bh context. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com>
2012-08-03workqueue: mark a work item being canceled as suchTejun Heo
There can be two reasons try_to_grab_pending() can fail with -EAGAIN. One is when someone else is queueing or deqeueing the work item. With the previous patches, it is guaranteed that PENDING and queued state will soon agree making it safe to busy-retry in this case. The other is if multiple __cancel_work_timer() invocations are racing one another. __cancel_work_timer() grabs PENDING and then waits for running instances of the target work item on all CPUs while holding PENDING and !queued. try_to_grab_pending() invoked from another task will keep returning -EAGAIN while the current owner is waiting. Not distinguishing the two cases is okay because __cancel_work_timer() is the only user of try_to_grab_pending() and it invokes wait_on_work() whenever grabbing fails. For the first case, busy looping should be fine but wait_on_work() doesn't cause any critical problem. For the latter case, the new contender usually waits for the same condition as the current owner, so no unnecessarily extended busy-looping happens. Combined, these make __cancel_work_timer() technically correct even without irq protection while grabbing PENDING or distinguishing the two different cases. While the current code is technically correct, not distinguishing the two cases makes it difficult to use try_to_grab_pending() for other purposes than canceling because it's impossible to tell whether it's safe to busy-retry grabbing. This patch adds a mechanism to mark a work item being canceled. try_to_grab_pending() now disables irq on success and returns -EAGAIN to indicate that grabbing failed but PENDING and queued states are gonna agree soon and it's safe to busy-loop. It returns -ENOENT if the work item is being canceled and it may stay PENDING && !queued for arbitrary amount of time. __cancel_work_timer() is modified to mark the work canceling with WORK_OFFQ_CANCELING after grabbing PENDING, thus making try_to_grab_pending() fail with -ENOENT instead of -EAGAIN. Also, it invokes wait_on_work() iff grabbing failed with -ENOENT. This isn't necessary for correctness but makes it consistent with other future users of try_to_grab_pending(). v2: try_to_grab_pending() was testing preempt_count() to ensure that the caller has disabled preemption. This triggers spuriously if !CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by Fengguang Wu. v3: Updated so that try_to_grab_pending() disables irq on success rather than requiring preemption disabled by the caller. This makes busy-looping easier and will allow try_to_grap_pending() to be used from bh/irq contexts. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com>
2012-08-03workqueue: reorganize try_to_grab_pending() and __cancel_timer_work()Tejun Heo
* Use bool @is_dwork instead of @timer and let try_to_grab_pending() use to_delayed_work() to determine the delayed_work address. * Move timer handling from __cancel_work_timer() to try_to_grab_pending(). * Make try_to_grab_pending() use -EAGAIN instead of -1 for busy-looping and drop the ret local variable. * Add proper function comment to try_to_grab_pending(). This makes the code a bit easier to understand and will ease further changes. This patch doesn't make any functional change. v2: Use @is_dwork instead of @timer. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03workqueue: factor out __queue_delayed_work() from queue_delayed_work_on()Tejun Heo
This is to prepare for mod_delayed_work[_on]() and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03workqueue: introduce WORK_OFFQ_FLAG_*Tejun Heo
Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain WORK_STRUCT_FLAG_* and flush color. If the work item is queued, the rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise, WORK_STRUCT_CWQ is clear and the bits contain the last CPU number - either a real CPU number or one of WORK_CPU_*. Scheduled addition of mod_delayed_work[_on]() requires an additional flag, which is used only while a work item is off queue. There are more than enough bits to represent off-queue CPU number on both 32 and 64bits. This patch introduces WORK_OFFQ_FLAG_* which occupy the lower part of the @work->data high bits while off queue. This patch doesn't define any actual OFFQ flag yet. Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to make room for OFFQ flags. To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON() to check that there are enough bits to accomodate off-queue CPU number is added. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03workqueue: move try_to_grab_pending() upwardsTejun Heo
try_to_grab_pending() will be used by to-be-implemented mod_delayed_work[_on](). Move try_to_grab_pending() and related functions above queueing functions. This patch only moves functions around. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03workqueue: fix zero @delay handling of queue_delayed_work_on()Tejun Heo
If @delay is zero and the dealyed_work is idle, queue_delayed_work() queues it for immediate execution; however, queue_delayed_work_on() lacks this logic and always goes through timer regardless of @delay. This patch moves 0 @delay handling logic from queue_delayed_work() to queue_delayed_work_on() so that both functions behave the same. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03workqueue: unify local CPU queueing handlingTejun Heo
Queueing functions have been using different methods to determine the local CPU. * queue_work() superflously uses get/put_cpu() to acquire and hold the local CPU across queue_work_on(). * delayed_work_timer_fn() uses smp_processor_id(). * queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu which is interpreted as the local CPU. * flush_delayed_work[_sync]() were using raw_smp_processor_id(). * __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the target workqueue is bound one but nobody uses this. This patch converts all functions to uniformly use %WORK_CPU_UNBOUND to indicate local CPU and use the local binding feature of __queue_work(). unlikely() is dropped from %WORK_CPU_UNBOUND handling in __queue_work(). Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03workqueue: set delayed_work->timer function on initializationTejun Heo
delayed_work->timer.function is currently initialized during queue_delayed_work_on(). Export delayed_work_timer_fn() and set delayed_work timer function during delayed_work initialization together with other fields. This ensures the timer function is always valid on an initialized delayed_work. This is to help mod_delayed_work() implementation. To detect delayed_work users which diddle with the internal timer, trigger WARN if timer function doesn't match on queue. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03workqueue: disable irq while manipulating PENDINGTejun Heo
Queueing operations use WORK_STRUCT_PENDING_BIT to synchronize access to the target work item. They first try to claim the bit and proceed with queueing only after that succeeds and there's a window between PENDING being set and the actual queueing where the task can be interrupted or preempted. There's also a similar window in process_one_work() when clearing PENDING. A work item is dequeued, gcwq->lock is released and then PENDING is cleared and the worker might get interrupted or preempted between releasing gcwq->lock and clearing PENDING. cancel[_delayed]_work_sync() tries to claim or steal PENDING. The function assumes that a work item with PENDING is either queued or in the process of being [de]queued. In the latter case, it busy-loops until either the work item loses PENDING or is queued. If canceling coincides with the above described interrupts or preemptions, the canceling task will busy-loop while the queueing or executing task is preempted. This patch keeps irq disabled across claiming PENDING and actual queueing and moves PENDING clearing in process_one_work() inside gcwq->lock so that busy looping from PENDING && !queued doesn't wait for interrupted/preempted tasks. Note that, in process_one_work(), setting last CPU and clearing PENDING got merged into single operation. This removes possible long busy-loops and will allow using try_to_grab_pending() from bh and irq contexts. v2: __queue_work() was testing preempt_count() to ensure that the caller has disabled preemption. This triggers spuriously if !CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by Fengguang Wu. v3: Disable irq instead of preemption. IRQ will be disabled while grabbing gcwq->lock later anyway and this allows using try_to_grab_pending() from bh and irq contexts. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com>
2012-08-03workqueue: add missing smp_wmb() in process_one_work()Tejun Heo
WORK_STRUCT_PENDING is used to claim ownership of a work item and process_one_work() releases it before starting execution. When someone else grabs PENDING, all pre-release updates to the work item should be visible and all updates made by the new owner should happen afterwards. Grabbing PENDING uses test_and_set_bit() and thus has a full barrier; however, clearing doesn't have a matching wmb. Given the preceding spin_unlock and use of clear_bit, I don't believe this can be a problem on an actual machine and there hasn't been any related report but it still is theretically possible for clear_pending to permeate upwards and happen before work->entry update. Add an explicit smp_wmb() before work_clear_pending(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org
2012-08-03workqueue: make queueing functions return boolTejun Heo
All queueing functions return 1 on success, 0 if the work item was already pending. Update them to return bool instead. This signifies better that they don't return 0 / -errno. This is cleanup and doesn't cause any functional difference. While at it, fix comment opening for schedule_work_on(). Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03workqueue: reorder queueing functions so that _on() variants are on topTejun Heo
Currently, queue/schedule[_delayed]_work_on() are located below the counterpart without the _on postifx even though the latter is usually implemented using the former. Swap them. This is cleanup and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-22workqueue: fix spurious CPU locality WARN from process_one_work()Tejun Heo
25511a4776 "workqueue: reimplement CPU online rebinding to handle idle workers" added CPU locality sanity check in process_one_work(). It triggers if a worker is executing on a different CPU without UNBOUND or REBIND set. This works for all normal workers but rescuers can trigger this spuriously when they're serving the unbound or a disassociated global_cwq - rescuers don't have either flag set and thus its gcwq->cpu can be a different value including %WORK_CPU_UNBOUND. Fix it by additionally testing %GCWQ_DISASSOCIATED. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> LKML-Refence: <20120721213656.GA7783@linux.vnet.ibm.com>
2012-07-17workqueue: simplify CPU hotplug codeTejun Heo
With trustee gone, CPU hotplug code can be simplified. * gcwq_claim/release_management() now grab and release gcwq lock too respectively and gained _and_lock and _and_unlock postfixes. * All CPU hotplug logic was implemented in workqueue_cpu_callback() which was called by workqueue_cpu_up/down_callback() for the correct priority. This was because up and down paths shared a lot of logic, which is no longer true. Remove workqueue_cpu_callback() and move all hotplug logic into the two actual callbacks. This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17workqueue: remove CPU offline trusteeTejun Heo
With the previous changes, a disassociated global_cwq now can run as an unbound one on its own - it can create workers as necessary to drain remaining works after the CPU has been brought down and manage the number of workers using the usual idle timer mechanism making trustee completely redundant except for the actual unbinding operation. This patch removes the trustee and let a disassociated global_cwq manage itself. Unbinding is moved to a work item (for CPU affinity) which is scheduled and flushed from CPU_DONW_PREPARE. This patch moves nr_running clearing outside gcwq and manager locks to simplify the code. As nr_running is unused at the point, this is safe. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17workqueue: don't butcher idle workers on an offline CPUTejun Heo
Currently, during CPU offlining, after all pending work items are drained, the trustee butchers all workers. Also, on CPU onlining failure, workqueue_cpu_callback() ensures that the first idle worker is destroyed. Combined, these guarantee that an offline CPU doesn't have any worker for it once all the lingering work items are finished. This guarantee isn't really necessary and makes CPU on/offlining more expensive than needs to be, especially for platforms which use CPU hotplug for powersaving. This patch lets offline CPUs removes idle worker butchering from the trustee and let a CPU which failed onlining keep the created first worker. The first worker is created if the CPU doesn't have any during CPU_DOWN_PREPARE and started right away. If onlining succeeds, the rebind_workers() call in CPU_ONLINE will rebind it like any other workers. If onlining fails, the worker is left alone till the next try. This makes CPU hotplugs cheaper by allowing global_cwqs to keep workers across them and simplifies code. Note that trustee doesn't re-arm idle timer when it's done and thus the disassociated global_cwq will keep all workers until it comes back online. This will be improved by further patches. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17workqueue: reimplement CPU online rebinding to handle idle workersTejun Heo
Currently, if there are left workers when a CPU is being brough back online, the trustee kills all idle workers and scheduled rebind_work so that they re-bind to the CPU after the currently executing work is finished. This works for busy workers because concurrency management doesn't try to wake up them from scheduler callbacks, which require the target task to be on the local run queue. The busy worker bumps concurrency counter appropriately as it clears WORKER_UNBOUND from the rebind work item and it's bound to the CPU before returning to the idle state. To reduce CPU on/offlining overhead (as many embedded systems use it for powersaving) and simplify the code path, workqueue is planned to be modified to retain idle workers across CPU on/offlining. This patch reimplements CPU online rebinding such that it can also handle idle workers. As noted earlier, due to the local wakeup requirement, rebinding idle workers is tricky. All idle workers must be re-bound before scheduler callbacks are enabled. This is achieved by interlocking idle re-binding. Idle workers are requested to re-bind and then hold until all idle re-binding is complete so that no bound worker starts executing work item. Only after all idle workers are re-bound and parked, CPU_ONLINE proceeds to release them and queue rebind work item to busy workers thus guaranteeing scheduler callbacks aren't invoked until all idle workers are ready. worker_rebind_fn() is renamed to busy_worker_rebind_fn() and idle_worker_rebind() for idle workers is added. Rebinding logic is moved to rebind_workers() and now called from CPU_ONLINE after flushing trustee. While at it, add CPU sanity check in worker_thread(). Note that now a worker may become idle or the manager between trustee release and rebinding during CPU_ONLINE. As the previous patch updated create_worker() so that it can be used by regular manager while unbound and this patch implements idle re-binding, this is safe. This prepares for removal of trustee and keeping idle workers across CPU hotplugs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17workqueue: drop @bind from create_worker()Tejun Heo
Currently, create_worker()'s callers are responsible for deciding whether the newly created worker should be bound to the associated CPU and create_worker() sets WORKER_UNBOUND only for the workers for the unbound global_cwq. Creation during normal operation is always via maybe_create_worker() and @bind is true. For workers created during hotplug, @bind is false. Normal operation path is planned to be used even while the CPU is going through hotplug operations or offline and this static decision won't work. Drop @bind from create_worker() and decide whether to bind by looking at GCWQ_DISASSOCIATED. create_worker() will also set WORKER_UNBOUND autmatically if disassociated. To avoid flipping GCWQ_DISASSOCIATED while create_worker() is in progress, the flag is now allowed to be changed only while holding all manager_mutexes on the global_cwq. This requires that GCWQ_DISASSOCIATED is not cleared behind trustee's back. CPU_ONLINE no longer clears DISASSOCIATED before flushing trustee, which clears DISASSOCIATED before rebinding remaining workers if asked to release. For cases where trustee isn't around, CPU_ONLINE clears DISASSOCIATED after flushing trustee. Also, now, first_idle has UNBOUND set on creation which is explicitly cleared by CPU_ONLINE while binding it. These convolutions will soon be removed by further simplification of CPU hotplug path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17workqueue: use mutex for global_cwq manager exclusionTejun Heo
POOL_MANAGING_WORKERS is used to ensure that at most one worker takes the manager role at any given time on a given global_cwq. Trustee later hitched on it to assume manager adding blocking wait for the bit. As trustee already needed a custom wait mechanism, waiting for MANAGING_WORKERS was rolled into the same mechanism. Trustee is scheduled to be removed. This patch separates out MANAGING_WORKERS wait into per-pool mutex. Workers use mutex_trylock() to test for manager role and trustee uses mutex_lock() to claim manager roles. gcwq_claim/release_management() helpers are added to grab and release manager roles of all pools on a global_cwq. gcwq_claim_management() always grabs pool manager mutexes in ascending pool index order and uses pool index as lockdep subclass. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17workqueue: ROGUE workers are UNBOUND workersTejun Heo
Currently, WORKER_UNBOUND is used to mark workers for the unbound global_cwq and WORKER_ROGUE is used to mark workers for disassociated per-cpu global_cwqs. Both are used to make the marked worker skip concurrency management and the only place they make any difference is in worker_enter_idle() where WORKER_ROGUE is used to skip scheduling idle timer, which can easily be replaced with trustee state testing. This patch replaces WORKER_ROGUE with WORKER_UNBOUND and drops WORKER_ROGUE. This is to prepare for removing trustee and handling disassociated global_cwqs as unbound. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17workqueue: drop CPU_DYING notifier operationTejun Heo
Workqueue used CPU_DYING notification to mark GCWQ_DISASSOCIATED. This was necessary because workqueue's CPU_DOWN_PREPARE happened before other DOWN_PREPARE notifiers and workqueue needed to stay associated across the rest of DOWN_PREPARE. After the previous patch, workqueue's DOWN_PREPARE happens after others and can set GCWQ_DISASSOCIATED directly. Drop CPU_DYING and let the trustee set GCWQ_DISASSOCIATED after disabling concurrency management. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17workqueue: perform cpu down operations from low priority cpu_notifier()Tejun Heo
Currently, all workqueue cpu hotplug operations run off CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to ensure that workqueue is up and running while bringing up a CPU before other notifiers try to use workqueue on the CPU. Per-cpu workqueues are supposed to remain working and bound to the CPU for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even with workqueue offlining running with higher priority because workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which runs the per-cpu workqueue without concurrency management without explicitly detaching the existing workers. However, if the trustee needs to create new workers, it creates unbound workers which may wander off to other CPUs while CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU down is cancelled, the per-CPU workqueue may end up with workers which aren't bound to the CPU. While reliably reproducible with a convoluted artificial test-case involving scheduling and flushing CPU burning work items from CPU down notifiers, this isn't very likely to happen in the wild, and, even when it happens, the effects are likely to be hidden by the following successful CPU down. Fix it by using different priorities for up and down notifiers - high priority for up operations and low priority for down operations. Workqueue cpu hotplug operations will soon go through further cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-13workqueue: reimplement WQ_HIGHPRI using a separate worker_poolTejun Heo
WQ_HIGHPRI was implemented by queueing highpri work items at the head of the global worklist. Other than queueing at the head, they weren't handled differently; unfortunately, this could lead to execution latency of a few seconds on heavily loaded systems. Now that workqueue code has been updated to deal with multiple worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using a separate worker_pool. NR_WORKER_POOLS is bumped to two and gcwq->pools[0] is used for normal pri work items and ->pools[1] for highpri. Highpri workers get -20 nice level and has 'H' suffix in their names. Note that this change increases the number of kworkers per cpu. POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and highpri chain wakeup code in process_one_work() are no longer used and removed. This allows proper prioritization of highpri work items and removes high execution latency of highpri work items. v2: nr_running indexing bug in get_pool_nr_running() fixed. v3: Refreshed for the get_pool_nr_running() update in the previous patch. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Josh Hunt <joshhunt00@gmail.com> LKML-Reference: <CAKA=qzaHqwZ8eqpLNFjxnO2fX-tgAOjmpvxgBFjv6dJeQaOW1w@mail.gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com>
2012-07-13workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()Tejun Heo
Introduce NR_WORKER_POOLS and for_each_worker_pool() and convert code paths which need to manipulate all pools in a gcwq to use them. NR_WORKER_POOLS is currently one and for_each_worker_pool() iterates over only @gcwq->pool. Note that nr_running is per-pool property and converted to an array with NR_WORKER_POOLS elements and renamed to pool_nr_running. Note that get_pool_nr_running() currently assumes 0 index. The next patch will make use of non-zero index. The changes in this patch are mechanical and don't caues any functional difference. This is to prepare for multiple pools per gcwq. v2: nr_running indexing bug in get_pool_nr_running() fixed. v3: Pointer to array is stupid. Don't use it in get_pool_nr_running() as suggested by Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-12workqueue: separate out worker_pool flagsTejun Heo
GCWQ_MANAGE_WORKERS, GCWQ_MANAGING_WORKERS and GCWQ_HIGHPRI_PENDING are per-pool properties. Add worker_pool->flags and make the above three flags per-pool flags. The changes in this patch are mechanical and don't caues any functional difference. This is to prepare for multiple pools per gcwq. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-12workqueue: use @pool instead of @gcwq or @cpu where applicableTejun Heo
Modify all functions which deal with per-pool properties to pass around @pool instead of @gcwq or @cpu. The changes in this patch are mechanical and don't caues any functional difference. This is to prepare for multiple pools per gcwq. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-12workqueue: factor out worker_pool from global_cwqTejun Heo
Move worklist and all worker management fields from global_cwq into the new struct worker_pool. worker_pool points back to the containing gcwq. worker and cpu_workqueue_struct are updated to point to worker_pool instead of gcwq too. This change is mechanical and doesn't introduce any functional difference other than rearranging of fields and an added level of indirection in some places. This is to prepare for multiple pools per gcwq. v2: Comment typo fixes as suggested by Namhyung. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org>
2012-07-12workqueue: don't use WQ_HIGHPRI for unbound workqueuesTejun Heo
Unbound wqs aren't concurrency-managed and try to execute work items as soon as possible. This is currently achieved by implicitly setting %WQ_HIGHPRI on all unbound workqueues; however, WQ_HIGHPRI implementation is about to be restructured and this usage won't be valid anymore. Add an explicit chain-wakeup path for unbound workqueues in process_one_work() instead of piggy backing on %WQ_HIGHPRI. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-05-15lockdep: fix oops in processing workqueuePeter Zijlstra
Under memory load, on x86_64, with lockdep enabled, the workqueue's process_one_work() has been seen to oops in __lock_acquire(), barfing on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[]. Because it's permissible to free a work_struct from its callout function, the map used is an onstack copy of the map given in the work_struct: and that copy is made without any locking. Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than "rep movsq" for that structure copy: which might race with a workqueue user's wait_on_work() doing lock_map_acquire() on the source of the copy, putting a pointer into the class_cache[], but only in time for the top half of that pointer to be copied to the destination map. Boom when process_one_work() subsequently does lock_map_acquire() on its onstack copy of the lockdep_map. Fix this, and a similar instance in call_timer_fn(), with a lockdep_copy_map() function which additionally NULLs the class_cache[]. Note: this oops was actually seen on 3.4-next, where flush_work() newly does the racing lock_map_acquire(); but Tejun points out that 3.4 and earlier are already vulnerable to the same through wait_on_work(). * Patch orginally from Peter. Hugh modified it a bit and wrote the description. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Hugh Dickins <hughd@google.com> LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-05-14workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is ↵Tejun Heo
active worker_enter_idle() has WARN_ON_ONCE() which triggers if nr_running isn't zero when every worker is idle. This can trigger spuriously while a cpu is going down due to the way trustee sets %WORKER_ROGUE and zaps nr_running. It first sets %WORKER_ROGUE on all workers without updating nr_running, releases gcwq->lock, schedules, regrabs gcwq->lock and then zaps nr_running. If the last running worker enters idle inbetween, it would see stale nr_running which hasn't been zapped yet and trigger the WARN_ON_ONCE(). Fix it by performing the sanity check iff the trustee is idle. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: stable@vger.kernel.org
2012-04-23workqueue: Catch more locking problems with flush_work()Stephen Boyd
If a workqueue is flushed with flush_work() lockdep checking can be circumvented. For example: static DEFINE_MUTEX(mutex); static void my_work(struct work_struct *w) { mutex_lock(&mutex); mutex_unlock(&mutex); } static DECLARE_WORK(work, my_work); static int __init start_test_module(void) { schedule_work(&work); return 0; } module_init(start_test_module); static void __exit stop_test_module(void) { mutex_lock(&mutex); flush_work(&work); mutex_unlock(&mutex); } module_exit(stop_test_module); would not always print a warning when flush_work() was called. In this trivial example nothing could go wrong since we are guaranteed module_init() and module_exit() don't run concurrently, but if the work item is schedule asynchronously we could have a scenario where the work item is running just at the time flush_work() is called resulting in a classic ABBA locking problem. Add a lockdep hint by acquiring and releasing the work item lockdep_map in flush_work() so that we always catch this potential deadlock scenario. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>