path: root/kernel
Age  Commit message  Author
2011-08-14  sched: Throttle entities exceeding their allowed bandwidth  Paul Turner
With the machinery in place to throttle and unthrottle entities, as well as handle their participation (or lack thereof), we can now enable throttling. There are two points at which we must check whether it is time to set throttled state: put_prev_entity() and enqueue_entity().
- put_prev_entity() is the typical throttle path; we reach it by exceeding our allocated run-time within update_curr()->account_cfs_rq_runtime() and going through a reschedule.
- enqueue_entity() covers the case of a wake-up into an already throttled group. In this case we know the group cannot be on_rq and can throttle immediately.
Checks are added at time of put_prev_entity() and enqueue_entity(). Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184758.091415417@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
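The two check points described in this commit message can be illustrated with a small user-space sketch. It is not the kernel code: the struct, its fields, and the toy_* functions are hypothetical stand-ins, and locking and the real enqueue/dequeue work are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a bandwidth-limited cfs_rq. */
struct toy_cfs_rq {
    bool    runtime_enabled;    /* is this queue bandwidth-constrained? */
    int64_t runtime_remaining;  /* local runtime left in ns; <= 0 means exhausted */
    bool    throttled;
    bool    on_rq;              /* is the group entity queued on its parent? */
};

/* Shared check: throttle a constrained queue that has run out of runtime.
 * Returns true if the queue was newly throttled. */
static bool toy_check_throttle(struct toy_cfs_rq *cfs_rq)
{
    assert(cfs_rq != NULL);
    if (!cfs_rq->runtime_enabled || cfs_rq->throttled)
        return false;
    if (cfs_rq->runtime_remaining > 0)
        return false;
    cfs_rq->throttled = true;
    cfs_rq->on_rq = false;      /* dequeued, so it can no longer be picked */
    return true;
}

/* Typical path: the running entity overran and is put back on a reschedule. */
static bool toy_put_prev(struct toy_cfs_rq *cfs_rq)
{
    return toy_check_throttle(cfs_rq);
}

/* Wake-up path: a task wakes into a group that has no runtime left. */
static bool toy_enqueue(struct toy_cfs_rq *cfs_rq)
{
    if (toy_check_throttle(cfs_rq))
        return true;            /* throttled immediately, never queued */
    if (!cfs_rq->throttled)
        cfs_rq->on_rq = true;
    return false;
}
```

Both entry points share one check, so a queue that is already throttled is never throttled twice and never re-queued by a wake-up.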
2011-08-14  sched: Migrate throttled tasks on HOTPLUG  Paul Turner
Throttled tasks are invisible to cpu-offline since they are not eligible for selection by pick_next_task(). The regular 'escape' path for a thread that is blocked at offline is via ttwu->select_task_rq; however, this will not handle a throttled group since there are no individual thread wakeups on an unthrottle. Resolve this by unthrottling offline cpus so that threads can be migrated. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Prevent buddy interactions with throttled entities  Paul Turner
Buddies allow us to select "on-rq" entities without actually selecting them from a cfs_rq's rb_tree. As a result we must ensure that throttled entities are not falsely nominated as buddies. The fact that entities are dequeued within throttle_entity is not sufficient for clearing buddy status as the nomination may occur after throttling. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.886850167@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Prevent interactions with throttled entities  Paul Turner
From the perspective of load-balance and shares distribution, throttled entities should be invisible. However, both of these operations work on 'active' lists and are not inherently aware of what group hierarchies may be present. In some cases this may be side-stepped (e.g. we could sideload via tg_load_down in load balance) while in others (e.g. update_shares()) it is more difficult to compute without incurring some O(n^2) costs. Instead, track hierarchical throttled state at time of transition. This allows us to easily identify whether an entity belongs to a throttled hierarchy and avoid incorrect interactions with it. Also, when an entity leaves a throttled hierarchy we need to advance its time averaging for shares averaging so that the elapsed throttled time is not considered as part of the cfs_rq's operation. We also use this information to prevent buddy interactions in the wakeup and yield_to() paths. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
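One way to picture "tracking hierarchical throttled state at time of transition" is a per-group counter that is adjusted for a whole subtree whenever a group is throttled or unthrottled. The sketch below assumes exactly that; the struct layout and the toy_* names are hypothetical, and the recursion stands in for whatever tree walk the kernel actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical group node: first-child / next-sibling tree. */
struct toy_group {
    struct toy_group *child, *sibling;
    int throttle_count;   /* throttled groups on the path from the root, self included */
};

/* Apply delta to g and to every group below it. */
static void toy_adjust(struct toy_group *g, int delta)
{
    g->throttle_count += delta;
    assert(g->throttle_count >= 0);
    for (struct toy_group *c = g->child; c; c = c->sibling)
        toy_adjust(c, delta);
}

static void toy_throttle(struct toy_group *g)   { toy_adjust(g, +1); }
static void toy_unthrottle(struct toy_group *g) { toy_adjust(g, -1); }

/* O(1) query: does this group sit anywhere inside a throttled hierarchy? */
static bool toy_throttled_hierarchy(const struct toy_group *g)
{
    return g->throttle_count > 0;
}
```

The cost is paid once per transition, so paths such as load balancing can ask the question without walking up the hierarchy each time.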
2011-08-14  sched: Allow for positional tg_tree walks  Paul Turner
Extend walk_tg_tree to accept a positional argument:

    static int walk_tg_tree_from(struct task_group *from, tg_visitor down, tg_visitor up, void *data)

Existing semantics are preserved; the caller must hold rcu_lock() or a sufficient analogue. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.677889157@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
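To show what a positional walk with separate down and up visitors looks like, here is a recursive user-space sketch. Only the shape of the signature comes from the commit message; the node type, the toy_* names, and the recursive strategy are assumptions for illustration and do not reproduce the kernel implementation or its RCU requirements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical group node: first-child / next-sibling tree. */
struct toy_tg {
    struct toy_tg *child;
    struct toy_tg *sibling;
    int id;
};

typedef int (*toy_visitor)(struct toy_tg *tg, void *data);

/* Visit only the subtree rooted at `from`. `down` runs when a node is
 * entered, `up` when it is left. A non-zero return aborts the walk. */
static int toy_walk_from(struct toy_tg *from, toy_visitor down,
                         toy_visitor up, void *data)
{
    int ret;

    assert(from != NULL);
    ret = down(from, data);
    if (ret)
        return ret;
    for (struct toy_tg *c = from->child; c; c = c->sibling) {
        ret = toy_walk_from(c, down, up, data);
        if (ret)
            return ret;
    }
    return up(from, data);
}

/* Example visitors: record +id on the way down and -id on the way up. */
struct toy_trace { int ids[16]; int n; };

static int toy_record_down(struct toy_tg *tg, void *data)
{
    struct toy_trace *t = data;
    t->ids[t->n++] = tg->id;
    return 0;
}

static int toy_record_up(struct toy_tg *tg, void *data)
{
    struct toy_trace *t = data;
    t->ids[t->n++] = -tg->id;
    return 0;
}
```

Starting the walk at an inner node leaves its parent and the parent's other children untouched, which is the point of passing a starting position.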
2011-08-14  sched: Add support for unthrottling group entities  Paul Turner
At the start of each period we refresh the global bandwidth pool. At this time we must also unthrottle any cfs_rq entities who are now within bandwidth once more (as quota permits). Unthrottled entities have their corresponding cfs_rq->throttled flag cleared and their entities re-enqueued. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Add support for throttling group entities  Paul Turner
Now that consumption is tracked (via update_curr()) we add support to throttle group entities (and their corresponding cfs_rqs) in the case where there is no run-time remaining. Throttled entities are dequeued to prevent scheduling; additionally we mark them as throttled (using cfs_rq->throttled) to prevent them from becoming re-enqueued until they are unthrottled. A list of a task_group's throttled entities is maintained on the cfs_bandwidth structure. Note: While the machinery for throttling is added in this patch, the act of throttling an entity exceeding its bandwidth is deferred until later within the series. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Expire invalid runtime  Paul Turner
Since quota is managed using a global state but consumed on a per-cpu basis we need to ensure that our per-cpu state is appropriately synchronized. Most importantly, runtime that is stale (from a previous period) should not be locally consumable. We take advantage of existing sched_clock synchronization about the jiffy to efficiently detect whether we have (globally) crossed a quota boundary above. One catch is that the direction of spread on sched_clock is undefined; specifically, we don't know whether our local clock is behind or ahead of the one responsible for the current expiration time. Fortunately we can differentiate these by considering whether the global deadline has advanced. If it has not, then we assume our clock to be "fast" and advance our local expiration; otherwise, we know the deadline has truly passed and we expire our local runtime. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
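The decision described above (is the local clock merely "fast", or has the global deadline really moved on?) can be sketched as a comparison of the local and global expiration times. Everything here is a hypothetical model: the struct, the function name, and the TOY_TICK_NS extension step are assumptions, and all locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TICK_NS 1000000ULL  /* assumed step for extending the local deadline */

/* Hypothetical per-cpu view of a group's runtime. */
struct toy_local {
    int64_t  runtime_remaining; /* ns of locally held runtime */
    uint64_t expires;           /* local copy of the expiration time, ns */
};

static void toy_expire(struct toy_local *l, uint64_t local_clock,
                       uint64_t global_expires)
{
    assert(l != NULL);

    /* Local deadline still ahead of our clock: nothing to do. */
    if ((int64_t)(local_clock - l->expires) < 0)
        return;

    /* Nothing left to expire. */
    if (l->runtime_remaining < 0)
        return;

    if ((int64_t)(l->expires - global_expires) >= 0) {
        /* Global deadline has not advanced: assume our clock is "fast"
         * and push the local deadline out a little. */
        l->expires += TOY_TICK_NS;
    } else {
        /* Global deadline moved on: our runtime is from an old period. */
        l->runtime_remaining = 0;
    }
}
```

The signed differences keep the comparisons meaningful even if the 64-bit clock values wrap.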
2011-08-14  sched: Add a timer to handle CFS bandwidth refresh  Paul Turner
This patch adds a per-task_group timer which handles the refresh of the global CFS bandwidth pool. Since the RT pool is using a similar timer there's some small refactoring to share this support. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth  Paul Turner
Account bandwidth usage on the cfs_rq level versus the task_groups to which they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained under cfs_rq->runtime_enabled. cfs_rq's which belong to a bandwidth constrained task_group have their runtime accounted via the update_curr() path, which withdraws bandwidth from the global pool as desired. Updates involving the global pool are currently protected under cfs_bandwidth->lock, local runtime is protected by rq->lock. This patch only assigns and tracks quota, no action is taken in the case that cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
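A rough model of the accounting path: execution time is charged against a local allowance, and when that runs out the queue withdraws more from the group's global pool. The names, the 5 ms slice size, and the refill policy below are assumptions made for the sketch; the two locks mentioned in the commit message are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLICE_NS 5000000LL  /* assumed amount requested per refill */

struct toy_pool     { int64_t runtime; };            /* unclaimed ns this period */
struct toy_local_rq { int64_t runtime_remaining; };  /* ns held by one cpu */

/* Charge delta_exec ns to the local queue; refill from the global pool
 * when the local allowance is used up. Returns true if runtime remains. */
static bool toy_account(struct toy_pool *pool, struct toy_local_rq *rq,
                        int64_t delta_exec)
{
    int64_t want, got;

    assert(pool != NULL && rq != NULL && delta_exec >= 0);
    rq->runtime_remaining -= delta_exec;
    if (rq->runtime_remaining > 0)
        return true;

    /* Ask for enough to cover any deficit plus one slice. */
    want = TOY_SLICE_NS - rq->runtime_remaining;
    got = want < pool->runtime ? want : pool->runtime;
    pool->runtime -= got;
    rq->runtime_remaining += got;
    return rq->runtime_remaining > 0;
}
```

A false return is the condition that later patches in the series act on; in this commit, as the message says, no action is taken yet.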
2011-08-14  sched: Validate CFS quota hierarchies  Paul Turner
Add constraints validation for CFS bandwidth hierarchies. Validate that:

    max(child_bandwidth) <= parent_bandwidth

In a quota limited hierarchy, an unconstrained entity (e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent. This constraint is chosen over sum(child_bandwidth) as the notion of over-commit is valuable within SCHED_OTHER. Some basic code from the RT case is re-factored for reuse. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
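The constraint can be checked with a single top-down pass that carries the effective limit of the parent. The sketch below assumes bandwidth is compared as a fixed-point quota/period ratio; the node type, TOY_RUNTIME_INF, and the toy_* functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RUNTIME_INF UINT64_MAX  /* marks an unconstrained group */

/* Hypothetical group node: first-child / next-sibling tree. */
struct toy_node {
    struct toy_node *child, *sibling;
    uint64_t quota_us, period_us;
};

/* quota/period as a fixed-point ratio in units of 1/2^20. */
static uint64_t toy_ratio(uint64_t quota_us, uint64_t period_us)
{
    assert(period_us > 0);
    if (quota_us == TOY_RUNTIME_INF)
        return TOY_RUNTIME_INF;
    return (quota_us << 20) / period_us;
}

/* Returns 0 if every group fits within its parent's effective limit,
 * -1 otherwise. Siblings are checked individually, not summed. */
static int toy_validate(const struct toy_node *n, uint64_t parent_ratio)
{
    uint64_t r = toy_ratio(n->quota_us, n->period_us);

    if (r == TOY_RUNTIME_INF)
        r = parent_ratio;       /* unconstrained: inherit the parent's limit */
    else if (r > parent_ratio)
        return -1;              /* child asks for more than its parent has */

    for (const struct toy_node *k = n->child; k; k = k->sibling)
        if (toy_validate(k, r))
            return -1;
    return 0;
}
```

Two children that each claim their parent's full bandwidth pass this check, which is the over-commit the message refers to.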
2011-08-14  sched: Introduce primitives to account for CFS bandwidth tracking  Paul Turner
In this patch we introduce the notion of CFS bandwidth, partitioned into globally unassigned bandwidth, and locally claimed bandwidth.
- The global bandwidth is per task_group, it represents a pool of unclaimed bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per-cfs_rq, this represents allotments from the global pool bandwidth assigned to a specific cpu.
Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed to consume over period above.
Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Implement hierarchical task accounting for SCHED_OTHER  Paul Turner
Introduce hierarchical task accounting for the group scheduling case in CFS, as well as promoting the responsibility for maintaining rq->nr_running to the scheduling classes. The primary motivation for this is that with scheduling classes supporting bandwidth throttling it is possible for entities participating in throttled sub-trees to not have root visible changes in rq->nr_running across activate and de-activate operations. This in turn leads to incorrect idle and weight-per-task load balance decisions. This also allows us to make a small fixlet to the fastpath in pick_next_task() under group scheduling. Note: this issue also exists with the existing sched_rt throttling mechanism. This patch does not address that. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
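The idea of hierarchical task accounting can be sketched as a per-queue counter that is updated along the path to the root, with the walk stopping at the first throttled group so that the per-cpu total is left alone. The types and toy_* functions are hypothetical, and the counter name h_nr_running is only borrowed as an illustrative label.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group queue with a link to its parent queue. */
struct toy_rq_node {
    struct toy_rq_node *parent;
    unsigned h_nr_running;      /* runnable tasks in this subtree */
    bool throttled;
};

struct toy_cpu { unsigned nr_running; };  /* root-visible task count */

static void toy_enqueue_task(struct toy_cpu *cpu, struct toy_rq_node *leaf)
{
    for (struct toy_rq_node *n = leaf; n; n = n->parent) {
        n->h_nr_running++;
        if (n->throttled)
            return;             /* change is not visible above a throttled group */
    }
    cpu->nr_running++;
}

static void toy_dequeue_task(struct toy_cpu *cpu, struct toy_rq_node *leaf)
{
    for (struct toy_rq_node *n = leaf; n; n = n->parent) {
        assert(n->h_nr_running > 0);
        n->h_nr_running--;
        if (n->throttled)
            return;
    }
    assert(cpu->nr_running > 0);
    cpu->nr_running--;
}
```

With this split, idle and load-balance decisions that read the per-cpu count are not skewed by tasks that sit below a throttled group.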
2011-08-14  sched/cpupri: Remove cpupri->pri_active  Yong Zhang
Since [sched/cpupri: Remove the vec->lock], member pri_active of struct cpupri is not needed any more, just remove it. Also clean stuff related to it. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110806001004.GA2207@zhy Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched/cpupri: Fix memory barriers for vec updates to always be in order  Steven Rostedt
[ This patch actually compiles. Thanks to Mike Galbraith for pointing that out. I compiled and booted this patch with no issues. ] Re-examining the cpupri patch, I see there's a possible race because the update of the two priorities vec->counts are not protected by a memory barrier. When a RT runqueue is overloaded and wants to push an RT task to another runqueue, it scans the RT priority vectors in a loop from lowest priority to highest. When we queue or dequeue an RT task that changes a runqueue's highest priority task, we update the vectors to show that a runqueue is rated at a different priority. To do this, we first set the new priority mask, and increment the vec->count, and then set the old priority mask by decrementing the vec->count. If we are lowering the runqueue's RT priority rating, it will trigger a RT pull, and we do not care if we miss pushing to this runqueue or not. But if we raise the priority, but the priority is still lower than an RT task that is looking to be pushed, we must make sure that this runqueue is still seen by the push algorithm (the loop). Because the loop reads from lowest to highest, and the new priority is set before the old one is cleared, we will either see the new or old priority set and the vector will be checked. But! Since there's no memory barrier between the updates of the two, the old count may be decremented first before the new count is incremented. This means the loop may see the old count of zero and skip it, and also the new count of zero before it was updated. A possible runqueue that the RT task could move to could be missed. A conditional memory barrier is placed between the vec->count updates and is only called when both updates are done. The smp_wmb() has also been changed to smp_mb__before_atomic_inc/dec(), as they are not needed by archs that already synchronize atomic_inc/dec(). 
The smp_rmb() has been moved to be called at every iteration of the loop so that the race between seeing the two updates is visible by each iteration of the loop, as an arch is free to optimize the reading of memory of the counters in the loop. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1312547269.18583.194.camel@gandalf.stny.rr.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
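The ordering argument in this commit message can be modelled in user space with C11 atomics: raise the new priority's count first, fence, then drop the old one, while the reader scans from lowest to highest with a fence on every iteration. This is only an analogy. The C11 fences stand in for the kernel barriers named in the message, and the array size and toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define TOY_NR_PRI 8   /* assumed number of priority levels for the sketch */

struct toy_cpupri { atomic_int count[TOY_NR_PRI]; };

static void toy_cpupri_init(struct toy_cpupri *cp)
{
    for (int i = 0; i < TOY_NR_PRI; i++)
        atomic_init(&cp->count[i], 0);
}

/* Move one runqueue from old_pri to new_pri; -1 means "none". */
static void toy_set_pri(struct toy_cpupri *cp, int old_pri, int new_pri)
{
    assert(old_pri >= -1 && old_pri < TOY_NR_PRI);
    assert(new_pri >= -1 && new_pri < TOY_NR_PRI);

    if (new_pri >= 0)
        atomic_fetch_add_explicit(&cp->count[new_pri], 1, memory_order_relaxed);
    /* Only needed when both updates happen: the increment must be
     * visible before the decrement. */
    if (new_pri >= 0 && old_pri >= 0)
        atomic_thread_fence(memory_order_seq_cst);
    if (old_pri >= 0)
        atomic_fetch_sub_explicit(&cp->count[old_pri], 1, memory_order_relaxed);
}

/* Scan from lowest priority upward and return the first level below
 * task_pri that has a runqueue, or -1 if there is none. */
static int toy_find_lower(struct toy_cpupri *cp, int task_pri)
{
    assert(task_pri >= 0 && task_pri <= TOY_NR_PRI);
    for (int i = 0; i < task_pri; i++) {
        int c = atomic_load_explicit(&cp->count[i], memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* re-read each iteration */
        if (c)
            return i;
    }
    return -1;
}
```

Because the scan runs upward and the new level is published first, a concurrent reader sees the runqueue at the old level, the new level, or both, but not at neither.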
2011-08-14  sched/cpupri: Remove the vec->lock  Steven Rostedt
The cpupri vec->lock has been showing up as a top contention lately. This is because the RT push/pull logic takes an aggressive approach to migrating RT tasks. The cpupri logic is in place to improve the performance of the push/pull when dealing with machines that have a large number of CPUs. The problem though is that a vec->lock is required, where a vec is a global per RT priority structure. That is, if there are lots of RT tasks at the same priority, every time they are added or removed from the RT queue, this global vec->lock is taken. Now that more kernel threads are becoming RT (RCU boost and threaded interrupts) this is becoming much more of an issue.

There are two variables that are being synced by the vec->lock: the cpupri bitmask and the vec->count. The cpupri bitmask is one bit per priority. If an RT priority vec has a process queued, then the vec->count is > 0 and the cpupri bitmask is set for that RT priority. If the cpupri bitmask gets out of sync with the vec->count, we could end up pushing a low priority RT task to a high priority queue. That RT task that could have run immediately could be queued on a run queue with a higher priority task indefinitely.

The solution is not to use the cpupri bitmask and just look at the vec->count directly when doing a pull. The cpupri bitmask is just a fast way to scan the RT priorities when a pull is made. If, instead of using the bitmask, we just examine all RT priorities and look at the vec->counts, we can eliminate the vec->lock. The scan of RT tasks is to find a run queue that we can push an RT task to, and we do not push to a high priority queue, thus the scan only needs to go from 1 to RT task->prio, and not all 100 RT priorities.

The push algorithm, which does the scan of RT priorities (and scan of the bitmask), only happens when we have an overloaded RT run queue (more than one RT task queued).
The grabbing of the vec->lock happens every time any RT task is queued or dequeued on the run queue for that priority. The slowdown of the scan from not using a bitmask is negligible compared to the speed-up from removing the vec->lock contention and replacing it with an atomic counter and memory barrier.

To prove this, I wrote a patch that times both the loop and the code that grabs the vec->locks. I passed the patches to various people (and companies) to test and show the results. I let everyone choose their own load to test, giving different loads on the system, for various different setups.

Here are some of the results (snipping to a few CPUs to not make this change log huge, but the results were consistent across the entire system):

System 1 (24 CPUs)

Before patch:

CPU:     Name     Count      Max     Min  Average         Total
----     ----     -----      ---     ---  -------         -----
[...]
cpu 20:  loop      3057    1.766   0.061    0.642      1963.170
         vec    6782949   90.469   0.089    0.414   2811760.503
cpu 21:  loop      2617    1.723   0.062    0.641      1679.074
         vec    6782810   90.499   0.089    0.291   1978499.900
cpu 22:  loop      2212    1.863   0.063    0.699      1547.160
         vec    6767244   85.685   0.089    0.435   2949676.898
cpu 23:  loop      2320    2.013   0.062    0.594      1380.265
         vec    6781694   87.923   0.088    0.431   2928538.224

After patch:

cpu 20:  loop      2078    1.579   0.061    0.533      1108.006
         vec    6164555    5.704   0.060    0.143    885185.809
cpu 21:  loop      2268    1.712   0.065    0.575      1305.248
         vec    6153376    5.558   0.060    0.187   1154960.469
cpu 22:  loop      1542    1.639   0.095    0.533       823.249
         vec    6156510    5.720   0.060    0.190   1172727.232
cpu 23:  loop      1650    1.733   0.068    0.545       900.781
         vec    6170784    5.533   0.060    0.167   1034287.953

All times are in microseconds. The 'loop' is the amount of time spent doing the loop across the priorities (before patch uses bitmask). The 'vec' is the amount of time in the code that requires grabbing the vec->lock. The second patch just does not have the vec lock, but encompasses the same code. Amazingly the loop code even went down on average. The vec code went from .5 down to .18, that's more than half the time spent!
Note, more than one test was run, but they all had the same results.

System 2 (64 CPUs)

Before patch:

CPU:     Name     Count      Max     Min  Average         Total
----     ----     -----      ---     ---  -------         -----
cpu 60:  loop         0        0       0        0             0
         vec    5410840  277.954   0.084    0.782   4232895.727
cpu 61:  loop         0        0       0        0             0
         vec    4915648  188.399   0.084    0.570   2803220.301
cpu 62:  loop         0        0       0        0             0
         vec    5356076  276.417   0.085    0.786   4214544.548
cpu 63:  loop         0        0       0        0             0
         vec    4891837  170.531   0.085    0.799   3910948.833

After patch:

cpu 60:  loop         0        0       0        0             0
         vec    5365118    5.080   0.021    0.063    340490.267
cpu 61:  loop         0        0       0        0             0
         vec    4898590    1.757   0.019    0.071    347903.615
cpu 62:  loop         0        0       0        0             0
         vec    5737130    3.067   0.021    0.119    687108.734
cpu 63:  loop         0        0       0        0             0
         vec    4903228    1.822   0.021    0.071    348506.477

The test run during the measurement did not have any (very few, from other CPUs) RT tasks pushing. But this shows that it helped out tremendously with the contention, as the contention happens because the vec->lock is taken only on queuing at an RT priority, and different CPUs that queue tasks at the same priority will have contention.
I tested on my own 4 CPU machine with the following results:

Before patch:

CPU:     Name     Count      Max     Min  Average         Total
----     ----     -----      ---     ---  -------         -----
cpu 0:   loop      2377    1.489   0.158    0.588      1398.395
         vec       4484  770.146   2.301    4.396     19711.755
cpu 1:   loop      2169    1.962   0.160    0.576      1250.110
         vec       4425  152.769   2.297    4.030     17834.228
cpu 2:   loop      2324    1.749   0.155    0.559      1299.799
         vec       4368  779.632   2.325    4.665     20379.268
cpu 3:   loop      2325    1.629   0.157    0.561      1306.113
         vec       4650  408.782   2.394    4.348     20222.577

After patch:

CPU:     Name     Count      Max     Min  Average         Total
----     ----     -----      ---     ---  -------         -----
cpu 0:   loop      2121    1.616   0.113    0.636      1349.189
         vec       4303    1.151   0.225    0.421      1811.966
cpu 1:   loop      2130    1.638   0.178    0.644      1372.927
         vec       4627    1.379   0.235    0.428      1983.648
cpu 2:   loop      2056    1.464   0.165    0.637      1310.141
         vec       4471    1.311   0.217    0.433      1937.927
cpu 3:   loop      2154    1.481   0.162    0.601      1295.083
         vec       4236    1.253   0.230    0.425      1803.008

This was running my migrate.c code that can be found at: http://lwn.net/Articles/425763/

The migrate code does stress the RT tasks a bit. This shows that the loop did increase a little after the patch, but not by much. The vec code dropped dramatically. From 4.3us down to .42us. That's a 10x improvement!

Tested-by: Mike Galbraith <mgalbraith@suse.de> Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com> Tested-by: Matthew Hank Sabins <msabins@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Link: http://lkml.kernel.org/r/1312317372.18583.101.camel@gandalf.stny.rr.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
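As a quick arithmetic check of the "10x" figure for the 4-CPU run, the per-CPU 'vec' averages reported above can be averaged and compared. The helper names are hypothetical; the numbers are copied from the Average column of the results.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double toy_mean(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Per-CPU 'vec' Average values from the 4-CPU run, in microseconds. */
static const double vec_avg_before[] = { 4.396, 4.030, 4.665, 4.348 };
static const double vec_avg_after[]  = { 0.421, 0.428, 0.433, 0.425 };

/* Ratio of the mean 'vec' time before the patch to the mean after it. */
static double toy_vec_speedup(void)
{
    return toy_mean(vec_avg_before, 4) / toy_mean(vec_avg_after, 4);
}
```

The ratio comes out slightly above 10, consistent with the rounded "4.3us down to .42us" in the message.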
2011-08-14  sched: Use pushable_tasks to determine next highest prio  Steven Rostedt
Hillf Danton proposed a patch (see link) that cleaned up the sched_rt code that calculates the priority of the next highest priority task to be used in finding run queues to pull from. His patch removed the calculating of the next prio to just use the current prio when determining if we should examine a run queue to pull from.

The problem with his patch was that it caused more false checks: we check a run queue for pushable tasks if the current priority of that run queue is higher than that of the task about to run on our run queue, but after grabbing the locks and doing the real check, we may find that there is no higher-priority task to pull. Thus the locks were taken with nothing to do.

I added some trace_printks() to record when and how many times the run queue locks were taken to check for pullable tasks, compared to how many times we pulled a task.

With the current method, it was: 3806 locks taken vs 2812 pulled tasks
With Hillf's patch: 6728 locks taken vs 2804 pulled tasks

The number of times locks were taken to pull a task almost doubled, with no improvement in the success rate. But his patch did get me thinking. When we look at the priority of the highest task to consider taking the locks to do a pull, a failure to pull can be one of the following (in order of most likely):

o RT task was pushed off already between the check and taking the lock
o Waiting RT task can not be migrated
o RT task's CPU affinity does not include the target run queue's CPU
o RT task's priority changed between the check and taking the lock

And with Hillf's patch, the thing that caused most of the failures is that the RT task to pull was not at the right priority to pull (not greater than the current RT task priority on the target run queue). Most of the above cases we can't help.
But the current method does not check if the next highest prio RT task can be migrated or not, and if it can not, we still grab the locks to do the test (we don't find out about this fact until after we have the locks). I thought about this case, and realized that the pushable task plist that is maintained only holds RT tasks that can migrate. If we move the calculating of the next highest prio task from the inc/dec_rt_task() functions into the queuing of the pushable tasks, then we only measure the priorities of those tasks that we push, and we get this basically for free. Not only does this patch make the code a little more efficient, it cleans it up and makes it a little simpler. Thanks to Hillf Danton for inspiring me on this patch. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hillf Danton <dhillf@gmail.com> Cc: Gregory Haskins <ghaskins@novell.com> Link: http://lkml.kernel.org/r/BANLkTimQ67180HxCx5vgMqumqw1EkFh3qg@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
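The idea of deriving the "next highest prio" from the pushable list alone can be sketched with a small sorted array standing in for the plist. Tasks that cannot migrate never enter the list, so they never influence the tracked value. The types, limits, and toy_* names are hypothetical; as with kernel RT priorities, a lower number means a higher priority here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_RT_PRIO 100  /* sentinel: no pushable task */
#define TOY_MAX_TASKS   16

struct toy_rt_rq {
    int pushable[TOY_MAX_TASKS];  /* priorities, sorted ascending */
    int nr;
    int next_prio;                /* best priority among pushable tasks */
};

static void toy_rt_rq_init(struct toy_rt_rq *rq)
{
    rq->nr = 0;
    rq->next_prio = TOY_MAX_RT_PRIO;
}

/* Queue a task as pushable only if it may run on more than one CPU. */
static void toy_enqueue_pushable(struct toy_rt_rq *rq, int prio, int nr_cpus_allowed)
{
    int i;

    if (nr_cpus_allowed <= 1)
        return;                   /* pinned task: never considered */
    assert(rq->nr < TOY_MAX_TASKS);
    for (i = rq->nr; i > 0 && rq->pushable[i - 1] > prio; i--)
        rq->pushable[i] = rq->pushable[i - 1];
    rq->pushable[i] = prio;
    rq->nr++;
    rq->next_prio = rq->pushable[0];
}

/* Remove one task of the given priority, if present. */
static void toy_dequeue_pushable(struct toy_rt_rq *rq, int prio)
{
    int i;

    for (i = 0; i < rq->nr && rq->pushable[i] != prio; i++)
        ;
    if (i == rq->nr)
        return;
    for (; i < rq->nr - 1; i++)
        rq->pushable[i] = rq->pushable[i + 1];
    rq->nr--;
    rq->next_prio = rq->nr ? rq->pushable[0] : TOY_MAX_RT_PRIO;
}
```

A remote CPU that reads next_prio before taking locks is then never misled by a high-priority task that could not have been pulled anyway.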
2011-08-14  sched: Balance RT tasks when forked as well  Steven Rostedt
When a new task is woken, the code to balance the RT task is currently skipped in the select_task_rq() call. But it will be pushed if the rq is currently overloaded with RT tasks anyway. The issue is that we already queued the task, and if it does get pushed, it will have to be dequeued and requeued on the new run queue. The advantage with pushing it first is that we avoid this requeuing as we are pushing it off before the task is ever queued. See commit 318e0893ce3f524 ("sched: pre-route RT tasks on wakeup") for more details. The return of select_task_rq() when it is not a wake up has also been changed to return task_cpu() instead of smp_processor_id(). This is more of a sanity check, because currently the only other user of select_task_rq() besides wake ups is an exec, where task_cpu() should also be the same as smp_processor_id(). But if it is used for other purposes, let's keep the task on the same CPU. Why would we want to migrate it to the current CPU? Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hillf Danton <dhillf@gmail.com> Link: http://lkml.kernel.org/r/20110617015919.832743148@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Remove resetting exec_start in put_prev_task_rt()  Hillf Danton
There's no reason to clean the exec_start in put_prev_task_rt() as it is reset when the task gets back to the run queue. This saves us doing a store() in the fast path. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Yong Zhang <yong.zhang0@gmail.com> Link: http://lkml.kernel.org/r/BANLkTimqWD=q6YnSDi-v9y=LMWecgEzEWg@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched, rt: Fix rq->rt.pushable_tasks bug in push_rt_task()  Hillf Danton
Do not call dequeue_pushable_task() when failing to push an eligible task, as it remains pushable, merely not at this particular moment. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Mike Galbraith <mgalbraith@gmx.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Yong Zhang <yong.zhang0@gmail.com> Link: http://lkml.kernel.org/r/1306895385.4791.26.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Remove noop in lowest_flag_domain()  Hillf Danton
Checking for the validity of sd is removed, since it is already checked by the for_each_domain macro. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/BANLkTimT+Tut-3TshCDm-NiLLXrOznibNA@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Remove noop in next_prio()  Hillf Danton
When computing the next priority for a given run-queue, the check for RT priority of the task determined by the pick_next_highest_task_rt() function could be removed, since only RT tasks are returned by the function. Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/BANLkTimxmWiof9s5AvS3v_0X+sMiE=0x5g@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: fix broken SCHED_RESET_ON_FORK handling  Mike Galbraith
Setting child->prio = current->normal_prio _after_ SCHED_RESET_ON_FORK has been handled for an RT parent gives birth to a deranged mutant child with non-RT policy, but RT prio and sched_class. Move PI leakage protection up, always set priorities and weight, and if the child is leaving RT class, reset rt_priority to the proper value. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311779695.8691.2.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Kill WAKEUP_PREEMPT  Yong Zhang
Remove the WAKEUP_PREEMPT feature; disabling it doesn't make any sense and it has outlived its use by a long, long while. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110729082033.GB12106@zhy Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Remove rq->avg_load_per_task  Jan H. Schönherr
Since commit a2d47777 ("sched: fix stale value in average load per task") the variable rq->avg_load_per_task is no longer required. Remove it. Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1312189408-17172-1-git-send-email-schnhrr@cs.tu-berlin.de Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  watchdog: Make the kthreads NUMA affine  Eric Dumazet
Watchdog kthreads can use kthread_create_on_node() to NUMA affine their stack and task_struct. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1312394344-18815-1-git-send-email-dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  perf: provide PMU when initing events  Mark Rutland
Currently, an event's 'pmu' field is set after pmu::event_init() is called. This means that pmu::event_init() must figure out which struct pmu the event was initialised from. This makes it difficult to consolidate common event initialisation code for similar PMUs, and very difficult to implement drivers for PMUs which can have multiple instances (e.g. a USB controller PMU, a GPU PMU, etc). This patch sets the 'pmu' field before initialising the event, allowing event init code to identify the struct pmu instance easily. In the event of failure to initialise an event, the event is destroyed via kfree() without calling perf_event::destroy(), so this shouldn't result in bad behaviour even if the destroy field was set before failure to initialise was noted. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1313062280-19123-1-git-send-email-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  perf: Add PM notifiers to fix CPU hotplug races  Peter Zijlstra
Francis reports that s2r gets him spurious NMIs, this is because the suspend code leaves the boot cpu up and running. Cure this by adding a suspend notifier. The problem is that hotplug and suspend are completely un-serialized and the PM notifiers run before the suspend cpu unplug of all but the boot cpu. This leaves a window where the user can initialize another hotplug operation (either remove or add a cpu) resulting in either one too many or one too few hotplug ops. Thus we cannot use the hotplug code for the suspend case. There's another reason to not use the hotplug code, which is that the hotplug code totally destroys the perf state, we can do better for suspend and simply remove all counters from the PMU so that we can re-instate them on resume. Reported-by: Francis Moreau <francis.moro@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-1cvevybkgmv4s6v5y37t4847@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-12  xfs: remove subdirectories  Christoph Hellwig
Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the annoying subdirectories in the XFS source code. Besides the large number of file renames, the only changes are to the Makefile, a few files including headers with the subdirectory prefix, and the binary sysctl compat code that includes a header under fs/xfs/ from kernel/. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-08-11move RLIMIT_NPROC check from set_user() to do_execve_common()Vasiliy Kulikov
The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC check in set_user() to check for NPROC exceeding via setuid() and similar functions. Before the check there was a possibility to greatly exceed the allowed number of processes by an unprivileged user if the program relied on rlimit only. But the check created a new security threat: many poorly written programs simply don't check the setuid() return code and believe it cannot fail if executed with root privileges. So, the check is removed in this patch because privilege escalations related to such buggy programs are too frequent. The NPROC limit can still be enforced in the common code flow of daemons spawning user processes. Most daemons do fork()+setuid()+execve(). The check introduced in execve() (1) enforces the same limit as in setuid() and (2) doesn't create similar security issues. Neil Brown suggested tracking which specific process has exceeded the limit by setting the PF_NPROC_EXCEEDED process flag. With the change only this process would fail on execve(), and other processes' execve() behaviour is not changed. Solar Designer suggested re-checking whether the NPROC limit is still exceeded at the moment of execve(). If the process was sleeping for days between set*uid() and execve(), and the NPROC counter stepped down under the limit, the deferred execve() failure because the NPROC limit was exceeded days ago would be unexpected. If the limit is not exceeded anymore, we clear the flag on successful calls to execve() and fork(). The flag is also cleared on successful calls to set_user() as the limit was exceeded for the previous user, not the current one. A similar check was introduced in -ow patches (without the process flag). v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user(). Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
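The flag-based flow described in this message can be modelled roughly as follows. This is a simplified user-space sketch, not the kernel code: the `sketch_` names, the `struct sketch_task` type, the flag value, and the exact comparison used for "over the limit" are all assumptions made for illustration. Only the general idea (set_user() never fails but records the condition, execve() re-checks before failing) comes from the message.

```c
#include <errno.h>

/* Hypothetical stand-in for the PF_NPROC_EXCEEDED process flag. */
#define SKETCH_PF_NPROC_EXCEEDED 0x1u

/* Hypothetical minimal task: only the flags word matters here. */
struct sketch_task {
    unsigned int flags;
};

/* Assumed comparison; the real kernel check may differ in detail. */
static int sketch_over_limit(long nproc, long limit)
{
    return nproc > limit;
}

/*
 * Model of set_user(): never fails because of NPROC. It only records
 * whether the new user was over the limit, and clears the flag
 * otherwise, since any earlier excess belonged to the previous user.
 */
static int sketch_set_user(struct sketch_task *t, long new_user_nproc,
                           long limit)
{
    if (sketch_over_limit(new_user_nproc, limit))
        t->flags |= SKETCH_PF_NPROC_EXCEEDED;
    else
        t->flags &= ~SKETCH_PF_NPROC_EXCEEDED;
    return 0;
}

/*
 * Model of the execve() check: fail only if the flag was set earlier
 * AND the limit is still exceeded right now. Otherwise clear the flag.
 */
static int sketch_execve_check(struct sketch_task *t, long nproc_now,
                               long limit)
{
    if ((t->flags & SKETCH_PF_NPROC_EXCEEDED) &&
        sketch_over_limit(nproc_now, limit))
        return -EAGAIN;
    t->flags &= ~SKETCH_PF_NPROC_EXCEEDED;
    return 0;
}
```

In this model a daemon doing fork()+setuid()+execve() sees the failure at execve(), where programs generally do check the return code, rather than at setuid().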
2011-08-11Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf symbols: Check '/tmp/perf-' symbol file ownership perf sched: Usage leftover from trace -> script rename perf sched: Do not delete session object prematurely perf tools: Check $HOME/.perfconfig ownership perf, x86: Add model 45 SandyBridge support perf tools: Add support to install perf python extension perf tools: do not look at ./config for configuration perf tools: Make clean leaves some files perf lock: Dropping unsupported ':r' modifier perf probe: Fix coredump introduced by probe module option jump label: Reduce the cycle count by changing the link order perf report: Use ui__warning in some more places perf python: Add PERF_RECORD_{LOST,READ,SAMPLE} routine tables perf evlist: Introduce 'disable' method trace events: Update version number reference to new 3.x scheme for EVENT_POWER_TRACING_DEPRECATED perf buildid-cache: Zero out buffer of filenames when adding/removing buildid
2011-08-11blktrace: add FLUSH/FUA supportNamhyung Kim
Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or FUA follows WRITE, use the same 'F' flag for both cases and distinguish them by their (relative) position. The end results look like (other flags might be shown also): - WRITE: W - WRITE_FLUSH: FW - WRITE_FUA: WF - WRITE_FLUSH_FUA: FWF Note that we reuse TC_BARRIER due to lack of bit space of act_mask so that the older versions of blktrace tools will report flush requests as barriers from now on. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
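The positional encoding above can be sketched as a small string builder. This is an illustrative sketch only: the `SKETCH_` flag constants and `sketch_fill_rwbs()` are hypothetical names, not the blktrace implementation, and other flags that the real output may show are left out.

```c
#include <stddef.h>

/* Hypothetical request flags, for illustration only. */
enum {
    SKETCH_WRITE = 1 << 0,
    SKETCH_FLUSH = 1 << 1,
    SKETCH_FUA   = 1 << 2,
};

/*
 * Build the flag string so that the same 'F' character means FLUSH
 * when it precedes 'W' and FUA when it follows 'W'.
 * buf must have room for at least 4 bytes ("FWF" plus the NUL).
 */
static void sketch_fill_rwbs(char *buf, unsigned int flags)
{
    size_t i = 0;

    if (flags & SKETCH_FLUSH)
        buf[i++] = 'F';   /* flush happens before the write */
    if (flags & SKETCH_WRITE)
        buf[i++] = 'W';
    if (flags & SKETCH_FUA)
        buf[i++] = 'F';   /* FUA applies after the write */
    buf[i] = '\0';
}
```

The order of the three checks is what carries the meaning; swapping the first and last would make FLUSH and FUA indistinguishable from their positions.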
2011-08-10Tracepoint: Dissociate from module mutexMathieu Desnoyers
Copy the information needed from struct module into a local module list held within tracepoint.c from within the module coming/going notifier. This vastly simplifies locking of tracepoint registration / unregistration, because we don't have to take the module mutex to register and unregister tracepoints anymore. Steven Rostedt ran into dependency problems related to modules mutex vs kprobes mutex vs ftrace mutex vs tracepoint mutex that seem to be hard to fix without removing this dependency between tracepoint and module mutex. (note: it should be investigated whether kprobes could benefit from being dissociated from the modules mutex too.) This also fixes module handling of tracepoint list iterators, because it was expecting the list to be sorted by pointer address. Given we have control of our own list now, it's OK to sort this list which has tracepoints as its only purpose. The reason why this sorting is required is to handle the fact that seq files (and any read() operation from user-space) cannot hold the tracepoint mutex across multiple calls, so list entries may vanish between calls. With sorting, the tracepoint iterator becomes usable even if the list doesn't contain the exact item pointed to by the iterator anymore. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Jason Baron <jbaron@redhat.com> CC: Ingo Molnar <mingo@elte.hu> CC: Lai Jiangshan <laijs@cn.fujitsu.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Thomas Gleixner <tglx@linutronix.de> CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20110810191839.GC8525@Krystal Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
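Why sorting makes the iterator robust can be shown with a small model. The sketch below assumes a plain ascending array of addresses instead of the kernel's list, and `sketch_iter_next()` is a hypothetical helper, not a function from tracepoint.c. The iterator remembers only the last address it returned and resumes at the first entry greater than that address, so it does not matter whether the remembered entry still exists.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Resume iteration over a list sorted by ascending address.
 * 'last' is the address returned by the previous call (0 to start).
 * Returns 1 and stores the next address in *out, or 0 at the end.
 * Works even if the entry equal to 'last' was removed in between,
 * because the search depends only on ordering, not on identity.
 */
static int sketch_iter_next(const uintptr_t *sorted, size_t n,
                            uintptr_t last, uintptr_t *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (sorted[i] > last) {
            *out = sorted[i];
            return 1;
        }
    }
    return 0;
}
```

With an unsorted list, an iterator whose current item vanished between two read() calls would have no way to find its position again; with a sorted one, "first entry after the last one seen" is always well defined.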
2011-08-10tracing: Clean up tb_fmt to not give faulty compile warningSteven Rostedt
gcc incorrectly states that the variable "fmt" is uninitialized when CC_OPTIMIZE_FOR_SIZE is set. Instead of just blindly setting fmt to NULL, the code is cleaned up a little to be a bit easier for humans to follow, as well as for gcc to know the variables are initialized. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-10alarmtimers: Rework RTC device selection using class interfaceJohn Stultz
This allows cleaner detection of the RTC device being registered, rather than probing any time someone calls alarmtimer_get_rtcdev. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Add try_to_cancel functionalityJohn Stultz
There are a number of edge cases when cancelling an alarm, so to be sure we accurately do so, introduce try_to_cancel, which returns proper failure errors if it cannot. Also modify cancel to spin until the alarm is properly disabled. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Add more refined alarm state trackingJohn Stultz
In order to allow for functionality like try_to_cancel, add more refined state tracking (similar to hrtimers). CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Remove period from alarm structureJohn Stultz
Now that periodic alarmtimers are managed by the handler function, remove the period value from the alarm structure and let the handlers manage the interval on their own. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Remove interval cap limit hackJohn Stultz
Now that the alarmtimers code has been refactored, the interval cap limit can be removed. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Add alarm_forward functionalityJohn Stultz
In order to avoid wasting time expiring and re-adding very high freq periodic alarmtimers, introduce alarm_forward() which is similar to hrtimer_forward and moves the timer to the next future expiration time and returns the number of overruns. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
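The forwarding step can be sketched as below. This is a user-space model under stated assumptions: times are plain signed nanosecond counts, `struct sketch_alarm` and `sketch_alarm_forward()` are hypothetical names, and the arithmetic follows the general hrtimer_forward idea named in the message rather than reproducing the kernel function.

```c
#include <stdint.h>

/* Hypothetical alarm holding only its absolute expiry, in ns. */
struct sketch_alarm {
    int64_t expires_ns;
};

/*
 * Move the alarm forward by whole intervals until it expires strictly
 * after 'now_ns', and return how many intervals were skipped
 * (the overrun count). Returns 0 if the alarm is still in the future
 * or the interval is not positive.
 */
static uint64_t sketch_alarm_forward(struct sketch_alarm *a,
                                     int64_t now_ns, int64_t interval_ns)
{
    int64_t delta = now_ns - a->expires_ns;
    uint64_t overrun;

    if (interval_ns <= 0 || delta < 0)
        return 0;

    /* One division replaces many expire/re-add cycles. */
    overrun = (uint64_t)(delta / interval_ns) + 1;
    a->expires_ns += (int64_t)overrun * interval_ns;
    return overrun;
}
```

For a very short interval and a long delay, the single division does the work that would otherwise take one expire and re-add cycle per missed period.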
2011-08-10alarmtimers: Push rearming periodic timers down into alarmtimer handlerJohn Stultz
This patch pushes the periodic alarmtimer re-arming down into the alarmtimer handler, mimicking how hrtimers handle this. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Change alarmtimer functions to return alarmtimer_restart valuesJohn Stultz
In order to properly fix the denial of service issue with high freq periodic alarm timers, we need to push the re-arming logic into the alarm timer handler, much as the hrtimer code does. This patch introduces alarmtimer_restart enum and changes the alarmtimer handler declarations to use it as a return value. Further, to ease following changes, it extends the alarmtimer handler functions to also take the time at expiration. No logic is yet modified. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Avoid possible denial of service with high freq periodic timersJohn Stultz
It's possible to jam up the alarm timers by setting very small interval timers, which will cause the alarmtimer subsystem to spend all of its time firing and restarting timers. This can effectively lock up a box. A deeper fix is needed, closely mimicking the hrtimer code, but for now just cap the interval to 100us to avoid userland hanging the system. CC: Thomas Gleixner <tglx@linutronix.de> CC: stable@kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Memset itimerspec passed into alarm_timer_getJohn Stultz
Following common_timer_get, zero out the itimerspec passed in. CC: Thomas Gleixner <tglx@linutronix.de> CC: stable@kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Avoid possible null pointer traversalJohn Stultz
We don't check if old_setting is non null before assigning it, so correct this. CC: Thomas Gleixner <tglx@linutronix.de> CC: stable@kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-09cap_syslog: don't use WARN_ONCE for CAP_SYS_ADMIN deprecation warningJonathan Nieder
syslog-ng versions before 3.3.0beta1 (2011-05-12) assume that CAP_SYS_ADMIN is sufficient to access syslog, so ever since CAP_SYSLOG was introduced (2010-11-25) they have triggered a warning. Commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now") improved matters a little by making syslog-ng work again, just keeping the WARN_ONCE(). But still, this is a warning that writes a stack trace we don't care about to syslog, sets a taint flag, and alarms sysadmins when nothing worse has happened than use of an old userspace with a recent kernel. Convert the WARN_ONCE to a printk_once to avoid that while continuing to give userspace developers a hint that this is an unwanted backward-compatibility feature and won't be around forever. Reported-by: Ralf Hildebrandt <ralf.hildebrandt@charite.de> Reported-by: Niels <zorglub_olsen@hotmail.com> Reported-by: Paweł Sikora <pluto@agmk.net> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Liked-by: Gergely Nagy <algernon@madhouse-project.org> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-09lockdep: Fix wrong assumption in match_held_lockPeter Zijlstra
match_held_lock() was assuming it was being called on a lock class that had already seen usage. This condition was true for bug-free code using lockdep_assert_held(), since you're in fact holding the lock when calling it. However the assumption fails the moment you assume the assertion can fail, which is the whole point of having the assertion in the first place. Anyway, now that there are more lockdep_is_held() users, notably __rcu_dereference_check(), it's much easier to trigger this since we test for a number of locks and we only need to hold any one of them to be good. Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1312547787.28695.2.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-05jump label: Reduce the cycle count by changing the link orderJason Baron
In the course of testing jump labels for use with the CFS bandwidth controller, Paul Turner, discovered that using jump labels reduced the branch count and the instruction count, but did not reduce the cycle count or wall time. I noticed that having the jump_label.o included in the kernel but not used in any way still caused this increase in cycle count and wall time. Thus, I moved jump_label.o in the kernel/Makefile, thus changing the link order, and presumably moving it out of hot icache areas. This brought down the cycle count/time as expected. In addition to Paul's testing, I've tested the patch using a single 'static_branch()' in the getppid() path, and basically running tight loops of calls to getppid(). Here are my results for the branch disabled case: With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled: Performance counter stats for 'bash -c /tmp/getppid;true' (50 runs): 3,969,510,217 instructions # 0.864 IPC ( +-0.000% ) 4,592,334,954 cycles ( +- 0.046% ) 751,634,470 branches ( +- 0.000% ) 1.722635797 seconds time elapsed ( +- 0.046% ) Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled: Performance counter stats for 'bash -c /tmp/getppid;true' (50 runs): 4,009,611,846 instructions # 0.867 IPC ( +-0.000% ) 4,622,210,580 cycles ( +- 0.012% ) 771,662,904 branches ( +- 0.000% ) 1.734341454 seconds time elapsed ( +- 0.022% ) Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: rth@redhat.com Cc: a.p.zijlstra@chello.nl Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20110805204040.GG2522@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Paul Turner <pjt@google.com>
2011-08-05Merge branch 'linus' into perf/urgentIngo Molnar
Merge reason: Include most of the merge window trees, to do fixes on top. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-04Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: slab, lockdep: Annotate the locks before using them lockdep: Clear whole lockdep_map on initialization slab, lockdep: Annotate slab -> rcu -> debug_object -> slab lockdep: Fix up warning lockdep: Fix trace_hardirqs_on_caller() futex: Fix regression with read only mappings