summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2016-07-14workqueue: Convert to state machine callbacksThomas Gleixner
Get rid of the prio ordering of the separate notifiers and use a proper state callback pair. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160713153335.197083890@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14perf/core: Convert to hotplug state machineThomas Gleixner
Actually a nice symmetric startup/teardown pair which fits properly into the state machine concept. In the long run we should be able to invoke the startup callback for the boot CPU via the state machine and get rid of the init function which invokes it on the boot CPU. Note: This comes actually before the perf hardware callbacks. In the notifier model the hardware callbacks have a higher priority than the core callback. But that's solely for CPU offline so that hardware migration of events happens before the core is notified about the outgoing CPU. With the symetric state array model we have the following ordering: UP: core -> hardware DOWN: hardware -> core Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Reviewed-by: Sebastian Siewior <bigeasy@linutronix.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160713153333.587514098@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14cpu/hotplug: Handle early registration gracefullyThomas Gleixner
We switched the hotplug machinery to smpboot threads. Early registration of hotplug callbacks, i.e. from do_pre_smp_initcalls(), happens before the threads are initialized. Instead of moving the thread init, we simply handle it in the hotplug code itself and invoke the function directly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160713153332.896450738@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-13Merge branch 'core/urgent' into smp/hotplug to pick up dependenciesThomas Gleixner
2016-07-13Merge branch 'core/rcu' into smp/hotplug to pick up dependenciesThomas Gleixner
2016-07-13Merge branch 'timers/core' into smp/hotplug to pick up dependenciesThomas Gleixner
2016-07-13cpu/hotplug: Keep enough storage space if SMP=n to avoid array out of bounds ↵Thomas Gleixner
scribble Xiaolong Ye reported lock debug warnings triggered by the following commit: 8de4a0066106 ("perf/x86: Convert the core to the hotplug state machine") The bug is the following: the cpuhp_bp_states[] array is cut short when CONFIG_SMP=n, but the dynamically registered callbacks are stored nevertheless and happily scribble outside of the array bounds... We need to store them in case that the state is unregistered so we can invoke the teardown function. That's independent of CONFIG_SMP. Make sure the array is large enough. Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Adam Borowski <kilobyte@angband.pl> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: lkp@01.org Cc: stable@vger.kernel.org Cc: tipbuild@zytor.com Fixes: cff7d378d3fd "cpu/hotplug: Convert to a state machine for the control processor" Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607122144560.4083@nanos Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Two load-balancing fixes for cgroups-intense workloads" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion sched/fair: Fix effective_load() to consistently use smoothed load
2016-07-07Merge branch 'timers/fast-wheel' into timers/coreIngo Molnar
2016-07-07timers: Implement optimization for same expiry time in mod_timer()Anna-Maria Gleixner
The existing optimization for same expiry time in mod_timer() checks whether the timer expiry time is the same as the new requested expiry time. In the old timer wheel implementation this does not take the slack batching into account, neither does the new implementation evaluate whether the new expiry time will requeue the timer to the same bucket. To optimize that, we can calculate the resulting bucket and check if the new expiry time is different from the current expiry time. This calculation happens outside the base lock held region. If the resulting bucket is the same we can avoid taking the base lock and requeueing the timer. If the timer needs to be requeued then we have to check under the base lock whether the base time has changed between the lockless calculation and taking the lock. If it has changed we need to recalculate under the lock. This optimization takes effect for timers which are enqueued into the less granular wheel levels (1 and above). With a simple test case the functionality has been verified: Before After Match: 5.5% 86.6% Requeue: 94.5% 13.4% Recalc: <0.01% In the non optimized case the timer is requeued in 94.5% of the cases. With the index optimization in place the requeue rate drops to 13.4%. The case where the lockless index calculation has to be redone is less than 0.01%. With a real world test case (networking) we observed the following changes: Before After Match: 97.8% 99.7% Requeue: 2.2% 0.3% Recalc: <0.001% That means two percent fewer lock/requeue/unlock operations done in one of the hot path use cases of timers. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094342.778527749@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers: Split out index calculationAnna-Maria Gleixner
For further optimizations we need to seperate index calculation from queueing. No functional change. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094342.691159619@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers: Only wake softirq if necessaryThomas Gleixner
With the wheel forwading in place and with the HZ=1000 4ms folding we can avoid running the softirq at all. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094342.607650550@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers: Forward the wheel clock whenever possibleThomas Gleixner
The wheel clock is stale when a CPU goes into a long idle sleep. This has the side effect that timers which are queued end up in the outer wheel levels. That results in coarser granularity. To solve this, we keep track of the idle state and forward the wheel clock whenever possible. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094342.512039360@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers/nohz: Remove pointless tick_nohz_kick_tick() functionThomas Gleixner
This was a failed attempt to optimize the timer expiry in idle, which was disabled and never revisited. Remove the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094342.431073782@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers: Optimize collect_expired_timers() for NOHZAnna-Maria Gleixner
After a NOHZ idle sleep the timer wheel must be forwarded to current jiffies. There might be expired timers so the current code loops and checks the expired buckets for timers. This can take quite some time for long NOHZ idle periods. The pending bitmask in the timer base allows us to do a quick search for the next expiring timer and therefore a fast forward of the base time which prevents pointless long lasting loops. For a 3 seconds idle sleep this reduces the catchup time from ~1ms to 5us. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094342.351296290@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers: Move __run_timers() functionAnna-Maria Gleixner
Move __run_timers() below __next_timer_interrupt() and next_pending_bucket() in preparation for __run_timers() NOHZ optimization. No functional change. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094342.271872665@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers: Remove set_timer_slack() leftoversThomas Gleixner
We now have implicit batching in the timer wheel. The slack API is no longer used, so remove it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrew F. Davis <afd@ti.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: David S. Miller <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jaehoon Chung <jh80.chung@samsung.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: John Stultz <john.stultz@linaro.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathias Nyman <mathias.nyman@intel.com> Cc: Pali Rohár <pali.rohar@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sebastian Reichel <sre@kernel.org> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: linux-usb@vger.kernel.org Cc: netdev@vger.kernel.org Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers: Switch to a non-cascading wheelThomas Gleixner
The current timer wheel has some drawbacks: 1) Cascading: Cascading can be an unbound operation and is completely pointless in most cases because the vast majority of the timer wheel timers are canceled or rearmed before expiration. (They are used as timeout safeguards, not as real timers to measure time.) 2) No fast lookup of the next expiring timer: In NOHZ scenarios the first timer soft interrupt after a long NOHZ period must fast forward the base time to the current value of jiffies. As we have no way to find the next expiring timer fast, the code loops linearly and increments the base time one by one and checks for expired timers in each step. This causes unbound overhead spikes exactly in the moment when we should wake up as fast as possible. After a thorough analysis of real world data gathered on laptops, workstations, webservers and other machines (thanks Chris!) I came to the conclusion that the current 'classic' timer wheel implementation can be modified to address the above issues. The vast majority of timer wheel timers is canceled or rearmed before expiry. Most of them are timeouts for networking and other I/O tasks. The nature of timeouts is to catch the exception from normal operation (TCP ack timed out, disk does not respond, etc.). For these kinds of timeouts the accuracy of the timeout is not really a concern. Timeouts are very often approximate worst-case values and in case the timeout fires, we already waited for a long time and performance is down the drain already. The few timers which actually expire can be split into two categories: 1) Short expiry times which expect halfways accurate expiry 2) Long term expiry times are inaccurate today already due to the batching which is done for NOHZ automatically and also via the set_timer_slack() API. So for long term expiry timers we can avoid the cascading property and just leave them in the less granular outer wheels until expiry or cancelation. Timers which are armed with a timeout larger than the wheel capacity are no longer cascaded. We expire them with the longest possible timeout (6+ days). We have not observed such timeouts in our data collection, but at least we handle them, applying the rule of the least surprise. To avoid extending the wheel levels for HZ=1000 so we can accomodate the longest observed timeouts (5 days in the network conntrack code) we reduce the first level granularity on HZ=1000 to 4ms, which effectively is the same as the HZ=250 behaviour. From our data analysis there is nothing which relies on that 1ms granularity and as a side effect we get better batching and timer locality for the networking code as well. Contrary to the classic wheel the granularity of the next wheel is not the capacity of the first wheel. The granularities of the wheels are in the currently chosen setting 8 times the granularity of the previous wheel. So for HZ=250 we end up with the following granularity levels: Level Offset Granularity Range 0 0 4 ms 0 ms - 252 ms 1 64 32 ms 256 ms - 2044 ms (256ms - ~2s) 2 128 256 ms 2048 ms - 16380 ms (~2s - ~16s) 3 192 2048 ms (~2s) 16384 ms - 131068 ms (~16s - ~2m) 4 256 16384 ms (~16s) 131072 ms - 1048572 ms (~2m - ~17m) 5 320 131072 ms (~2m) 1048576 ms - 8388604 ms (~17m - ~2h) 6 384 1048576 ms (~17m) 8388608 ms - 67108863 ms (~2h - ~18h) 7 448 8388608 ms (~2h) 67108864 ms - 536870911 ms (~18h - ~6d) That's a worst case inaccuracy of 12.5% for the timers which are queued at the beginning of a level. So the new wheel concept addresses the old issues: 1) Cascading is avoided completely 2) By keeping the timers in the bucket until expiry/cancelation we can track the buckets which have timers enqueued in a bucket bitmap and therefore can look up the next expiring timer very fast and O(1). A further benefit of the concept is that the slack calculation which is done on every timer start is no longer necessary because the granularity levels provide natural batching already. Our extensive testing with various loads did not show any performance degradation vs. the current wheel implementation. This patch does not address the 'fast lookup' issue as we wanted to make sure that there is no regression introduced by the wheel redesign. The optimizations are in follow up patches. This patch contains fixes from Anna-Maria Gleixner and Richard Cochran. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094342.108621834@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers: Give a few structs and members proper namesThomas Gleixner
Some of the names in the internal implementation of the timer code are not longer correct and others are simply too long to type. Clean it up before we switch the wheel implementation over to the new scheme. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094341.948752516@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07signals: Use hrtimer for sigtimedwait()Thomas Gleixner
We've converted most timeout related syscalls to hrtimers, but sigtimedwait() did not get this treatment. Convert it so we get a reasonable accuracy and remove the user space exposure to the timer wheel properties. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Cyril Hrubis <chrubis@suse.cz> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094341.787164909@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers: Remove the deprecated mod_timer_pinned() APIThomas Gleixner
We switched all users to initialize the timers as pinned and call mod_timer(). Remove the now unused timer API function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094341.706205231@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07timers: Make 'pinned' a timer propertyThomas Gleixner
We want to move the timer migration logic from a 'push' to a 'pull' model. Under the current 'push' model pinned timers are handled via a runtime API variant: mod_timer_pinned(). The 'pull' model requires us to store the pinned attribute of a timer in the timer_list structure itself, as a new TIMER_PINNED bit in timer->flags. This flag must be set at initialization time and the timer APIs recognize the flag. This patch: - Implements the new flag and associated new-style initialization methods - makes mod_timer() recognize new-style pinned timers, - and adds some migration helper facility to allow step by step conversion of old-style to new-style pinned timers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160704094341.049338558@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07perf/core: Fix pmu::filter_match for SW-led groupsMark Rutland
The following commit: 66eb579e66ec ("perf: allow for PMU-specific event filtering") added the pmu::filter_match() callback. This was intended to avoid HW constraints on events from resulting in extremely pessimistic scheduling. However, pmu::filter_match() is only called for the leader of each event group. When the leader is a SW event, we do not filter the groups, and may fail at pmu::add() time, and when this happens we'll give up on scheduling any event groups later in the list until they are rotated ahead of the failing group. This can result in extremely sub-optimal event scheduling behaviour, e.g. if running the following on a big.LITTLE platform: $ taskset -c 0 ./perf stat \ -e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \ -e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \ ls <not counted> context-switches (0.00%) <not counted> armv8_cortex_a57/config=0x11/ (0.00%) 24 context-switches (37.36%) 57589154 armv8_cortex_a53/config=0x11/ (37.36%) Here the 'a53' event group was always eligible to be scheduled, but the 'a57' group never eligible to be scheduled, as the task was always affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57' group was eligible, but the HW event failed at pmu::add() time, resulting in ctx_flexible_sched_in giving up on scheduling further groups with HW events. One way of avoiding this is to check pmu::filter_match() on siblings as well as the group leader. If any of these fail their pmu::filter_match() call, we must skip the entire group before attempting to add any events. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Fixes: 66eb579e66ec ("perf: allow for PMU-specific event filtering") Link: http://lkml.kernel.org/r/1465917041-15339-1-git-send-email-mark.rutland@arm.com [ Small readability edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-05tick/broadcast-hrtimer: Set name of the ce_broadcast_hrtimerJisheng Zhang
This is to avoid the "null" name when we either ~ # cat /sys/devices/system/clockevents/broadcast/current_device (null) or ~ # cat /proc/timer_list ... Tick Device: mode: 1 Broadcast device Clock Event Device: (null) ... Signed-off-by: Jisheng Zhang <jszhang@marvell.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1467709071-3667-1-git-send-email-jszhang@marvell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-01timers/nohz: Capitalize 'CPU' consistentlyIngo Molnar
While reviewing another patch I noticed that kernel/time/tick-sched.c had a charmingly (confusingly, annoyingly) rich set of variants for spelling 'CPU': cpu cpus CPU CPUs per CPU per-CPU per cpu ... sometimes these were mixed even within the same comment block! Compress these variants down to a single consistent set of: CPU CPUs per-CPU Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-01timers/nohz: Fix several typosWei Jiangang
Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: fenghua.yu@intel.com Link: http://lkml.kernel.org/r/1467175910-2966-2-git-send-email-weijg.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-30Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU changes from Paul E. McKenney: - Documentation updates. Just some simple changes, no design-level additions. - Miscellaneous fixes. - Torture-test updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-29Merge branch 'stable-4.7' of git://git.infradead.org/users/pcmoore/auditLinus Torvalds
Pull audit fixes from Paul Moore: "Two small patches to fix audit problems in 4.7-rcX: the first fixes a potential kref leak, the second removes some header file noise. The first is an important bug fix that really should go in before 4.7 is released, the second is not critical, but falls into the very-nice- to-have category so I'm including in the pull request. Both patches are straightforward, self-contained, and pass our testsuite without problem" * 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit: audit: move audit_get_tty to reduce scope and kabi changes audit: move calcs after alloc and check when logging set loginuid
2016-06-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "I've been traveling so this accumulates more than week or so of bug fixing. It perhaps looks a little worse than it really is. 1) Fix deadlock in ath10k driver, from Ben Greear. 2) Increase scan timeout in iwlwifi, from Luca Coelho. 3) Unbreak STP by properly reinjecting STP packets back into the stack. Regression fix from Ido Schimmel. 4) Mediatek driver fixes (missing malloc failure checks, leaking of scratch memory, wrong indexing when mapping TX buffers, etc.) from John Crispin. 5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic Sowa. 6) Fix hashing of flows in UDP in the ruseport case, from Xuemin Su. 7) Fix netlink notifications in ovs for tunnels, delete link messages are never emitted because of how the device registry state is handled. From Nicolas Dichtel. 8) Conntrack module leaks kmemcache on unload, from Florian Westphal. 9) Prevent endless jump loops in nft rules, from Liping Zhang and Pablo Neira Ayuso. 10) Not early enough spinlock initialization in mlx4, from Eric Dumazet. 11) Bind refcount leak in act_ipt, from Cong WANG. 12) Missing RCU locking in HTB scheduler, from Florian Westphal. 13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU barrier, using heap for SG and IV, and erroneous use of async flag when allocating AEAD conext.) 14) RCU handling fix in TIPC, from Ying Xue. 15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in SIT driver, from Simon Horman. 16) Socket timer deadlock fix in TIPC from Jon Paul Maloy. 17) Fix potential deadlock in team enslave, from Ido Schimmel. 18) Memory leak in KCM procfs handling, from Jiri Slaby. 19) ESN generation fix in ipv4 ESP, from Herbert Xu. 20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong WANG. 21) Use after free in netem, from Eric Dumazet. 22) Uninitialized last assert time in multicast router code, from Tom Goff. 23) Skip raw sockets in sock_diag destruction broadcast, from Willem de Bruijn. 24) Fix link status reporting in thunderx, from Sunil Goutham. 25) Limit resegmentation of retransmit queue so that we do not retransmit too large GSO frames. From Eric Dumazet. 26) Delay bpf program release after grace period, from Daniel Borkmann" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits) openvswitch: fix conntrack netlink event delivery qed: Protect the doorbell BAR with the write barriers. neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit() e1000e: keep VLAN interfaces functional after rxvlan off cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag() bpf, perf: delay release of BPF prog after grace period net: bridge: fix vlan stats continue counter tcp: do not send too big packets at retransmit time ibmvnic: fix to use list_for_each_safe() when delete items net: thunderx: Fix TL4 configuration for secondary Qsets net: thunderx: Fix link status reporting net/mlx5e: Reorganize ethtool statistics net/mlx5e: Fix number of PFC counters reported to ethtool net/mlx5e: Prevent adding the same vxlan port net/mlx5e: Check for BlueFlame capability before allocating SQ uar net/mlx5e: Change enum to better reflect usage net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices net/mlx5: Update command strings net: marvell: Add separate config ANEG function for Marvell 88E1111 ...
2016-06-29Merge branch 'for-4.7-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Three fix patches. Two are for cgroup / css init failure path. The last one makes css_set_lock irq-safe as the deadline scheduler ends up calling put_css_set() from irq context" * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Disable IRQs while holding css_set_lock cgroup: set css->id to -1 during init cgroup: remove redundant cleanup in css_create
2016-06-29bpf, perf: delay release of BPF prog after grace periodDaniel Borkmann
Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved destruction of BPF program from free_event_rcu() callback to __free_event(), which is problematic if used with tail calls: if prog A is attached as trace event directly, but at the same time present in a tail call map used by another trace event program elsewhere, then we need to delay destruction via RCU grace period since it can still be in use by the program doing the tail call (the prog first needs to be dropped from the tail call map, then trace event with prog A attached destroyed, so we get immediate destruction). Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Jann Horn <jann@thejh.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-28audit: move audit_get_tty to reduce scope and kabi changesRichard Guy Briggs
The only users of audit_get_tty and audit_put_tty are internal to audit, so move it out of include/linux/audit.h to kernel.h and create a proper function rather than inlining it. This also reduces kABI changes. Suggested-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: line wrapped description] Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-06-28audit: move calcs after alloc and check when logging set loginuidRichard Guy Briggs
Move the calculations of values after the allocation in case the allocation fails. This avoids wasting effort in the rare case that it fails, but more importantly saves us extra logic to release the tty ref. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-06-27sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusionPeter Zijlstra
Commit: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities") did something non-obvious but also did it buggy yet latent. The problem was exposed for real by a later commit in the v4.7 merge window: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels") ... after which tg->load_avg and cfs_rq->load.weight had different units (10 bit fixed point and 20 bit fixed point resp.). Add a comment to explain the use of cfs_rq->load.weight over the 'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct for the difference in unit. Since this is (now, as per a previous commit) the only user of calc_tg_weight(), collapse it. The effects of this bug should be randomly inconsistent SMP-balancing of cgroups workloads. Reported-by: Jirka Hladky <jhladky@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels") Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Fix effective_load() to consistently use smoothed loadPeter Zijlstra
Starting with the following commit: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities") calc_tg_weight() doesn't compute the right value as expected by effective_load(). The difference is in the 'correction' term. In order to ensure \Sum rw_j >= rw_i we cannot use tg->load_avg directly, since that might be lagging a correction on the current cfs_rq->avg.load_avg value. Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib + cfs_rq->avg.load_avg. Now, per the referenced commit, calc_tg_weight() doesn't use cfs_rq->avg.load_avg, as is later used in @w, but uses cfs_rq->load.weight instead. So stop using calc_tg_weight() and do it explicitly. The effects of this bug are wake_affine() making randomly poor choices in cgroup-intense workloads. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v4.3+ Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-25Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "A couple of scheduler fixes: - force watchdog reset while processing sysrq-w - fix a deadlock when enabling trace events in the scheduler - fixes to the throttled next buddy logic - fixes for the average accounting (missing serialization and underflow handling) - allow kernel threads for fallback to online but not active cpus" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Allow kthreads to fall back to online && !active cpus sched/fair: Do not announce throttled next buddy in dequeue_task_fair() sched/fair: Initialize throttle_count for new task-groups lazily sched/fair: Fix cfs_rq avg tracking underflow kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w sched/debug: Fix deadlock when enabling sched events sched/fair: Fix post_init_entity_util_avg() serialization
2016-06-25Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Thomas Gleixner: "A single fix to address a race in the static key logic" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/static_key: Fix concurrent static_key_slow_inc()
2016-06-25Fix build break in fork.c when THREAD_SIZE < PAGE_SIZEMichael Ellerman
Commit b235beea9e99 ("Clarify naming of thread info/stack allocators") breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE: kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack' kernel/fork.c:355:8: error: assignment from incompatible pointer type stack = alloc_thread_stack_node(tsk, node); ^ Fix it by renaming free_stack() to free_thread_stack(), and updating the return type of alloc_thread_stack_node(). Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "Two weeks worth of fixes here" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits) init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64 autofs: don't get stuck in a loop if vfs_write() returns an error mm/page_owner: avoid null pointer dereference tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences" fs/nilfs2: fix potential underflow in call to crc32_le oom, suspend: fix oom_reaper vs. oom_killer_disable race ocfs2: disable BUG assertions in reading blocks mm, compaction: abort free scanner if split fails mm: prevent KASAN false positives in kmemleak mm/hugetlb: clear compound_mapcount when freeing gigantic pages mm/swap.c: flush lru pvecs on compound page arrival memcg: css_alloc should return an ERR_PTR value on error memcg: mem_cgroup_migrate() may be called with irq disabled hugetlb: fix nr_pmds accounting with shared page tables Revert "mm: disable fault around on emulated access bit architecture" Revert "mm: make faultaround produce old ptes" mailmap: add Boris Brezillon's email mailmap: add Antoine Tenart's email mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask mm: mempool: kasan: don't poot mempool objects in quarantine ...
2016-06-24oom, suspend: fix oom_reaper vs. oom_killer_disable raceMichal Hocko
Tetsuo has reported the following potential oom_killer_disable vs. oom_reaper race: (1) freeze_processes() starts freezing user space threads. (2) Somebody (maybe a kenrel thread) calls out_of_memory(). (3) The OOM killer calls mark_oom_victim() on a user space thread P1 which is already in __refrigerator(). (4) oom_killer_disable() sets oom_killer_disabled = true. (5) P1 leaves __refrigerator() and enters do_exit(). (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call exit_oom_victim(P1). (7) oom_killer_disable() returns while P1 not yet finished (8) P1 perform IO/interfere with the freezer. This situation is unfortunate. We cannot move oom_killer_disable after all the freezable kernel threads are frozen because the oom victim might depend on some of those kthreads to make a forward progress to exit so we could deadlock. It is also far from trivial to teach the oom_reaper to not call exit_oom_victim() because then we would lose a guarantee of the OOM killer and oom_killer_disable forward progress because exit_mm->mmput might block and never call exit_oom_victim. It seems the easiest way forward is to workaround this race by calling try_to_freeze_tasks again after oom_killer_disable. This will make sure that all the tasks are frozen or it bails out. Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper") Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24Clarify naming of thread info/stack allocatorsLinus Torvalds
We've had the thread info allocated together with the thread stack for most architectures for a long time (since the thread_info was split off from the task struct), but that is about to change. But the patches that move the thread info to be off-stack (and a part of the task struct instead) made it clear how confused the allocator and freeing functions are. Because the common case was that we share an allocation with the thread stack and the thread_info, the two pointers were identical. That identity then meant that we would have things like ti = alloc_thread_info_node(tsk, node); ... tsk->stack = ti; which certainly _worked_ (since stack and thread_info have the same value), but is rather confusing: why are we assigning a thread_info to the stack? And if we move the thread_info away, the "confusing" code just gets to be entirely bogus. So remove all this confusion, and make it clear that we are doing the stack allocation by renaming and clarifying the function names to be about the stack. The fact that the thread_info then shares the allocation is an implementation detail, and not really about the allocation itself. This is a pure renaming and type fix: we pass in the same pointer, it's just that we clarify what the pointer means. The ia64 code that actually only has one single allocation (for all of task_struct, thread_info and kernel thread stack) now looks a bit odd, but since "tsk->stack" is actually not even used there, that oddity doesn't matter. It would be a separate thing to clean that up, I intentionally left the ia64 changes as a pure brute-force renaming and type change. Acked-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24sched/core: Allow kthreads to fall back to online && !active cpusTejun Heo
During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is online but not active. A CPU_ONLINE callback may create or bind a kthread so that its cpus_allowed mask only allows the CPU which is being brought online. The kthread may start executing before the CPU is made active and can end up in select_fallback_rq(). In such cases, the expected behavior is selecting the CPU which is coming online; however, because select_fallback_rq() only chooses from active CPUs, it determines that the task doesn't have any viable CPU in its allowed mask and ends up overriding it to cpu_possible_mask. CPU_ONLINE callbacks should be able to put kthreads on the CPU which is coming online. Update select_fallback_rq() so that it follows cpu_online() rather than cpu_active() for kthreads. Reported-by: Gautham R Shenoy <ego@linux.vnet.ibm.com> Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@fb.com Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-24sched/fair: Do not announce throttled next buddy in dequeue_task_fair()Konstantin Khlebnikov
Hierarchy could be already throttled at this point. Throttled next buddy could trigger a NULL pointer dereference in pick_next_task_fair(). Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-24sched/fair: Initialize throttle_count for new task-groups lazilyKonstantin Khlebnikov
Cgroup created inside throttled group must inherit current throttle_count. Broken throttle_count allows to nominate throttled entries as a next buddy, later this leads to null pointer dereference in pick_next_task_fair(). This patch initialize cfs_rq->throttle_count at first enqueue: laziness allows to skip locking all rq at group creation. Lazy approach also allows to skip full sub-tree scan at throttling hierarchy (not in this patch). Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-24locking/static_key: Fix concurrent static_key_slow_inc()Paolo Bonzini
The following scenario is possible: CPU 1 CPU 2 static_key_slow_inc() atomic_inc_not_zero() -> key.enabled == 0, no increment jump_label_lock() atomic_inc_return() -> key.enabled == 1 now static_key_slow_inc() atomic_inc_not_zero() -> key.enabled == 1, inc to 2 return ** static key is wrong! jump_label_update() jump_label_unlock() Testing the static key at the point marked by (**) will follow the wrong path for jumps that have not been patched yet. This can actually happen when creating many KVM virtual machines with userspace LAPIC emulation; just run several copies of the following program: #include <fcntl.h> #include <unistd.h> #include <sys/ioctl.h> #include <linux/kvm.h> int main(void) { for (;;) { int kvmfd = open("/dev/kvm", O_RDONLY); int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0); close(ioctl(vmfd, KVM_CREATE_VCPU, 1)); close(vmfd); close(kvmfd); } return 0; } Every KVM_CREATE_VCPU ioctl will attempt a static_key_slow_inc() call. The static key's purpose is to skip NULL pointer checks and indeed one of the processes eventually dereferences NULL. As explained in the commit that introduced the bug: 706249c222f6 ("locking/static_keys: Rework update logic") jump_label_update() needs key.enabled to be true. The solution adopted here is to temporarily make key.enabled == -1, and use go down the slow path when key.enabled <= 0. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v4.3+ Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 706249c222f6 ("locking/static_keys: Rework update logic") Link: http://lkml.kernel.org/r/1466527937-69798-1-git-send-email-pbonzini@redhat.com [ Small stylistic edits to the changelog and the code. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-23cgroup: Disable IRQs while holding css_set_lockDaniel Bristot de Oliveira
While testing the deadline scheduler + cgroup setup I hit this warning. [ 132.612935] ------------[ cut here ]------------ [ 132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80 [ 132.612952] Modules linked in: (a ton of modules...) [ 132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2 [ 132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014 [ 132.612982] 0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e [ 132.612984] 0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b [ 132.612985] 00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00 [ 132.612986] Call Trace: [ 132.612987] <IRQ> [<ffffffff813d229e>] dump_stack+0x63/0x85 [ 132.612994] [<ffffffff810a652b>] __warn+0xcb/0xf0 [ 132.612997] [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170 [ 132.612999] [<ffffffff810a665d>] warn_slowpath_null+0x1d/0x20 [ 132.613000] [<ffffffff810aba5b>] __local_bh_enable_ip+0x6b/0x80 [ 132.613008] [<ffffffff817d6c8a>] _raw_write_unlock_bh+0x1a/0x20 [ 132.613010] [<ffffffff817d6c9e>] _raw_spin_unlock_bh+0xe/0x10 [ 132.613015] [<ffffffff811388ac>] put_css_set+0x5c/0x60 [ 132.613016] [<ffffffff8113dc7f>] cgroup_free+0x7f/0xa0 [ 132.613017] [<ffffffff810a3912>] __put_task_struct+0x42/0x140 [ 132.613018] [<ffffffff810e776a>] dl_task_timer+0xca/0x250 [ 132.613027] [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170 [ 132.613030] [<ffffffff8111371e>] __hrtimer_run_queues+0xee/0x270 [ 132.613031] [<ffffffff81113ec8>] hrtimer_interrupt+0xa8/0x190 [ 132.613034] [<ffffffff81051a58>] local_apic_timer_interrupt+0x38/0x60 [ 132.613035] [<ffffffff817d9b0d>] smp_apic_timer_interrupt+0x3d/0x50 [ 132.613037] [<ffffffff817d7c5c>] apic_timer_interrupt+0x8c/0xa0 [ 132.613038] <EOI> [<ffffffff81063466>] ? native_safe_halt+0x6/0x10 [ 132.613043] [<ffffffff81037a4e>] default_idle+0x1e/0xd0 [ 132.613044] [<ffffffff810381cf>] arch_cpu_idle+0xf/0x20 [ 132.613046] [<ffffffff810e8fda>] default_idle_call+0x2a/0x40 [ 132.613047] [<ffffffff810e92d7>] cpu_startup_entry+0x2e7/0x340 [ 132.613048] [<ffffffff81050235>] start_secondary+0x155/0x190 [ 132.613049] ---[ end trace f91934d162ce9977 ]--- The warn is the spin_(lock|unlock)_bh(&css_set_lock) in the interrupt context. Converting the spin_lock_bh to spin_lock_irq(save) to avoid this problem - and other problems of sharing a spinlock with an interrupt. Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: cgroups@vger.kernel.org Cc: stable@vger.kernel.org # 4.5+ Cc: linux-kernel@vger.kernel.org Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-06-23locking: avoid passing around 'thread_info' in mutex debugging codeLinus Torvalds
None of the code actually wants a thread_info, it all wants a task_struct, and it's just converting back and forth between the two ("ti->task" to get the task_struct from the thread_info, and "task_thread_info(task)" to go the other way). No semantic change. Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-21Merge branch 'fortglx/4.8/time' of ↵Thomas Gleixner
https://git.linaro.org/people/john.stultz/linux into timers/core Pull time(keeping) updates from John Stultz: - Handle the 1ns issue with the old refusing to die vsyscall machinery - More y2038 updates - Documentation fixes - Simplify clocksource handling
2016-06-20timer: Avoid using timespecArnd Bergmann
The tstats_show() function prints a ktime_t variable by converting it to struct timespec first. The algorithm is ok, but we want to stop using timespec in general because of the 32-bit time_t overflow problem. This changes the code to use struct timespec64, without any functional change. Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-06-20time: Avoid timespec in udelay_testArnd Bergmann
udelay_test_single() uses ktime_get_ts() to get two timespec values and calculate the difference between them, while udelay_test_show() uses the same to printk() the current monotonic time. Both of these are y2038 safe on all machines, but we want to get rid of struct timespec anyway, so this converts the code to use ktime_get_ns() and ktime_get_ts64() respectively. Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: John Stultz <john.stultz@linaro.org>