summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2015-05-13locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHGSebastian Andrzej Siewior
The rtmutex code is the only user of __HAVE_ARCH_CMPXCHG and we have a few other user of cmpxchg() which do not care about __HAVE_ARCH_CMPXCHG. This define was first introduced in 23f78d4a0 ("[PATCH] pi-futex: rt mutex core") which is v2.6.18. The generic cmpxchg was introduced later in 068fbad288 ("Add cmpxchg_local to asm-generic for per cpu atomic operations") which is v2.6.25. Back then something was required to get rtmutex working with the fast path on architectures without cmpxchg and this seems to be the result. It popped up recently on rt-users because ARM (v6+) does not define __HAVE_ARCH_CMPXCHG (even that it implements it) which results in slower locking performance in the fast path. To put some numbers on it: preempt -RT, am335x, 10 loops of 100000 invocations of rt_spin_lock() + rt_spin_unlock() (time "total" is the average of the 10 loops for the 100000 invocations, "loop" is "total / 100000 * 1000"): cmpxchg | slowpath used || cmpxchg used | total | loop || total | loop --------|-----------|-------||------------|------- ARMv6 | 9129.4 us | 91 ns || 3311.9 us | 33 ns generic | 9360.2 us | 94 ns || 10834.6 us | 108 ns ----------------------------||-------------------- Forcing it to generic cmpxchg() made things worse for the slowpath and even worse in cmpxchg() path. It boils down to 14ns more per lock+unlock in a cache hot loop so it might not be that much in real world. The last test was a substitute for pre ARMv6 machine but then I was able to perform the comparison on imx28 which is ARMv5 and therefore is always is using the generic cmpxchg implementation. And the numbers: | total | loop -------- |----------- |-------- slowpath | 263937.2 us | 2639 ns cmpxchg | 16934.2 us | 169 ns -------------------------------- The numbers are larger since the machine is slower in general. However, letting rtmutex use cmpxchg() instead the slowpath seem to improve things. Since from the ARM (tested on am335x + imx28) point of view always using cmpxchg() in rt_mutex_lock() + rt_mutex_unlock() makes sense I would drop the define. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: will.deacon@arm.com Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20150225175613.GE6823@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-12locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKSWaiman Long
To be consistent with the queued spinlocks which use CONFIG_QUEUED_SPINLOCKS config parameter, the one for the queued rwlocks is now renamed to CONFIG_QUEUED_RWLOCKS. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1431367031-36697-1-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-11locking/pvqspinlock: Rename QUEUED_SPINLOCK to QUEUED_SPINLOCKSIngo Molnar
Valentin Rothberg reported that we use CONFIG_QUEUED_SPINLOCKS in arch/x86/kernel/paravirt_patch_32.c, while the symbol is called CONFIG_QUEUED_SPINLOCK. (Note the extra 'S') But the typo was natural: the proper English term for such a generic object would be 'queued spinlocks' - so rename this and related symbols accordingly to the plural form. Reported-by: Valentin Rothberg <valentinrothberg@gmail.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <Waiman.Long@hp.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-11locking/pvqspinlock: Replace xchg() by the more descriptive set_mb()Waiman Long
The xchg() function was used in pv_wait_node() to set a certain value and provide a memory barrier which is what the set_mb() function is for. This patch replaces the xchg() call by set_mb(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08locking/pvqspinlock, x86: Enable PV qspinlock for XenDavid Vrabel
This patch adds the necessary Xen specific code to allow Xen to support the CPU halting and kicking operations needed by the queue spinlock PV code. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-12-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08locking/pvqspinlock, x86: Enable PV qspinlock for KVMWaiman Long
This patch adds the necessary KVM specific code to allow KVM to support the CPU halting and kicking operations needed by the queue spinlock PV code. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-11-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08locking/pvqspinlock: Implement simple paravirt support for the qspinlockWaiman Long
Provide a separate (second) version of the spin_lock_slowpath for paravirt along with a special unlock path. The second slowpath is generated by adding a few pv hooks to the normal slowpath, but where those will compile away for the native case, they expand into special wait/wake code for the pv version. The actual MCS queue can use extra storage in the mcs_nodes[] array to keep track of state and therefore uses directed wakeups. The head contender has no such storage directly visible to the unlocker. So the unlocker searches a hash table with open addressing using a simple binary Galois linear feedback shift register. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1429901803-29771-9-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08locking/qspinlock: Revert to test-and-set on hypervisorsPeter Zijlstra (Intel)
When we detect a hypervisor (!paravirt, see qspinlock paravirt support patches), revert to a simple test-and-set lock to avoid the horrors of queue preemption. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-8-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08locking/qspinlock: Use a simple write to grab the lockWaiman Long
Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head don't need to reset the tail code. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the tail code in the lock is set. With that change, the are some slight improvement in the performance of the queued spinlock in the 5M loop micro-benchmark run on a 4-socket Westere-EX machine as shown in the tables below. [Standalone/Embedded - same node] # of tasks Before patch After patch %Change ---------- ----------- ---------- ------- 3 2324/2321 2248/2265 -3%/-2% 4 2890/2896 2819/2831 -2%/-2% 5 3611/3595 3522/3512 -2%/-2% 6 4281/4276 4173/4160 -3%/-3% 7 5018/5001 4875/4861 -3%/-3% 8 5759/5750 5563/5568 -3%/-3% [Standalone/Embedded - different nodes] # of tasks Before patch After patch %Change ---------- ----------- ---------- ------- 3 12242/12237 12087/12093 -1%/-1% 4 10688/10696 10507/10521 -2%/-2% It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and was essentially to close the performance gap between the ticket spinlock and queued spinlock. The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. The results of the test runs were: AIM7 XFS Disk Test kernel JPM Real Time Sys Time Usr Time ----- --- --------- -------- -------- ticketlock 5678233 3.17 96.61 5.81 qspinlock 5750799 3.13 94.83 5.97 AIM7 EXT4 Disk Test kernel JPM Real Time Sys Time Usr Time ----- --- --------- -------- -------- ticketlock 1114551 16.15 509.72 7.11 qspinlock 2184466 8.24 232.99 6.01 The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run. The "ebizzy -m" test was also run with the following results: kernel records/s Real Time Sys Time Usr Time ----- --------- --------- -------- -------- ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-7-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08locking/qspinlock: Optimize for smaller NR_CPUSPeter Zijlstra (Intel)
When we allow for a max NR_CPUS < 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. This optimization is needed to make the qspinlock achieve performance parity with ticket spinlock at light load. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08locking/qspinlock: Extract out code snippets for the next patchWaiman Long
This is a preparatory patch that extracts out the following 2 code snippets to prepare for the next performance optimization patch. 1) the logic for the exchange of new and previous tail code words into a new xchg_tail() function. 2) the logic for clearing the pending bit and setting the locked bit into a new clear_pending_set_locked() function. This patch also simplifies the trylock operation before queuing by calling queued_spin_trylock() directly. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-5-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08locking/qspinlock: Add pending bitPeter Zijlstra (Intel)
Because the qspinlock needs to touch a second cacheline (the per-cpu mcs_nodes[]); add a pending bit and allow a single in-word spinner before we punt to the second cacheline. It is possible so observe the pending bit without the locked bit when the last owner has just released but the pending owner has not yet taken ownership. In this case we would normally queue -- because the pending bit is already taken. However, in this case the pending bit is guaranteed to be released 'soon', therefore wait for it and avoid queueing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08locking/qspinlock: Introduce a simple generic 4-byte queued spinlockWaiman Long
This patch introduces a new generic queued spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queued spinlock should be almost as fair as the ticket spinlock. It has about the same speed in single-thread and it can be much faster in high contention situations especially when the spinlock is embedded within the data structure to be protected. Only in light to moderate contention where the average queue depth is around 1-3 will this queued spinlock be potentially a bit slower due to the higher slowpath overhead. This queued spinlock is especially suit to NUMA machines with a large number of cores as the chance of spinlock contention is much higher in those machines. The cost of contention is also higher because of slower inter-node memory traffic. Due to the fact that spinlocks are acquired with preemption disabled, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending in one spinlock at any one time. Counting soft IRQ, hard IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting activities. By allocating a set of per-cpu queue nodes and used them to form a waiting queue, we can encode the queue node address into a much smaller 24-bit size (including CPU number and queue node index) leaving one byte for the lock. Please note that the queue node is only needed when waiting for the lock. Once the lock is acquired, the queue node can be released to be used later. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()Waiman Long
In up_write()/up_read(), rwsem_wake() will be called whenever it detects that some writers/readers are waiting. The rwsem_wake() function will take the wait_lock and call __rwsem_do_wake() to do the real wakeup. For a heavily contended rwsem, doing a spin_lock() on wait_lock will cause further contention on the heavily contended rwsem cacheline resulting in delay in the completion of the up_read/up_write operations. This patch makes the wait_lock taking and the call to __rwsem_do_wake() optional if at least one spinning writer is present. The spinning writer will be able to take the rwsem and call rwsem_wake() later when it calls up_write(). With the presence of a spinning writer, rwsem_wake() will now try to acquire the lock using trylock. If that fails, it will just quit. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1430428337-16802-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-06Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU fix from Ingo Molnar: "An RCU Kconfig fix that eliminates an annoying interactive kconfig question for CONFIG_RCU_TORTURE_TEST_SLOW_INIT" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rcu: Control grace-period delays directly from value
2015-05-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Receive packet length needs to be adjust by 2 on RX to accomodate the two padding bytes in altera_tse driver. From Vlastimil Setka. 2) If rx frame is dropped due to out of memory in macb driver, we leave the receive ring descriptors in an undefined state. From Punnaiah Choudary Kalluri 3) Some netlink subsystems erroneously signal NLM_F_MULTI. That is only for dumps. Fix from Nicolas Dichtel. 4) Fix mis-use of raw rt->rt_pmtu value in ipv4, one must always go via the ipv4_mtu() helper. From Herbert Xu. 5) Fix null deref in bridge netfilter, and miscalculated lengths in jump/goto nf_tables verdicts. From Florian Westphal. 6) Unhash ping sockets properly. 7) Software implementation of BPF divide did 64/32 rather than 64/64 bit divide. The JITs got it right. Fix from Alexei Starovoitov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits) ipv4: Missing sk_nulls_node_init() in ping_unhash(). net: fec: Fix RGMII-ID mode net/mlx4_en: Schedule napi when RX buffers allocation fails netxen_nic: use spin_[un]lock_bh around tx_clean_lock net/mlx4_core: Fix unaligned accesses mlx4_en: Use correct loop cursor in error path. cxgb4: Fix MC1 memory offset calculation bnx2x: Delay during kdump load net: Fix Kernel Panic in bonding driver debugfs file: rlb_hash_table net: dsa: Fix scope of eeprom-length property net: macb: Fix race condition in driver when Rx frame is dropped hv_netvsc: Fix a bug in netvsc_start_xmit() altera_tse: Correct rx packet length mlx4: Fix tx ring affinity_mask creation tipc: fix problem with parallel link synchronization mechanism tipc: remove wrong use of NLM_F_MULTI bridge/nl: remove wrong use of NLM_F_MULTI bridge/mdb: remove wrong use of NLM_F_MULTI net: sched: act_connmark: don't zap skb->nfct trivial: net: systemport: bcmsysport.h: fix 0x0x prefix ...
2015-04-30Merge tag 'pm+acpi-4.1-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management and ACPI fixes from Rafael Wysocki: "Three regression fixes this time, one for a recent regression in the cpuidle core affecting multiple systems, one for an inadvertently added duplicate typedef in ACPICA that breaks compilation with GCC 4.5 and one for an ACPI Smart Battery Subsystem driver regression introduced during the 3.18 cycle (stable-candidate). Specifics: - Fix for a regression in the cpuidle core introduced by one of the recent commits in the clockevents_notify() removal series that put a call to a function which had to be executed with disabled interrupts into a code path running with enabled interrupts (Rafael J Wysocki) - Fix for a build problem in ACPICA (with GCC 4.5) introduced by one of the recent ACPICA tools commits that added a duplicate typedef to one of the ACPICA's header files by mistake (Olaf Hering) - Fix for a regression in the ACPI SBS (Smart Battery Subsystem) driver introduced during the 3.18 development cycle causing the smart battery manager to be marked as not present when it should be marked as present (Chris Bainbridge)" * tag 'pm+acpi-4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpuidle: Run tick_broadcast_exit() with disabled interrupts ACPI / SBS: Enable battery manager when present ACPICA: remove duplicate u8 typedef
2015-04-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm changes from Paolo Bonzini: "Remove from guest code the handling of task migration during a pvclock read; instead use the correct protocol in KVM. This removes the need for task migration notifiers in core scheduler code" [ The scheduler people really hated the migration notifiers, so this was kind of required - Linus ] * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: x86: pvclock: Really remove the sched notifier for cross-cpu migrations kvm: x86: fix kvmclock update protocol
2015-04-30modsign: change default key detailsDavid Howells
Change default key details to be more obviously unspecified. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-29cpuidle: Run tick_broadcast_exit() with disabled interruptsRafael J. Wysocki
Commit 335f49196fd6 (sched/idle: Use explicit broadcast oneshot control function) replaced clockevents_notify() invocations in cpuidle_idle_call() with direct calls to tick_broadcast_enter() and tick_broadcast_exit(), but it overlooked the fact that interrupts were already enabled before calling the latter which led to functional breakage on systems using idle states with the CPUIDLE_FLAG_TIMER_STOP flag set. Fix that by moving the invocations of tick_broadcast_enter() and tick_broadcast_exit() down into cpuidle_enter_state() where interrupts are still disabled when tick_broadcast_exit() is called. Also ensure that interrupts will be disabled before running tick_broadcast_exit() even if they have been enabled by the idle state's ->enter callback. Trigger a WARN_ON_ONCE() in that case, as we generally don't want that to happen for states with CPUIDLE_FLAG_TIMER_STOP set. Fixes: 335f49196fd6 (sched/idle: Use explicit broadcast oneshot control function) Reported-and-tested-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-04-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "One additional new feature for 4.1, a new PRNG based on SHA-512 for the zcrypt driver. Two memory management related changes, the page table reallocation for KVM is removed, and with file ptes gone the encoding of page table entries is improved. And three bug fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/zcrypt: Introduce new SHA-512 based Pseudo Random Generator. s390/mm: change swap pte encoding and pgtable cleanup s390/mm: correct transfer of dirty & young bits in __pmd_to_pte s390/bpf: add dependency to z196 features s390/3215: free memory in error path s390/kvm: remove delayed reallocation of page tables for KVM kexec: allocate the kexec control page with KEXEC_CONTROL_MEMORY_GFP
2015-04-27bpf: fix 64-bit divideAlexei Starovoitov
ALU64_DIV instruction should be dividing 64-bit by 64-bit, whereas do_div() does 64-bit by 32-bit divide. x64 and arm64 JITs correctly implement 64 by 64 unsigned divide. llvm BPF backend emits code assuming that ALU64_DIV does 64 by 64. Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets") Reported-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-27x86: pvclock: Really remove the sched notifier for cross-cpu migrationsPaolo Bonzini
This reverts commits 0a4e6be9ca17c54817cf814b4b5aa60478c6df27 and 80f7fdb1c7f0f9266421f823964fd1962681f6ce. The task migration notifier was originally introduced in order to support the pvclock vsyscall with non-synchronized TSC, but KVM only supports it with synchronized TSC. Hence, on KVM the race condition is only needed due to a bad implementation on the host side, and even then it's so rare that it's mostly theoretical. As far as KVM is concerned it's possible to fix the host, avoiding the additional complexity in the vDSO and the (re)introduction of the task migration notifier. Xen, on the other hand, hasn't yet implemented vsyscall support at all, so we do not care about its plans for non-synchronized TSC. Reported-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-23kexec: allocate the kexec control page with KEXEC_CONTROL_MEMORY_GFPMartin Schwidefsky
Introduce KEXEC_CONTROL_MEMORY_GFP to allow the architecture code to override the gfp flags of the allocation for the kexec control page. The loop in kimage_alloc_normal_control_pages allocates pages with GFP_KERNEL until a page is found that happens to have an address smaller than the KEXEC_CONTROL_MEMORY_LIMIT. On systems with a large memory size but a small KEXEC_CONTROL_MEMORY_LIMIT the loop will keep allocating memory until the oom killer steps in. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-04-22Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/auditLinus Torvalds
Pull audit fixes from Paul Moore: "Seven audit patches for v4.1, all bug fixes. The largest, and perhaps most significant commit helps resolve some memory pressure issues related to the inode cache and audit, there are also a few small commits which help resolve some timing issues with the audit log queue, and the rest fall into the always popular "code clean-up" category. In general, nothing really substantial, just a nice set of maintenance patches" * 'upstream' of git://git.infradead.org/users/pcmoore/audit: audit: Remove condition which always evaluates to false audit: reduce mmap_sem hold for mm->exe_file audit: consolidate handling of mm->exe_file audit: code clean up audit: don't reset working wait time accidentally with auditd audit: don't lose set wait time on first successful call to audit_log_start() audit: move the tree pruning to a dedicated thread
2015-04-22Merge tag 'trace-v4.1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "This adds three fixes for the tracing code. The first is a bug when ftrace_dump_on_oops is triggered in atomic context and function graph tracer is the tracer that is being reported. The second fix is bad parsing of the trace_events from the kernel command line, where it would ignore specific events if the system name is used with defining the event(it enables all events within the system). The last one is a fix to the TRACE_DEFINE_ENUM(), where a check was missing to see if the ptr was incremented to the end of the string, but the loop increments it again and can miss the nul delimiter to stop processing" * tag 'trace-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix possible out of bounds memory access when parsing enums tracing: Fix incorrect enabling of trace events by boot cmdline tracing: Handle ftrace_dump() atomic context in graph_trace_open()
2015-04-22Merge tag 'modules-next-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module updates from Rusty Russell: "Quentin opened a can of worms by adding extable entry checking to modpost, but most architectures seem fixed now. Thanks to all involved. Last minute rebase because I noticed a "[PATCH]" had snuck into a commit message somehow" * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: modpost: don't emit section mismatch warnings for compiler optimizations modpost: expand pattern matching to support substring matches modpost: do not try to match the SHT_NUL section. modpost: fix extable entry size calculation. modpost: fix inverted logic in is_extable_fault_address(). modpost: handle -ffunction-sections modpost: Whitelist .text.fixup and .exception.text params: handle quotes properly for values not of form foo="bar". modpost: document the use of struct section_check. modpost: handle relocations mismatch in __ex_table. scripts: add check_extable.sh script. modpost: mismatch_handler: retrieve tosym information only when needed. modpost: factorize symbol pretty print in get_pretty_name(). modpost: add handler function pointer to sectioncheck. modpost: add .sched.text and .kprobes.text to the TEXT_SECTIONS list. modpost: add strict white-listing when referencing sections. module: do not print allocation-fail warning on bogus user buffer size kernel/module.c: fix typos in message about unused symbols
2015-04-21Merge tag 'char-misc-4.1-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here's the big char/misc driver patchset for 4.1-rc1. Lots of different driver subsystem updates here, nothing major, full details are in the shortlog. All of this has been in linux-next for a while" * tag 'char-misc-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (133 commits) mei: trace: remove unused TRACE_SYSTEM_STRING DTS: ARM: OMAP3-N900: Add lis3lv02d support Documentation: DT: lis302: update wakeup binding lis3lv02d: DT: add wakeup unit 2 and wakeup threshold lis3lv02d: DT: use s32 to support negative values Drivers: hv: hv_balloon: correctly handle num_pages>INT_MAX case Drivers: hv: hv_balloon: correctly handle val.freeram<num_pages case mei: replace check for connection instead of transitioning mei: use mei_cl_is_connected consistently mei: fix mei_poll operation hv_vmbus: Add gradually increased delay for retries in vmbus_post_msg() Drivers: hv: hv_balloon: survive ballooning request with num_pages=0 Drivers: hv: hv_balloon: eliminate jumps in piecewiese linear floor function Drivers: hv: hv_balloon: do not online pages in offline blocks hv: remove the per-channel workqueue hv: don't schedule new works in vmbus_onoffer()/vmbus_onoffer_rescind() hv: run non-blocking message handlers in the dispatch tasklet coresight: moving to new "hwtracing" directory coresight-tmc: Adding a status interface to sysfs coresight: remove the unnecessary configuration coresight-default-sink ...
2015-04-21Merge tag 'tty-4.1-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial updates from Greg KH: "Here's the big tty/serial driver update for 4.1-rc1. It was delayed for a bit due to some questions surrounding some of the console command line parsing changes that are in here. There's still one tiny regression for people who were previously putting multiple console command lines and expecting them all to be ignored for some odd reason, but Peter is working on fixing that. If not, I'll send a revert for the offending patch, but I have faith that Peter can address it. Other than the console work here, there's the usual serial driver updates and changes, and a buch of 8250 reworks to try to make that driver easier to maintain over time, and have it support more devices in the future. All of these have been in linux-next for a while" * tag 'tty-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (119 commits) n_gsm: Drop unneeded cast on netdev_priv sc16is7xx: expose RTS inversion in RS-485 mode serial: 8250_pci: port failed after wakeup from S3 earlycon: 8250: Document kernel command line options earlycon: 8250: Fix command line regression earlycon: Fix __earlycon_table stride tty: clean up the tty time logic a bit serial: 8250_dw: only get the clock rate in one place serial: 8250_dw: remove useless ACPI ID check dmaengine: hsu: move memory allocation to GFP_NOWAIT dmaengine: hsu: remove redundant pieces of code serial: 8250_pci: add Intel Tangier support dmaengine: hsu: add Intel Tangier PCI ID serial: 8250_pci: replace switch-case by formula for Intel MID serial: 8250_pci: replace switch-case by formula tty: cpm_uart: replace CONFIG_8xx by CONFIG_CPM1 serial: jsm: some off by one bugs serial: xuartps: Fix check in console_setup(). serial: xuartps: Get rid of register access macros. serial: xuartps: Fix iobase use. ...
2015-04-19smp: Fix error case handling in smp_call_function_*()Linus Torvalds
Commit 8053871d0f7f ("smp: Fix smp_call_function_single_async() locking") fixed the locking for the asynchronous smp-call case, but in the process of moving the lock handling around, one of the error cases ended up not unlocking the call data at all. This went unnoticed on x86, because this is a "caller is buggy" case, where the caller is trying to call a non-existent CPU. But apparently ARM does that (at least under qemu-arm). Bindly doing cross-cpu calls to random CPU's that aren't even online seems a bit fishy, but the error handling was clearly not correct. Simply add the missing "csd_unlock()" to the error path. Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net> Analyzed-by: Rabin Vincent <rabin@rab.in> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-18Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "Two fixes: an smp-call fix and a lockdep fix" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp: Fix smp_call_function_single_async() locking lockdep: Make print_lock() robust against concurrent release
2015-04-18Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent Pull RCU fix from Paul E. McKenney: "This series contains a single change that fixes Kconfig asking pointless questions." Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix verifier memory corruption and other bugs in BPF layer, from Alexei Starovoitov. 2) Add a conservative fix for doing BPF properly in the BPF classifier of the packet scheduler on ingress. Also from Alexei. 3) The SKB scrubber should not clear out the packet MARK and security label, from Herbert Xu. 4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue. 5) Pause handling is not correct in the stmmac driver because it doesn't take into consideration the RX and TX fifo sizes. From Vince Bridgers. 6) Failure path missing unlock in FOU driver, from Wang Cong. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits) net: dsa: use DEVICE_ATTR_RW to declare temp1_max netns: remove BUG_ONs from net_generic() IB/ipoib: Fix ndo_get_iflink sfc: Fix memcpy() with const destination compiler warning. altera tse: Fix network-delays and -retransmissions after high throughput. net: remove unused 'dev' argument from netif_needs_gso() act_mirred: Fix bogus header when redirecting from VLAN inet_diag: fix access to tcp cc information tcp: tcp_get_info() should fetch socket fields once net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state() skbuff: Do not scrub skb mark within the same name space Revert "net: Reset secmark when scrubbing packet" bpf: fix two bugs in verification logic when accessing 'ctx' pointer bpf: fix bpf helpers to use skb->mac_header relative offsets stmmac: Configure Flow Control to work correctly based on rxfifo size stmmac: Enable unicast pause frame detect in GMAC Register 6 stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree stmmac: Add defines and documentation for enabling flow control stmmac: Add properties for transmit and receive fifo sizes stmmac: fix oops on rmmod after assigning ip addr ...
2015-04-17tracing: Fix possible out of bounds memory access when parsing enumsSteven Rostedt (Red Hat)
The code that replaces the enum names with the enum values in the tracepoints' format files could possible miss the end of string nul character. This was caused by processing things like backslashes, quotes and other tokens. After processing the tokens, a check for the nul character needed to be done before continuing the loop, because the loop incremented the pointer before doing the check, which could bypass the nul character. Link: http://lkml.kernel.org/r/552E661D.5060502@oracle.com Reported-by: Sasha Levin <sasha.levin@oracle.com> # via KASan Tested-by: Andrey Ryabinin <a.ryabinin@samsung.com> Fixes: 0c564a538aa9 "tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values" Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-17oprofile: reduce mmap_sem hold for mm->exe_fileDavidlohr Bueso
sync_buffer() needs the mmap_sem for two distinct operations, both only occurring upon user context switch handling: 1) Dealing with the exe_file. 2) Adding the dcookie data as we need to lookup the vma that backs it. This is done via add_sample() and add_data(). This patch isolates 1), for it will no longer need the mmap_sem for serialization. However, for now, make of the more standard get_mm_exe_file(), requiring only holding the mmap_sem to read the value, and relying on reference counting to make sure that the exe file won't dissappear underneath us while doing the get dcookie. As a consequence, for 2) we move the mmap_sem locking into where we really need it, in lookup_dcookie(). The benefits are twofold: reduce mmap_sem hold times, and cleaner code. [akpm@linux-foundation.org: export get_mm_exe_file for arch/x86/oprofile/oprofile.ko] Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Robert Richter <rric@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17gcov: fix softlockupsAndrey Ryabinin
gcov profiling if enabled with other heavy compile-time instrumentation like KASan could trigger following softlockups: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] Modules linked in: irq event stamp: 22823276 hardirqs last enabled at (22823275): [<ffffffff86e8d10d>] mutex_lock_nested+0x7d9/0x930 hardirqs last disabled at (22823276): [<ffffffff86e9521d>] apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (22823172): [<ffffffff811ed969>] __do_softirq+0x4db/0x729 softirqs last disabled at (22823167): [<ffffffff811edfcf>] irq_exit+0x7d/0x15b CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 3.19.0-05245-gbb33326-dirty #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014 task: ffff88006cba8000 ti: ffff88006cbb0000 task.ti: ffff88006cbb0000 RIP: kasan_mem_to_shadow+0x1e/0x1f Call Trace: strcmp+0x28/0x70 get_node_by_name+0x66/0x99 gcov_event+0x4f/0x69e gcov_enable_events+0x54/0x7b gcov_fs_init+0xf8/0x134 do_one_initcall+0x1b2/0x288 kernel_init_freeable+0x467/0x580 kernel_init+0x15/0x18b ret_from_fork+0x7c/0xb0 Kernel panic - not syncing: softlockup: hung tasks Fix this by sticking cond_resched() in gcov_enable_events(). Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17kernel/sysctl.c: detect overflows when converting to intHeinrich Schuchardt
When converting unsigned long to int overflows may occur. These currently are not detected when writing to the sysctl file system. E.g. on a system where int has 32 bits and long has 64 bits echo 0x800001234 > /proc/sys/kernel/threads-max has the same effect as echo 0x1234 > /proc/sys/kernel/threads-max The patch adds the missing check in do_proc_dointvec_conv. With the patch an overflow will result in an error EINVAL when writing to the the sysctl file system. Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17prctl: avoid using mmap_sem for exe_file serializationDavidlohr Bueso
Oleg cleverly suggested using xchg() to set the new mm->exe_file instead of calling set_mm_exe_file() which requires some form of serialization -- mmap_sem in this case. For archs that do not have atomic rmw instructions we still fallback to a spinlock alternative, so this should always be safe. As such, we only need the mmap_sem for looking up the backing vm_file, which can be done sharing the lock. Naturally, this means we need to manually deal with both the new and old file reference counting, and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can probably be deleted in the future anyway. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17mm: rcu-protected get_mm_exe_file()Konstantin Khlebnikov
This patch removes mm->mmap_sem from mm->exe_file read side. Also it kills dup_mm_exe_file() and moves exe_file duplication into dup_mmap() where both mmap_sems are locked. [akpm@linux-foundation.org: fix comment typo] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17kernel/sysctl.c: threads-max observe limitsHeinrich Schuchardt
Users can change the maximum number of threads by writing to /proc/sys/kernel/threads-max. With the patch the value entered is checked against the same limits that apply when fork_init is called. Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17kernel/fork.c: avoid division by zeroHeinrich Schuchardt
PAGE_SIZE is not guaranteed to be equal to or less than 8 times the THREAD_SIZE. E.g. architecture hexagon may have page size 1M and thread size 4096. This would lead to a division by zero in the calculation of max_threads. With 32-bit calculation there is no solution which delivers valid results for all possible combinations of the parameters. The code is only called once. Hence a 64-bit calculation can be used as solution. [akpm@linux-foundation.org: use clamp_t(), per Oleg] Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17kernel/fork.c: new function for max_threadsHeinrich Schuchardt
PAGE_SIZE is not guaranteed to be equal to or less than 8 times the THREAD_SIZE. E.g. architecture hexagon may have page size 1M and thread size 4096. This would lead to a division by zero in the calculation of max_threads. With this patch the buggy code is moved to a separate function set_max_threads. The error is not fixed. After fixing the problem in a separate patch the new function can be reused to adjust max_threads after adding or removing memory. Argument mempages of function fork_init() is removed as totalram_pages is an exported symbol. The creation of separate patches for refactoring to a new function and for fixing the logic was suggested by Ingo Molnar. Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fork_init: update max_threads commentJean Delvare
The comment explaining what value max_threads is set to is outdated. The maximum memory consumption ratio for thread structures was 1/2 until February 2002, then it was briefly changed to 1/16 before being set to 1/8 which we still use today. The comment was never updated to reflect that change, it's about time. Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fork: report pid reservation failure properlyMichal Hocko
copy_process will report any failure in alloc_pid as ENOMEM currently which is misleading because the pid allocation might fail not only when the memory is short but also when the pid space is consumed already. The current man page even mentions this case: : EAGAIN : : A system-imposed limit on the number of threads was encountered. : There are a number of limits that may trigger this error: the : RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which : limits the number of processes and threads for a real user ID, was : reached; the kernel's system-wide limit on the number of processes : and threads, /proc/sys/kernel/threads-max, was reached (see : proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max, : was reached (see proc(5)). so the current behavior is also incorrect wrt. documentation. POSIX man page also suggest returing EAGAIN when the process count limit is reached. This patch simply propagates error code from alloc_pid and makes sure we return -EAGAIN due to reservation failure. This will make behavior of fork closer to both our documentation and POSIX. alloc_pid might alsoo fail when the reaper in the pid namespace is dead (the namespace basically disallows all new processes) and there is no good error code which would match documented ones. We have traditionally returned ENOMEM for this case which is misleading as well but as per Eric W. Biederman this behavior is documented in man pid_namespaces(7) : If the "init" process of a PID namespace terminates, the kernel : terminates all of the processes in the namespace via a SIGKILL signal. : This behavior reflects the fact that the "init" process is essential for : the correct operation of a PID namespace. In this case, a subsequent : fork(2) into this PID namespace will fail with the error ENOMEM; it is : not possible to create a new processes in a PID namespace whose "init" : process has terminated. and introducing a new error code would be too risky so let's stick to ENOMEM for this case. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17signal: remove warning about using SI_TKILL in rt_[tg]sigqueueinfoVladimir Davydov
Sending SI_TKILL from rt_[tg]sigqueueinfo was deprecated, so now we issue a warning on the first attempt of doing it. We use WARN_ON_ONCE, which is not informative and, what is worse, taints the kernel, making the trinity syscall fuzzer complain false-positively from time to time. It does not look like we need this warning at all, because the behaviour changed quite a long time ago (2.6.39), and if an application relies on the old API, it gets EPERM anyway and can issue a warning by itself. So let us zap the warning in kernel. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Richard Weinberger <richard@nod.at> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17ptrace: ptrace_detach() can no longer race with SIGKILLOleg Nesterov
ptrace_detach() re-checks ->ptrace under tasklist lock and calls release_task() if __ptrace_detach() returns true. This was needed because the __TASK_TRACED tracee could be killed/untraced, and it could even pass exit_notify() before we take tasklist_lock. But this is no longer possible after 9899d11f6544 "ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL". We can turn these checks into WARN_ON() and remove release_task(). While at it, document the setting of child->exit_code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Pavel Labath <labath@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17ptrace: fix race between ptrace_resume() and wait_task_stopped()Oleg Nesterov
ptrace_resume() is called when the tracee is still __TASK_TRACED. We set tracee->exit_code and then wake_up_state() changes tracee->state. If the tracer's sub-thread does wait() in between, task_stopped_code(ptrace => T) wrongly looks like another report from tracee. This confuses debugger, and since wait_task_stopped() clears ->exit_code the tracee can miss a signal. Test-case: #include <stdio.h> #include <unistd.h> #include <sys/wait.h> #include <sys/ptrace.h> #include <pthread.h> #include <assert.h> int pid; void *waiter(void *arg) { int stat; for (;;) { assert(pid == wait(&stat)); assert(WIFSTOPPED(stat)); if (WSTOPSIG(stat) == SIGHUP) continue; assert(WSTOPSIG(stat) == SIGCONT); printf("ERR! extra/wrong report:%x\n", stat); } } int main(void) { pthread_t thread; pid = fork(); if (!pid) { assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); for (;;) kill(getpid(), SIGHUP); } assert(pthread_create(&thread, NULL, waiter, NULL) == 0); for (;;) ptrace(PTRACE_CONT, pid, 0, SIGCONT); return 0; } Note for stable: the bug is very old, but without 9899d11f6544 "ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL" the fix should use lock_task_sighand(child). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Pavel Labath <labath@google.com> Tested-by: Pavel Labath <labath@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17smp: Fix smp_call_function_single_async() lockingLinus Torvalds
The current smp_function_call code suffers a number of problems, most notably smp_call_function_single_async() is broken. The problem is that flush_smp_call_function_queue() does csd_unlock() _after_ calling csd->func(). This means that a caller cannot properly synchronize the csd usage as it has to. Change the code to release the csd before calling ->func() for the async case, and put a WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK) in smp_call_function_single_async() to warn us of improper serialization, because any waiting there can results in deadlocks when called with IRQs disabled. Rename the (currently) unused WAIT flag to SYNCHRONOUS and (re)use it such that we know what to do in flush_smp_call_function_queue(). Rework csd_{,un}lock() to use smp_load_acquire() / smp_store_release() to avoid some full barriers while more clearly providing lock semantics. Finally move the csd maintenance out of generic_exec_single() into its callers for clearer code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [ Added changelog. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Rafael David Tinoco <inaddy@ubuntu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/CA+55aFz492bzLFhdbKN-Hygjcreup7CjMEYk3nTSfRWjppz-OA@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-17lockdep: Make print_lock() robust against concurrent releasePeter Zijlstra
During sysrq's show-held-locks command it is possible that hlock_class() returns NULL for a given lock. The result is then (after the warning): |BUG: unable to handle kernel NULL pointer dereference at 0000001c |IP: [<c1088145>] get_usage_chars+0x5/0x100 |Call Trace: | [<c1088263>] print_lock_name+0x23/0x60 | [<c1576b57>] print_lock+0x5d/0x7e | [<c1088314>] lockdep_print_held_locks+0x74/0xe0 | [<c1088652>] debug_show_all_locks+0x132/0x1b0 | [<c1315c48>] sysrq_handle_showlocks+0x8/0x10 This *might* happen because the thread on the other CPU drops the lock after we are looking ->lockdep_depth and ->held_locks points no longer to a lock that is held. The fix here is to simply ignore it and continue. Reported-by: Andreas Messerschmid <andreas@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>