summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2013-12-15Merge tag 'pci-v3.13-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "PCI device hotplug - Move device_del() from pci_stop_dev() to pci_destroy_dev() (Rafael Wysocki) Host bridge drivers - Update maintainers for DesignWare, i.MX6, Armada, R-Car (Bjorn Helgaas) - mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin (Jason Gunthorpe) Miscellaneous - Avoid unnecessary CPU switch when calling .probe() (Alexander Duyck) - Revert "workqueue: allow work_on_cpu() to be called recursively" (Bjorn Helgaas) - Disable Bus Master only on kexec reboot (Khalid Aziz) - Omit PCI ID macro strings to shorten quirk names for LTO (Michal Marek)" * tag 'pci-v3.13-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: MAINTAINERS: Add DesignWare, i.MX6, Armada, R-Car PCI host maintainers PCI: Disable Bus Master only on kexec reboot PCI: mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin PCI: Omit PCI ID macro strings to shorten quirk names PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev() Revert "workqueue: allow work_on_cpu() to be called recursively" PCI: Avoid unnecessary CPU switch when calling driver .probe() method
2013-12-13KEYS: fix uninitialized persistent_keyring_register_semXiao Guangrong
We run into this bug: [ 2736.063245] Unable to handle kernel paging request for data at address 0x00000000 [ 2736.063293] Faulting instruction address: 0xc00000000037efb0 [ 2736.063300] Oops: Kernel access of bad area, sig: 11 [#1] [ 2736.063303] SMP NR_CPUS=2048 NUMA pSeries [ 2736.063310] Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT iptable_nat nf_nat_ipv4 iptable_mangle iptable_security iptable_raw ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ebtable_filter ebtables ip6table_filter iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6_tables ibmveth pseries_rng nx_crypto nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc xfs libcrc32c dm_service_time sd_mod crc_t10dif crct10dif_common ibmvfc scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_multipath dm_mod [ 2736.063383] CPU: 1 PID: 7128 Comm: ssh Not tainted 3.10.0-48.el7.ppc64 #1 [ 2736.063389] task: c000000131930120 ti: c0000001319a0000 task.ti: c0000001319a0000 [ 2736.063394] NIP: c00000000037efb0 LR: c0000000006c40f8 CTR: 0000000000000000 [ 2736.063399] REGS: c0000001319a3870 TRAP: 0300 Not tainted (3.10.0-48.el7.ppc64) [ 2736.063403] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 28824242 XER: 20000000 [ 2736.063415] SOFTE: 0 [ 2736.063418] CFAR: c00000000000908c [ 2736.063421] DAR: 0000000000000000, DSISR: 40000000 [ 2736.063425] GPR00: c0000000006c40f8 c0000001319a3af0 c000000001074788 c0000001319a3bf0 GPR04: 0000000000000000 0000000000000000 0000000000000020 000000000000000a GPR08: fffffffe00000002 00000000ffff0000 0000000080000001 c000000000924888 GPR12: 0000000028824248 c000000007e00400 00001fffffa0f998 0000000000000000 GPR16: 0000000000000022 00001fffffa0f998 0000010022e92470 0000000000000000 GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR24: 0000000000000000 c000000000f4a828 00003ffffe527108 0000000000000000 GPR28: c000000000f4a730 c000000000f4a828 0000000000000000 c0000001319a3bf0 [ 2736.063498] NIP [c00000000037efb0] .__list_add+0x30/0x110 [ 2736.063504] LR [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264 [ 2736.063508] PACATMSCRATCH [800000000280f032] [ 2736.063511] Call Trace: [ 2736.063516] [c0000001319a3af0] [c0000001319a3b80] 0xc0000001319a3b80 (unreliable) [ 2736.063523] [c0000001319a3b80] [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264 [ 2736.063530] [c0000001319a3c50] [c0000000006c1bb0] .down_write+0x70/0x78 [ 2736.063536] [c0000001319a3cd0] [c0000000002e5ffc] .keyctl_get_persistent+0x20c/0x320 [ 2736.063542] [c0000001319a3dc0] [c0000000002e2388] .SyS_keyctl+0x238/0x260 [ 2736.063548] [c0000001319a3e30] [c000000000009e7c] syscall_exit+0x0/0x7c [ 2736.063553] Instruction dump: [ 2736.063556] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7cbd2b78 7c9e2378 7c7f1b78 f8010010 [ 2736.063566] f821ff71 e8a50008 7fa52040 40de00c0 <e8be0000> 7fbd2840 40de0094 7fbff040 [ 2736.063579] ---[ end trace 2708241785538296 ]--- It's caused by uninitialized persistent_keyring_register_sem. The bug was introduced by commit f36f8c75, two typos are in that commit: CONFIG_KEYS_KERBEROS_CACHE should be CONFIG_PERSISTENT_KEYRINGS and krb_cache_register_sem should be persistent_keyring_register_sem. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-13KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=yKirill Tkhai
Always remove generated SYSTEM_TRUSTED_KEYRING files while doing make mrproper. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-13X.509: Fix certificate gatheringDavid Howells
Fix the gathering of certificates from both the source tree and the build tree to correctly calculate the pathnames of all the certificates. The problem was that if the default generated cert, signing_key.x509, didn't exist then it would not have a path attached and if it did, it would have a path attached. This means that the contents of kernel/.x509.list would change between the first compilation in a directory and the second. After the second it would remain stable because the signing_key.x509 file exists. The consequence was that the kernel would get relinked unconditionally on the second recompilation. The second recompilation would also show something like this: X.509 certificate list changed CERTS kernel/x509_certificate_list - Including cert /home/torvalds/v2.6/linux/signing_key.x509 AS kernel/system_certificates.o LD kernel/built-in.o which is why the relink would happen. Unfortunately, it isn't a simple matter of just sticking a path on the front of the filename of the certificate in the build directory as make can't then work out how to build it. So the path has to be prepended to the name for sorting and duplicate elimination and then removed for the make rule if it is in the build tree. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-12rcu: Remove "extern" from function declarations in kernel/rcu/rcu.hTeodora Baluta
Function prototypes don't need to have the "extern" keyword since this is the default behavior. Its explicit use is redundant. This commit therefore removes them. Signed-off-by: Teodora Baluta <teobaluta@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-12rcu/torture: Dynamically allocate SRCU output buffer to avoid overflowChen Gang
If the rcutorture SRCU output exceeds 4096 bytes, for example, if you have more than about 75 CPUs, it will overflow the current statically allocated buffer. This commit therefore replaces this static buffer with a dynamically buffer whose size is based on the number of CPUs. Benefits: - Avoids both buffer overflow and output truncation. - Handles an arbitrarily large number of CPUs. - Straightforward implementation. Shortcomings: - Some memory is wasted: 1 cpu now comsumes 50 - 60 bytes, and this patch provides 200 bytes. Therefore, for 1K CPUs, roughly 100KB of memory will be wasted. However, the memory is freed immediately after printing, so this wastage should not be a problem in practice. Testing (Fedora16 2 CPUs, 2GB RAM x86_64): - as module, with/without "torture_type=srcu". - build-in not boot runnable, with/without "torture_type=srcu". - build-in let boot runnable, with/without "torture_type=srcu". Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-12rcu: Don't activate RCU core on NO_HZ_FULL CPUsPaul E. McKenney
Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see if the RCU core needs anything from this CPU. If so, RCU raises RCU_SOFTIRQ to carry out any needed processing. This approach has worked well historically, but it is undesirable on NO_HZ_FULL CPUs. Such CPUs are expected to spend almost all of their time in userspace, so that scheduling-clock interrupts can be disabled while there is only one runnable task on the CPU in question. Unfortunately, raising any softirq has the potential to wake up ksoftirqd, which would provide the second runnable task on that CPU, preventing disabling of scheduling-clock interrupts. What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone, relying on the grace-period kthreads' quiescent-state forcing to do any needed RCU work on behalf of those CPUs. This commit therefore refrains from raising RCU_SOFTIRQ on any NO_HZ_FULL CPUs during any grace periods that have been in effect for less than one second. The one-second limit handles the case where an inappropriate workload is running on a NO_HZ_FULL CPU that features lots of scheduling-clock interrupts, but no idle or userspace time. Reported-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Mike Galbraith <bitbucket@online.de> Toasted-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-12-12rcu: Warn on allegedly impossible rcu_read_unlock_special() from irqLai Jiangshan
After commit #10f39bb1b2c1 (rcu: protect __rcu_read_unlock() against scheduler-using irq handlers), it is no longer possible to enter the main body of rcu_read_lock_special() from an NMI, interrupt, or softirq handler. In theory, this implies that the check for "in_irq() || in_serving_softirq()" must always fail, so that in theory this check could be removed entirely. In practice, this commit wraps this condition with a WARN_ON_ONCE(). If this warning never triggers, then the condition will be removed entirely. [ paulmck: And one way of triggering the WARN_ON() is if a scheduling clock interrupt occurs in an RCU read-side critical section, setting RCU_READ_UNLOCK_NEED_QS, which is handled by rcu_read_unlock_special(). Updated this commit to return if only that bit was set. ] Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-12Merge tag 'keys-devel-20131210' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull misc keyrings fixes from David Howells: "These break down into five sets: - A patch to error handling in the big_key type for huge payloads. If the payload is larger than the "low limit" and the backing store allocation fails, then big_key_instantiate() doesn't clear the payload pointers in the key, assuming them to have been previously cleared - but only one of them is. Unfortunately, the garbage collector still calls big_key_destroy() when sees one of the pointers with a weird value in it (and not NULL) which it then tries to clean up. - Three patches to fix the keyring type: * A patch to fix the hash function to correctly divide keyrings off from keys in the topology of the tree inside the associative array. This is only a problem if searching through nested keyrings - and only if the hash function incorrectly puts the a keyring outside of the 0 branch of the root node. * A patch to fix keyrings' use of the associative array. The __key_link_begin() function initially passes a NULL key pointer to assoc_array_insert() on the basis that it's holding a place in the tree whilst it does more allocation and stuff. This is only a problem when a node contains 16 keys that match at that level and we want to add an also matching 17th. This should easily be manufactured with a keyring full of keyrings (without chucking any other sort of key into the mix) - except for (a) above which makes it on average adding the 65th keyring. * A patch to fix searching down through nested keyrings, where any keyring in the set has more than 16 keyrings and none of the first keyrings we look through has a match (before the tree iteration needs to step to a more distal node). Test in keyutils test suite: http://git.kernel.org/cgit/linux/kernel/git/dhowells/keyutils.git/commit/?id=8b4ae963ed92523aea18dfbb8cab3f4979e13bd1 - A patch to fix the big_key type's use of a shmem file as its backing store causing audit messages and LSM check failures. This is done by setting S_PRIVATE on the file to avoid LSM checks on the file (access to the shmem file goes through the keyctl() interface and so is gated by the LSM that way). This isn't normally a problem if a key is used by the context that generated it - and it's currently only used by libkrb5. Test in keyutils test suite: http://git.kernel.org/cgit/linux/kernel/git/dhowells/keyutils.git/commit/?id=d9a53cbab42c293962f2f78f7190253fc73bd32e - A patch to add a generated file to .gitignore. - A patch to fix the alignment of the system certificate data such that it it works on s390. As I understand it, on the S390 arch, symbols must be 2-byte aligned because loading the address discards the least-significant bit" * tag 'keys-devel-20131210' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: KEYS: correct alignment of system_certificate_list content in assembly file Ignore generated file kernel/x509_certificate_list security: shmem: implement kernel private shmem inodes KEYS: Fix searching of nested keyrings KEYS: Fix multiple key add into associative array KEYS: Fix the keyring hash function KEYS: Pre-clear struct key on allocation
2013-12-12futex: move user address verification up to common codeLinus Torvalds
When debugging the read-only hugepage case, I was confused by the fact that get_futex_key() did an access_ok() only for the non-shared futex case, since the user address checking really isn't in any way specific to the private key handling. Now, it turns out that the shared key handling does effectively do the equivalent checks inside get_user_pages_fast() (it doesn't actually check the address range on x86, but does check the page protections for being a user page). So it wasn't actually a bug, but the fact that we treat the address differently for private and shared futexes threw me for a loop. Just move the check up, so that it gets done for both cases. Also, use the 'rw' parameter for the type, even if it doesn't actually matter any more (it's a historical artifact of the old racy i386 "page faults from kernel space don't check write protections"). Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12futex: fix handling of read-only-mapped hugepagesLinus Torvalds
The hugepage code had the exact same bug that regular pages had in commit 7485d0d3758e ("futexes: Remove rw parameter from get_futex_key()"). The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix regression with read only mappings"), but the transparent hugepage case (added in a5b338f2b0b1: "thp: update futex compound knowledge") case remained broken. Found by Dave Jones and his trinity tool. Reported-and-tested-by: Dave Jones <davej@fedoraproject.org> Cc: stable@kernel.org # v2.6.38+ Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-11perf: Optimize ring-buffer write by depending on control dependenciesPeter Zijlstra
Remove a full barrier from the ring-buffer write path by relying on a control dependency to order a LOAD -> STORE scenario. Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-8alv40z6ikk57jzbaobnxrjl@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-11sched/fair: Rework sched_fair time accountingPeter Zijlstra
Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This results in him occasionally seeing time going backwards - which crashes the scheduler ... Most of our time accounting can actually handle that except the most common one; the tick time update of sched_fair. There is a further problem with that code; previously we assumed that because we get a tick every TICK_NSEC our time delta could never exceed 32bits and math was simpler. However, ever since Frederic managed to get NO_HZ_FULL merged; this is no longer the case since now a task can run for a long time indeed without getting a tick. It only takes about ~4.2 seconds to overflow our u32 in nanoseconds. This means we not only need to better deal with time going backwards; but also means we need to be able to deal with large deltas. This patch reworks the entire code and uses mul_u64_u32_shr() as proposed by Andy a long while ago. We express our virtual time scale factor in a u32 multiplier and shift right and the 32bit mul_u64_u32_shr() implementation reduces to a single 32x32->64 multiply if the time delta is still short (common case). For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128. Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: fweisbec@gmail.com Cc: Paul Turner <pjt@google.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-11sched: Initialize power_orig for overlapping groupsPeter Zijlstra
Yinghai reported that he saw a /0 in sg_capacity on his EX parts. Make sure to always initialize power_orig now that we actually use it. Ideally build_sched_domains() -> init_sched_groups_power() would also initialize this; but for some yet unexplained reason some setups seem to miss updates there. Reported-by: Yinghai Lu <yinghai@kernel.org> Tested-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-10KEYS: correct alignment of system_certificate_list content in assembly fileHendrik Brueckner
Apart from data-type specific alignment constraints, there are also architecture-specific alignment requirements. For example, on s390 symbols must be on even addresses implying a 2-byte alignment. If the system_certificate_list_end symbol is on an odd address and if this address is loaded, the least-significant bit is ignored. As a result, the load_system_certificate_list() fails to load the certificates because of a wrong certificate length calculation. To be safe, align system_certificate_list on an 8-byte boundary. Also improve the length calculation of the system_certificate_list content. Introduce a system_certificate_list_size (8-byte aligned because of unsigned long) variable that stores the length. Let the linker calculate this size by introducing a start and end label for the certificate content. Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-10Ignore generated file kernel/x509_certificate_listRusty Russell
$ git status # On branch pending-rebases # Untracked files: # (use "git add <file>..." to include in what will be committed) # # kernel/x509_certificate_list nothing added to commit but untracked files present (use "git add" to track) $ Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-09rcu: Provide better diagnostics for blocking in RCU callback functionsPaul E. McKenney
Currently blocking in an RCU callback function will result in "scheduling while atomic", which could be triggered for any number of reasons. To aid debugging, this patch introduces a rcu_callback_map that is used to tie the inappropriate voluntary context switch back to the fact that the function is being invoked from within a callback. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-09rcu: Improve SRCU's grace-period commentsPaul E. McKenney
This commit documents the memory-barrier guarantees provided by synchronize_srcu() and call_srcu(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-09rcu: Fix CONFIG_RCU_FANOUT_EXACT for odd fanout/leaf valuesPaul E. McKenney
Each element of the rcu_state structure's ->levelspread[] array is intended to contain the per-level fanout, where the zero-th element corresponds to the root of the rcu_node tree, and the last element corresponds to the leaves. In the CONFIG_RCU_FANOUT_EXACT case, this means that the last element should be filled in from CONFIG_RCU_FANOUT_LEAF (or from the rcu_fanout_leaf boot parameter, if provided) and that the remaining elements should be filled in from CONFIG_RCU_FANOUT. Unfortunately, the current code in rcu_init_levelspread() takes the opposite approach, placing CONFIG_RCU_FANOUT_LEAF in the zero-th element and CONFIG_RCU_FANOUT in the remaining elements. For typical power-of-two values, this generates odd but functional rcu_node trees. However, other values, for example CONFIG_RCU_FANOUT=3 and CONFIG_RCU_FANOUT_LEAF=2, generate trees that can leave some CPUs out of the grace-period computation, resulting in too-short grace periods and therefore a broken RCU implementation. This commit therefore fixes rcu_init_levelspread() to set the last ->levelspread[] array element from CONFIG_RCU_FANOUT_LEAF and the remaining elements from CONFIG_RCU_FANOUT, thus generating the intended rcu_node trees. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-09rcu: Fix coccinelle warningsFengguang Wu
This commit fixes the following coccinelle warning: kernel/rcu/tree.c:712:9-10: WARNING: return of 0/1 in function 'rcu_lockdep_current_cpu_online' with return type bool Return statements in functions returning bool should use true/false instead of 1/0. Generated by: coccinelle/misc/boolreturn.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-09posix-timers: Convert abuses of BUG_ON to WARN_ONFrederic Weisbecker
The posix cpu timers code makes a heavy use of BUG_ON() but none of these concern fatal issues that require to stop the machine. So let's just warn the user when some internal state slips out of our hands. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Remove remaining uses of tasklist_lockFrederic Weisbecker
The remaining uses of tasklist_lock were mostly about synchronizing against sighand modifications, getting coherent and safe group samples and also thread/process wide timers list handling. All of this is already safely synchronizable with the target's sighand lock. Let's use it on these places instead. Also update the comments about locking. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Use sighand lock instead of tasklist_lock on timer deletionFrederic Weisbecker
Timer deletion doesn't need the tasklist lock. We need to protect against: * concurrent access to the lists p->cputime_expires and p->sighand->cputime_expires * task reaping that may also delete the timer list entry * timer firing We already hold the timer lock which protects us against concurrent timer firing. The rest only need the targets sighand to be locked. So hold it and drop the use of tasklist_lock there. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Use sighand lock instead of tasklist_lock for task clock sampleFrederic Weisbecker
There is no need for the tasklist_lock just to take a process wide clock sample. All we need is to get a coherent sample that doesn't race with exit() and exec(): * exit() may be concurrently reaping a task and flushing its time * sighand is unstable under exit() and exec(), and the latter also result in group leader that can change To protect against these, locking the target's sighand is enough. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Consolidate posix_cpu_clock_get()Frederic Weisbecker
Consolidate the clock sampling common code used for both local and remote targets. Note that this introduces a tiny user ABI change: if a PID is passed to clock_gettime() along the clockid, we used to forbid a process wide clock sample when that PID doesn't belong to a group leader. Now after this patch we allow process wide clock samples if that PID belongs to the current task, even if the current task is not the group leader. But local process wide clock samples are allowed if PID == 0 (current task) even if the current task is not the group leader. So in the end this should be no big deal as this actually harmonize the behaviour when the remote sample is actually a local one. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Remove useless clock sample on timers cleanupFrederic Weisbecker
a0b2062b0904ef07944c4a6e4d0f88ee44f1e9f2 ("posix_timers: fix racy timer delta caching on task exit") forgot to remove the arguments used for timer caching. Fix this leftover. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Remove dead task special caseFrederic Weisbecker
Now that we've removed all the optimizations that could result in NULL timer's targets, we can remove all the associated special case handling. Also add some warnings on NULL targets to spot any possible leftover. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Cleanup reaped target handlingFrederic Weisbecker
When a timer's target is seen to be buried, for example on calls to timer_gettime(), the posix cpu timers code behaves a bit like a garbage collector and releases early the reference to the task. Then again, this optimization complicates the code for no much value: it's up to the user to release the timer and its associated ressources by calling timer_delete() after it buries the target tasks. Remove this to simplify the code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Remove dead process posix cpu timers cachingFrederic Weisbecker
Now that we removed dead thread posix cpu timers caching, lets remove the dead process wide version. This caching is similar to the per thread version but it should be even more rare: * If the process id dead, we are not reading its timers status from a thread belonging to its group since they are all dead. So this caching only concern remote process timers reads. Now posix cpu timers using itimers or timer_settime() can't do remote process timers anyway so it's not even clear if there is actually a user for this caching. * Unlike per thread timers caching, this only applies to zombies targets. Buried targets' process wide timers return 0 values. But then again, timer_gettime() can't read remote process timers, so if the process is dead, there can't be any reader left anyway. Then again this caching seem to complicate the code for corner cases that are probably not worth it. So lets get rid of it. Also remove the sample snapshot on dying process timer that is now useless, as suggested by Kosaki. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Remove dead thread posix cpu timers cachingFrederic Weisbecker
When a task is exiting or has exited, its posix cpu timers don't tick anymore and won't elapse further. It's too late for them to expire. So any further call to timer_gettime() on these timers will return the same remaining expiry time. The current code optimize this by caching the remaining delta and storing it where we use to save the absolute expiration time. This way, the future calls to timer_gettime() won't need to compute the difference between the absolute expiration time and the current time anymore. Now this optimization doesn't seem to bring much value. Computing the timer remaining delta is not very costly. Fetching the timer value OTOH can be costly in two ways: * CPUCLOCK_SCHED read requires to lock the target's rq. But some optimizations are on the way to make task_sched_runtime() not holding the rq lock of a non-running target. * CPUCLOCK_VIRT/CPUCLOCK_PROF read simply consist in fetching current->utime/current->stime except when the system uses full dynticks cputime accounting. The latter requires a per task lock in order to correctly compute user and system time. But once the target is dead, this lock shouldn't be contended anyway. All in one this caching doesn't seem to be justified. Given that it complicates the code significantly for few wins, let's remove it on single thread timers. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-07PCI: Disable Bus Master only on kexec rebootKhalid Aziz
Add a flag to tell the PCI subsystem that kernel is shutting down in preparation to kexec a kernel. Add code in PCI subsystem to use this flag to clear Bus Master bit on PCI devices only in case of kexec reboot. This fixes a power-off problem on Acer Aspire V5-573G and likely other machines and avoids any other issues caused by clearing Bus Master bit on PCI devices in normal shutdown path. The problem was introduced by b566a22c2332 ("PCI: disable Bus Master on PCI device shutdown"). This patch is based on discussion at http://marc.info/?l=linux-pci&m=138425645204355&w=2 Link: https://bugzilla.kernel.org/show_bug.cgi?id=63861 Reported-by: Chang Liu <cl91tp@gmail.com> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: stable@vger.kernel.org # v3.5+
2013-12-06cgroup: fix cgroup_create() error handling pathTejun Heo
ae7f164a09 ("cgroup: move cgroup->subsys[] assignment to online_css()") moved cgroup->subsys[] assignements later in cgroup_create() but didn't update error handling path accordingly leading to the following oops and leaking later css's after an online_css() failure. The oops is from cgroup destruction path being invoked on the partially constructed cgroup which is not ready to handle empty slots in cgrp->subsys[] array. BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0 PGD a780a067 PUD aadbe067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69 Hardware name: task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000 RIP: 0010:[<ffffffff810eeaa8>] [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0 RSP: 0018:ffff8800a781bd98 EFLAGS: 00010282 RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820 RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820 RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1 R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0 R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001 FS: 00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0 Stack: ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820 ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40 ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf Call Trace: [<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0 [<ffffffff811c90ae>] vfs_mkdir+0xee/0x140 [<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0 [<ffffffff811c6a19>] SyS_mkdir+0x19/0x20 [<ffffffff8169e569>] system_call_fastpath+0x16/0x1b This patch moves reference bumping inside online_css() loop, clears css_ar[] as css's are brought online successfully, and updates err_destroy path so that either a css is fully online and destroyed by cgroup_destroy_locked() or the error path frees it. This creates a duplicate css free logic in the error path but it will be cleaned up soon. v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if invoked with a cgroup which doesn't have all css's populated. Update cgroup_destroy_locked() so that it skips NULL css's. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Reported-by: Vladimir Davydov <vdavydov@parallels.com> Cc: stable@vger.kernel.org # v3.12+
2013-12-06Merge tag 'trace-fixes-3.13-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "A regression showed up that there's a large delay when enabling all events. This was prevalent when FTRACE_SELFTEST was enabled which enables all events several times, and caused the system bootup to pause for over a minute. This was tracked down to an addition of a synchronize_sched() performed when system call tracepoints are unregistered. The synchronize_sched() is needed between the unregistering of the system call tracepoint and a deletion of a tracing instance buffer. But placing the synchronize_sched() in the unreg of *every* system call tracepoint is a bit overboard. A single synchronize_sched() before the deletion of the instance is sufficient" * tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Only run synchronize_sched() at instance deletion time
2013-12-05tracing: Only run synchronize_sched() at instance deletion timeSteven Rostedt
It has been reported that boot up with FTRACE_SELFTEST enabled can take a very long time. There can be stalls of over a minute. This was tracked down to the synchronize_sched() called when a system call event is disabled. As the self tests enable and disable thousands of events, this makes the synchronize_sched() get called thousands of times. The synchornize_sched() was added with d562aff93bfb53 "tracing: Add support for SOFT_DISABLE to syscall events" which caused this regression (added in 3.13-rc1). The synchronize_sched() is to protect against the events being accessed when a tracer instance is being deleted. When an instance is being deleted all the events associated to it are unregistered. The synchronize_sched() makes sure that no more users are running when it finishes. Instead of calling synchronize_sched() for all syscall events, we only need to call it once, after the events are unregistered and before the instance is deleted. The event_mutex is held during this action to prevent new users from enabling events. Link: http://lkml.kernel.org/r/20131203124120.427b9661@gandalf.local.home Reported-by: Petr Mladek <pmladek@suse.cz> Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com> Acked-by: Petr Mladek <pmladek@suse.cz> Tested-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-12-05sched/numa: Drop idx field of task_numa_env structWanpeng Li
Drop unused idx field of task_numa_env struct. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1386241817-5051-2-git-send-email-liwanp@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-04Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: - timekeeping: Cure a subtle drift issue on GENERIC_TIME_VSYSCALL_OLD - nohz: Make CONFIG_NO_HZ=n and nohz=off command line option behave the same way. Fixes a long standing load accounting wreckage. - clocksource/ARM: Kconfig update to avoid ARM=n wreckage - clocksource/ARM: Fixlets for the AT91 and SH clocksource/clockevents - Trivial documentation update and kzalloc conversion from akpms pile * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: nohz: Fix another inconsistency between CONFIG_NO_HZ=n and nohz=off time: Fix 1ns/tick drift w/ GENERIC_TIME_VSYSCALL_OLD clocksource: arm_arch_timer: Hide eventstream Kconfig on non-ARM clocksource: sh_tmu: Add clk_prepare/unprepare support clocksource: sh_tmu: Release clock when sh_tmu_register() fails clocksource: sh_mtu2: Add clk_prepare/unprepare support clocksource: sh_mtu2: Release clock when sh_mtu2_register() fails ARM: at91: rm9200: switch back to clockevents_config_and_register tick: Document tick_do_timer_cpu timer: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) NOHZ: Check for nohz active instead of nohz enabled
2013-12-04Merge branch 'timers/core-v2' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/core Pull dynticks updates from Frederic Weisbecker: * Fix a bug where posix cpu timers requeued due to interval got ignored on full dynticks CPUs (not a regression though as it only impacts full dynticks and the bug is there since we merged full dynticks). * Optimizations and cleanups on the use of per CPU APIs to improve code readability, performance and debuggability in the nohz subsystem; * Optimize posix cpu timer by sparing stub workqueue queue with full dynticks off case * Rename some functions to extend with *_this_cpu() suffix for clarity * Refine the naming of some context tracking subsystem state accessors * Trivial spelling fix by Paul Gortmaker Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-03rcu: Let the world know when RCU adjusts its geometryPaul E. McKenney
Some RCU bugs have been specific to the layout of the rcu_node tree, but RCU will silently adjust the tree at boot time if appropriate. This obscures valuable debugging information, so print a message when this happens. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-03rcu: Fix srcu_barrier() docbook headerPaul E. McKenney
The srcu_barrier() docbook header left out the "sp" argument, so this commit adds that argument's docbook text. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-03rcu: Allow task-level idle entry/exit nestingPaul E. McKenney
The current task-level idle entry/exit code forces an entry/exit on each call, regardless of the nesting level. This commit therefore properly accounts for nesting. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-12-03rcu: Break call_rcu() deadlock involving scheduler and perfPaul E. McKenney
Dave Jones got the following lockdep splat: > ====================================================== > [ INFO: possible circular locking dependency detected ] > 3.12.0-rc3+ #92 Not tainted > ------------------------------------------------------- > trinity-child2/15191 is trying to acquire lock: > (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50 > > but task is already holding lock: > (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230 > > which lock already depends on the new lock. > > > the existing dependency chain (in reverse order) is: > > -> #3 (&ctx->lock){-.-...}: > [<ffffffff810cc243>] lock_acquire+0x93/0x200 > [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80 > [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0 > [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0 > [<ffffffff81732052>] __schedule+0x1d2/0xa20 > [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0 > [<ffffffff817352b6>] retint_kernel+0x26/0x30 > [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50 > [<ffffffff813f0504>] pty_write+0x54/0x60 > [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0 > [<ffffffff813e5838>] tty_write+0x158/0x2d0 > [<ffffffff811c4850>] vfs_write+0xc0/0x1f0 > [<ffffffff811c52cc>] SyS_write+0x4c/0xa0 > [<ffffffff8173d4e4>] tracesys+0xdd/0xe2 > > -> #2 (&rq->lock){-.-.-.}: > [<ffffffff810cc243>] lock_acquire+0x93/0x200 > [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80 > [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0 > [<ffffffff81054336>] do_fork+0x126/0x460 > [<ffffffff81054696>] kernel_thread+0x26/0x30 > [<ffffffff8171ff93>] rest_init+0x23/0x140 > [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403 > [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c > [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4 > > -> #1 (&p->pi_lock){-.-.-.}: > [<ffffffff810cc243>] lock_acquire+0x93/0x200 > [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90 > [<ffffffff810979d1>] try_to_wake_up+0x31/0x350 > [<ffffffff81097d62>] default_wake_function+0x12/0x20 > [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40 > [<ffffffff8108ea38>] __wake_up_common+0x58/0x90 > [<ffffffff8108ff59>] __wake_up+0x39/0x50 > [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0 > [<ffffffff81111450>] __call_rcu+0x140/0x820 > [<ffffffff81111b8d>] call_rcu+0x1d/0x20 > [<ffffffff81093697>] cpu_attach_domain+0x287/0x360 > [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0 > [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a > [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202 > [<ffffffff817200be>] kernel_init+0xe/0x190 > [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0 > > -> #0 (&rdp->nocb_wq){......}: > [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0 > [<ffffffff810cc243>] lock_acquire+0x93/0x200 > [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90 > [<ffffffff8108ff43>] __wake_up+0x23/0x50 > [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0 > [<ffffffff81111450>] __call_rcu+0x140/0x820 > [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30 > [<ffffffff81149abf>] put_ctx+0x4f/0x70 > [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230 > [<ffffffff81056b8d>] do_exit+0x30d/0xcc0 > [<ffffffff8105893c>] do_group_exit+0x4c/0xc0 > [<ffffffff810589c4>] SyS_exit_group+0x14/0x20 > [<ffffffff8173d4e4>] tracesys+0xdd/0xe2 > > other info that might help us debug this: > > Chain exists of: > &rdp->nocb_wq --> &rq->lock --> &ctx->lock > > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(&ctx->lock); > lock(&rq->lock); > lock(&ctx->lock); > lock(&rdp->nocb_wq); > > *** DEADLOCK *** > > 1 lock held by trinity-child2/15191: > #0: (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230 > > stack backtrace: > CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92 > ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40 > ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0 > ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0 > Call Trace: > [<ffffffff8172a363>] dump_stack+0x4e/0x82 > [<ffffffff81726741>] print_circular_bug+0x200/0x20f > [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0 > [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60 > [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80 > [<ffffffff810cc243>] lock_acquire+0x93/0x200 > [<ffffffff8108ff43>] ? __wake_up+0x23/0x50 > [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90 > [<ffffffff8108ff43>] ? __wake_up+0x23/0x50 > [<ffffffff8108ff43>] __wake_up+0x23/0x50 > [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0 > [<ffffffff81111450>] __call_rcu+0x140/0x820 > [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50 > [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30 > [<ffffffff81149abf>] put_ctx+0x4f/0x70 > [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230 > [<ffffffff81056b8d>] do_exit+0x30d/0xcc0 > [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0 > [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10 > [<ffffffff8105893c>] do_group_exit+0x4c/0xc0 > [<ffffffff810589c4>] SyS_exit_group+0x14/0x20 > [<ffffffff8173d4e4>] tracesys+0xdd/0xe2 The underlying problem is that perf is invoking call_rcu() with the scheduler locks held, but in NOCB mode, call_rcu() will with high probability invoke the scheduler -- which just might want to use its locks. The reason that call_rcu() needs to invoke the scheduler is to wake up the corresponding rcuo callback-offload kthread, which does the job of starting up a grace period and invoking the callbacks afterwards. One solution (championed on a related problem by Lai Jiangshan) is to simply defer the wakeup to some point where scheduler locks are no longer held. Since we don't want to unnecessarily incur the cost of such deferral, the task before us is threefold: 1. Determine when it is likely that a relevant scheduler lock is held. 2. Defer the wakeup in such cases. 3. Ensure that all deferred wakeups eventually happen, preferably sooner rather than later. We use irqs_disabled_flags() as a proxy for relevant scheduler locks being held. This works because the relevant locks are always acquired with interrupts disabled. We may defer more often than needed, but that is at least safe. The wakeup deferral is tracked via a new field in the per-CPU and per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup. This flag is checked by the RCU core processing. The __rcu_pending() function now checks this flag, which causes rcu_check_callbacks() to initiate RCU core processing at each scheduling-clock interrupt where this flag is set. Of course this is not sufficient because scheduling-clock interrupts are often turned off (the things we used to be able to count on!). So the flags are also checked on entry to any state that RCU considers to be idle, which includes both NO_HZ_IDLE idle state and NO_HZ_FULL user-mode-execution state. This approach should allow call_rcu() to be invoked regardless of what locks you might be holding, the key word being "should". Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org>
2013-12-03rcu: Fix and comment ordering around wait_event()Paul E. McKenney
It is all too easy to forget that wait_event() does not necessarily imply a full memory barrier. The case where it does not is where the condition transitions to true just as wait_event() starts execution. This is actually a feature: The standard use of wait_event() involves locking, in which case the locks provide the needed ordering (you hold a lock across the wake_up() and acquire that same lock after wait_event() returns). Given that I did forget that wait_event() does not necessarily imply a full memory barrier in one case, this commit fixes that case. This commit also adds comments calling out the placement of existing memory barriers relied on by wait_event() calls. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-03rcu: Kick CPU halfway to RCU CPU stall warningPaul E. McKenney
When an RCU CPU stall warning occurs, the CPU invokes resched_cpu() on itself. This can help move the grace period forward in some situations, but it would be even better to do this -before- the RCU CPU stall warning. This commit therefore causes resched_cpu() to be called every five jiffies once the system is halfway to an RCU CPU stall warning. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-02posix-timers: Fix full dynticks CPUs kick on timer reschedulingFrederic Weisbecker
A posix CPU timer can be rearmed while it is firing or after it is notified with a signal. This can happen for example with timers that were set with a non zero interval in timer_settime(). This rearming can happen in two places: 1) On timer firing time, which happens on the target's tick. If the timer can't trigger a signal because it is ignored, it reschedules itself to honour the timer interval. 2) On signal handling from the timer's notification target. This one can be a different task than the timer's target itself. Once the signal is notified, the notification target rearms the timer, again to honour the timer interval. When a timer is rearmed, we need to notify the full dynticks CPUs such that they restart their tick in case they are running tasks that may have a share in elapsing this timer. Now the 1st case above handles full dynticks CPUs with a call to posix_cpu_timer_kick_nohz() from the posix cpu timer firing code. But the second case ignores the fact that some CPUs may run non-idle tasks with their tick off. As a result, when a timer is resheduled after its signal notification, the full dynticks CPUs may completely ignore it and not tick on the timer as expected This patch fixes this bug by handling both cases in one. All we need is to move the kick to the rearming common code in posix_cpu_timer_schedule(). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Olivier Langlois <olivier@olivierlanglois.net>
2013-12-02posix-timers: Spare workqueue if there is no full dynticks CPU to kickFrederic Weisbecker
After a posix cpu timer is set, a workqueue is scheduled in order to kick the full dynticks CPUs and let them restart their tick if necessary in case the task they are running is concerned by the new timer. This kick is implemented by way of IPIs, which require interrupts to be enabled, hence the need for a workqueue to raise them because the posix cpu timer set path has interrupts disabled. Now if there is no full dynticks CPU on the system, the workqueue is still scheduled but it simply won't send any IPI and return immediately. So lets spare that worqueue when it is not needed. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-12-02context_tracking: Wrap static key check into more intuitive function nameFrederic Weisbecker
Use a function with a meaningful name to check the global context tracking state. static_key_false() is a bit confusing for reviewers. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-12-02nohz: Convert a few places to use local per cpu accessesFrederic Weisbecker
A few functions use remote per CPU access APIs when they deal with local values. Just do the right conversion to improve performance, code readability and debug checks. While at it, lets extend some of these function names with *_this_cpu() suffix in order to display their purpose more clearly. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-12-02Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: - Correction of fuzzy and fragile IRQ_RETVAL macro - IRQ related resume fix affecting only XEN - ARM/GIC fix for chained GIC controllers * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip: Gic: fix boot for chained gics irq: Enable all irqs unconditionally in irq_resume genirq: Correct fuzzy and fragile IRQ_RETVAL() definition
2013-12-02Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Various smaller fixlets, all over the place" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/doc: Fix generation of device-drivers sched: Expose preempt_schedule_irq() sched: Fix a trivial typo in comments sched: Remove unused variable in 'struct sched_domain' sched: Avoid NULL dereference on sd_busy sched: Check sched_domain before computing group power MAINTAINERS: Update file patterns in the lockdep and scheduler entries
2013-12-02Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc kernel and tooling fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tools lib traceevent: Fix conversion of pointer to integer of different size perf/trace: Properly use u64 to hold event_id perf: Remove fragile swevent hlist optimization ftrace, perf: Avoid infinite event generation loop tools lib traceevent: Fix use of multiple options in processing field perf header: Fix possible memory leaks in process_group_desc() perf header: Fix bogus group name perf tools: Tag thread comm as overriden