path: root/kernel
Age | Commit message | Author
2014-07-30 | Josh has moved | Josh Triplett
My IBM email addresses haven't worked for years; also map some old-but-functional forwarding addresses to my canonical address. Update my GPG key fingerprint; I moved to 4096R a long time ago. Update description. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-30 | kexec: export free_huge_page to VMCOREINFO | Atsushi Kumagai
PG_head_mask was added to VMCOREINFO to filter huge pages in b3acc56bfe1 ("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still needs another symbol to filter *hugetlbfs* pages. If a user wants to filter user pages, makedumpfile tries to exclude them by checking whether the page is anonymous, but hugetlbfs pages are not anonymous even though they are also user pages. We know it's possible to detect them in the same way as PageHuge(), so we need the start address of free_huge_page():

    int PageHuge(struct page *page)
    {
        if (!PageCompound(page))
            return 0;
        page = compound_head(page);
        return get_compound_page_dtor(page) == free_huge_page;
    }

For that reason, this patch makes free_huge_page() public so that it can be exported to VMCOREINFO. Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Acked-by: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-27 | Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull perf fixes from Thomas Gleixner:
"A bunch of fixes for perf and kprobes:
 - revert a commit that caused a perf group regression
 - silence dmesg spam
 - fix kprobe probing errors on ia64 and ppc64
 - filter kprobe faults from userspace
 - lockdep fix for perf exit path
 - prevent perf #GP in KVM guest
 - correct perf event and filters"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
  kprobes/x86: Don't try to resolve kprobe faults from userspace
  perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
  perf/x86/intel: Protect LBR and extra_regs against KVM lying
  perf: Fix lockdep warning on process exit
  perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
  perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
  perf: Revert ("perf: Always destroy groups on exit")
2014-07-21 | tracing: Fix wraparound problems in "uptime" trace clock | Tony Luck
The "uptime" trace clock added in: commit 8aacf017b065a805d27467843490c976835eb4a5 tracing: Add "uptime" trace clock that uses jiffies has wraparound problems when the system has been up more than 1 hour 11 minutes and 34 seconds. It converts jiffies to nanoseconds using: (u64)jiffies_to_usecs(jiffy) * 1000ULL but since jiffies_to_usecs() only returns a 32-bit value, it truncates at 2^32 microseconds. An additional problem on 32-bit systems is that the argument is "unsigned long", so fixing the return value only helps until 2^32 jiffies (49.7 days on a HZ=1000 system). Avoid these problems by using jiffies_64 as our basis, and not converting to nanoseconds (we do convert to clock_t because user facing API must not be dependent on internal kernel HZ values). Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com Cc: stable@vger.kernel.org # 3.10+ Fixes: 8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies" Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
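The truncation described above can be reproduced in user space. The following is a minimal sketch, not kernel code: `usecs32_from_jiffies()` is a hypothetical stand-in for a conversion that returns only 32 bits, and `uptime_ns_wide()` exists purely for comparison (the actual fix uses jiffies_64 and converts to clock_t instead). It assumes the tick rate divides 1,000,000 evenly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a jiffies-to-microseconds helper whose
 * return type is only 32 bits wide. Assumes hz divides 1000000. */
static uint32_t usecs32_from_jiffies(uint64_t jiffies, uint32_t hz)
{
    assert(hz != 0 && 1000000u % hz == 0);
    /* The cast drops everything at or above 2^32 microseconds. */
    return (uint32_t)(jiffies * (1000000u / hz));
}

/* The conversion quoted in the commit message: the value is widened
 * to 64 bits only after it has already been truncated. */
static uint64_t uptime_ns_truncating(uint64_t jiffies, uint32_t hz)
{
    return (uint64_t)usecs32_from_jiffies(jiffies, hz) * 1000ULL;
}

/* Comparison only: the same arithmetic kept in 64 bits throughout. */
static uint64_t uptime_ns_wide(uint64_t jiffies, uint32_t hz)
{
    return jiffies * (1000000u / hz) * 1000ULL;
}
```

With HZ=1000, 2^32 microseconds corresponds to roughly 4,294,967 jiffies, which is about 1 hour 11 minutes 34 seconds; one jiffy later the truncating version starts again near zero while the wide version keeps counting.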
2014-07-19 | Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull locking fixes from Thomas Gleixner:
"The locking department delivers:
 - A rather large and intrusive bundle of fixes to address serious performance regressions introduced by the new rwsem / mcs technology. Simpler solutions have been discussed, but they would have been ugly bandaids with more risk than doing the right thing.
 - Make the rwsem spin on owner technology opt-in for architectures and enable it only on the known to work ones.
 - A few fixes to the lockdep userspace library"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER
  locking/mutex: Disable optimistic spinning on some architectures
  locking/rwsem: Reduce the size of struct rw_semaphore
  locking/rwsem: Rename 'activity' to 'count'
  locking/spinlocks/mcs: Micro-optimize osq_unlock()
  locking/spinlocks/mcs: Introduce and use init macro and function for osq locks
  locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead
  locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node()
  locking/rwsem: Allow conservative optimistic spinning when readers have lock
  tools/liblockdep: Account for bitfield changes in lockdeps lock_acquire
  tools/liblockdep: Remove debug print left over from development
  tools/liblockdep: Fix comparison of a boolean value with a value of 2
2014-07-19 | Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull scheduler fix from Thomas Gleixner:
"Prevent a possible divide by zero in the debugging code"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix possible divide by zero in avg_atom() calculation
2014-07-19 | Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull timer fix from Thomas Gleixner:
"A single fix for a long standing issue in the alarm timer subsystem, which was noticed recently when people finally started to use alarm timers for serious work"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  alarmtimer: Fix bug where relative alarm timers were treated as absolute
2014-07-19 | Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull RCU fixes from Thomas Gleixner:
"Two RCU patches:
 - Address a serious performance regression on open/close caused by commit ac1bea85781e ("Make cond_resched() report RCU quiescent states")
 - Export RCU debug functions. Not a regression, but enablement to address a serious recursion bug in the sl*b allocators in 3.17"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Reduce overhead of cond_resched() checks for RCU
  rcu: Export debug_init_rcu_head() and and debug_init_rcu_head()
2014-07-18 | Merge tag 'pm+acpi-3.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Linus Torvalds
Pull ACPI and power management fixes from Rafael Wysocki:
"These are a few recent regression fixes, a revert of the ACPI video commit I promised, a system resume fix related to request_firmware(), an ACPI video quirk for one more Win8-oriented BIOS, an ACPI device enumeration documentation update and a few fixes for ARM cpufreq drivers.
Specifics:
 - Fix for a recently introduced NULL pointer dereference in the core system suspend code occurring when platforms without ACPI attempt to use the "freeze" sleep state from Zhang Rui.
 - Fix for a recently introduced build warning in cpufreq headers from Brian W Hart.
 - Fix for a 3.13 cpufreq regression related to system resume that triggers on some systems with multiple CPU clusters from Viresh Kumar.
 - Fix for a 3.4 regression in request_firmware() resulting in WARN_ON()s on some systems during system resume from Takashi Iwai.
 - Revert of the ACPI video commit that changed the default value of the video.brightness_switch_enabled command line argument to 0 as it has been reported to break existing setups.
 - ACPI device enumeration documentation update to take recent code changes into account and make the documentation match the code again from Darren Hart.
 - Fixes for the sa1110, imx6q, kirkwood, and cpu0 cpufreq drivers from Linus Walleij, Nicolas Del Piano, Quentin Armitage, Viresh Kumar.
 - New ACPI video blacklist entry for HP ProBook 4540s from Hans de Goede"
* tag 'pm+acpi-3.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: make table sentinel macros unsigned to match use
  cpufreq: move policy kobj to policy->cpu at resume
  cpufreq: cpu0: OPPs can be populated at runtime
  cpufreq: kirkwood: Reinstate cpufreq driver for ARCH_KIRKWOOD
  cpufreq: imx6q: Select PM_OPP
  cpufreq: sa1110: set memory type for h3600
  ACPI / video: Add use_native_backlight quirk for HP ProBook 4540s
  PM / sleep: fix freeze_ops NULL pointer dereferences
  PM / sleep: Fix request_firmware() error at resume
  Revert "ACPI / video: change acpi-video brightness_switch_enabled default to 0"
  ACPI / documentation: Remove reference to acpi_platform_device_ids from enumeration.txt
2014-07-18 | kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64 | Masami Hiramatsu
On ia64 and ppc64, function pointers do not point to the entry address of the function, but to the address of a function descriptor (which contains the entry address and misc data). Since the kprobes code passes the function pointer stored by NOKPROBE_SYMBOL() to kallsyms_lookup_size_offset() for initializing its blacklist, it fails and reports many errors, such as: Failed to find blacklist 0001013168300000 Failed to find blacklist 0001013000f0a000 [...] To fix this bug, use arch_deref_entry_point() to get the function entry address for kallsyms_lookup_size_offset() instead of the raw function pointer. Suzuki also pointed out that blacklist entries should be updated as well. Reported-by: Tony Luck <tony.luck@gmail.com> Fixed-by: Suzuki K. Poulose <suzuki@in.ibm.com> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (for powerpc) Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: sparse@chrisli.org Cc: Paul Mackerras <paulus@samba.org> Cc: akataria@vmware.com Cc: anil.s.keshavamurthy@intel.com Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Chris Wright <chrisw@sous-sol.org> Cc: yrl.pp-manager.tt@hitachi.com Cc: Kevin Hao <haokexin@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: rdunlap@infradead.org Cc: dl9pf@gmx.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: David S. Miller <davem@davemloft.net> Cc: linux-ia64@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20140717114411.13401.2632.stgit@kbuild-fedora.novalocal Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-17 | Merge tag 'trace-fixes-v3.16-rc5-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace | Linus Torvalds
Pull tracing fixes from Steven Rostedt:
"A few more fixes for ftrace infrastructure.
I was cleaning out my INBOX and found two fixes from zhangwei from a year ago that were lost in my mail. These fix an inconsistency between trace_puts() and the way trace_printk() works. The reason this is important to fix is because when trace_printk() doesn't have any arguments, it turns into a trace_puts(). Not being able to enable a stack trace against trace_printk() because it does not have any arguments is quite confusing. Also, the fix is rather trivial and low risk.
While porting some changes to PowerPC I discovered that it still has the function graph tracer filter bug that if you also enable stack tracing the function graph tracer filter is ignored. I fixed that up.
Finally, Martin Lau fixed a bug that would cause readers of the ftrace ring buffer to block forever even though it was supposed to be NONBLOCK"
This also includes the fix from an earlier pull request:
"Oleg Nesterov fixed a memory leak that happens if a user creates a tracing instance, sets up a filter in an event, and then removes that instance. The filter allocates memory that is never freed when the instance is destroyed"
* tag 'trace-fixes-v3.16-rc5-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Fix polling on trace_pipe
  tracing: Add TRACE_ITER_PRINTK flag check in __trace_puts/__trace_bputs
  tracing: Fix graph tracer with stack tracer on other archs
  tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs
  tracing: instance_rmdir() leaks ftrace_event_file->filter
2014-07-16 | Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull perf fixes from Ingo Molnar:
"Tooling fixes and an Intel PMU driver fixlet"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Do not allow optimized switch for non-cloned events
  perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
  perf symbols: Get kernel start address by symbol name
  perf tools: Fix segfault in cumulative.callchain report
2014-07-16 | locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER | Davidlohr Bueso
Just like with mutexes (CONFIG_MUTEX_SPIN_ON_OWNER), encapsulate the dependencies for rwsem optimistic spinning. No logical changes here as it continues to depend on both SMP and the XADD algorithm variant. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Jason Low <jason.low2@hp.com> [ Also make it depend on ARCH_SUPPORTS_ATOMIC_RMW. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1405112406-13052-2-git-send-email-davidlohr@hp.com Cc: aswin@hp.com Cc: Chris Mason <clm@fb.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Josef Bacik <jbacik@fusionio.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 | locking/mutex: Disable optimistic spinning on some architectures | Peter Zijlstra
The optimistic spin code assumes regular stores and cmpxchg() play nice; this is found to not be true for at least: parisc, sparc32, tile32, metag-lock1, arc-!llsc and hexagon. There is further wreckage, but this in particular seemed easy to trigger, so blacklist this. Opt in for known good archs. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Jason Low <jason.low2@hp.com> Cc: Waiman Long <waiman.long@hp.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: John David Anglin <dave.anglin@bell.net> Cc: James Hogan <james.hogan@imgtec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: stable@vger.kernel.org Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 | locking/rwsem: Rename 'activity' to 'count' | Peter Zijlstra
There are two definitions of struct rw_semaphore, one in linux/rwsem.h and one in linux/rwsem-spinlock.h. For some reason they have different names for the initial field. This makes it impossible to use C99 named initialization for __RWSEM_INITIALIZER() -- or we have to duplicate that entire thing along with the structure definitions. The simpler patch is renaming the rwsem-spinlock variant to match the regular rwsem. This allows us to switch to C99 named initialization. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-bmrZolsbGmautmzrerog27io@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
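As a rough illustration of why a shared field name matters for C99 named initialization, the sketch below uses two hypothetical struct variants selected at build time. The names `rw_sem_sketch`, `USE_SPINLOCK_VARIANT`, and `RW_SEM_SKETCH_INITIALIZER` are invented for this example and are not the kernel's definitions.

```c
#include <assert.h>

/* Two hypothetical variants of the same lock structure. Because both
 * call their first field 'count', a single designated initializer
 * works for either one. If one variant named it 'activity', the
 * initializer macro would have to be duplicated per variant. */
#ifdef USE_SPINLOCK_VARIANT
struct rw_sem_sketch { int count; int wait_lock; };
#else
struct rw_sem_sketch { long count; int wait_lock; };
#endif

/* One initializer shared by both variants (C99 designated fields). */
#define RW_SEM_SKETCH_INITIALIZER { .count = 0, .wait_lock = 0 }

static struct rw_sem_sketch example_sem = RW_SEM_SKETCH_INITIALIZER;

/* Small accessor so the initialized value can be inspected. */
static long sem_count(const struct rw_sem_sketch *sem)
{
    assert(sem != 0);
    return sem->count;
}
```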
2014-07-16 | sched: Fix possible divide by zero in avg_atom() calculation | Mateusz Guzik
proc_sched_show_task() does: if (nr_switches) do_div(avg_atom, nr_switches); nr_switches is unsigned long and do_div truncates it to 32 bits, which means it can test non-zero on e.g. x86-64 and be truncated to zero for division. Fix the problem by using div64_ul() instead. As a side effect calculations of avg_atom for big nr_switches are now correct. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
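The failure mode can be modeled in user space. This is a sketch under the assumption stated in the message, namely that the divisor is narrowed to 32 bits before dividing; `div_narrowed_divisor()` and `div_full_divisor()` are hypothetical helpers, not the kernel's do_div() or div64_ul().

```c
#include <assert.h>
#include <stdint.h>

/* Models a division whose divisor is silently narrowed to 32 bits.
 * Sets *would_div_by_zero when the narrowed divisor is 0, instead of
 * actually dividing by zero. */
static uint64_t div_narrowed_divisor(uint64_t dividend, uint64_t divisor,
                                     int *would_div_by_zero)
{
    uint32_t narrowed = (uint32_t)divisor;   /* upper 32 bits are lost */

    *would_div_by_zero = (narrowed == 0);
    return narrowed ? dividend / narrowed : 0;
}

/* Division that keeps the full 64-bit divisor. */
static uint64_t div_full_divisor(uint64_t dividend, uint64_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor;
}
```

A divisor of exactly 2^32 passes a non-zero check yet narrows to zero, and a divisor such as 2^32 + 2 narrows to 2 and yields a wrong quotient, which matches the side effect the message mentions for large nr_switches.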
2014-07-16 | locking/spinlocks/mcs: Micro-optimize osq_unlock() | Jason Low
In the unlock function of the cancellable MCS spinlock, the first thing we do is retrieve the current CPU's osq node. However, due to the changes made in the previous patch, in the common case where the lock is not contended, we no longer need to access the current CPU's osq node. This patch optimizes this by retrieving this CPU's osq node only after we attempt the initial cmpxchg to unlock the osq and find that it is contended. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Scott Norton <scott.norton@hp.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1405358872-3732-5-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 | locking/spinlocks/mcs: Introduce and use init macro and function for osq locks | Jason Low
Currently, we initialize the osq lock by directly setting the lock's values. It would be preferable if we use an init macro to do the initialization like we do with other locks. This patch introduces and uses a macro and function for initializing the osq lock. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Scott Norton <scott.norton@hp.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fusionio.com> Link: http://lkml.kernel.org/r/1405358872-3732-4-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 | locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead | Jason Low
The cancellable MCS spinlock is currently used to queue threads that are doing optimistic spinning. It uses per-cpu nodes, where a thread obtaining the lock would access and queue the local node corresponding to the CPU that it's running on. Currently, the cancellable MCS lock is implemented by using pointers to these nodes. In this patch, instead of operating on pointers to the per-cpu nodes, we store the CPU numbers in which the per-cpu nodes correspond to in atomic_t. A similar concept is used with the qspinlock. By operating on the CPU # of the nodes using atomic_t instead of pointers to those nodes, this can reduce the overhead of the cancellable MCS spinlock by 32 bits (on 64 bit systems). Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Scott Norton <scott.norton@hp.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Chris Mason <clm@fb.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Josef Bacik <jbacik@fusionio.com> Link: http://lkml.kernel.org/r/1405358872-3732-3-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
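The idea of storing a CPU number in an atomic integer rather than a node pointer can be sketched with C11 atomics. Everything here is illustrative: `spin_queue_sketch`, `QUEUE_EMPTY`, and the helper names are hypothetical, and the "CPU number plus one, with 0 meaning empty" encoding is an assumption made for this example, not something stated in the message.

```c
#include <assert.h>
#include <stdatomic.h>

#define QUEUE_EMPTY 0   /* assumed sentinel: no CPU is queued */

/* The lock handle holds one atomic int (typically 32 bits) instead of
 * a pointer (64 bits on 64-bit systems). */
struct spin_queue_sketch {
    atomic_int tail;
};

/* Assumed encoding: CPU n is stored as n + 1 so that 0 can mean empty. */
static int encode_cpu_sketch(int cpu)
{
    assert(cpu >= 0);
    return cpu + 1;
}

static int decode_cpu_sketch(int val)
{
    assert(val > 0);
    return val - 1;
}

/* Make this CPU the new tail; returns the previous tail value, which is
 * QUEUE_EMPTY when the queue was uncontended. */
static int queue_push_tail(struct spin_queue_sketch *q, int cpu)
{
    return atomic_exchange(&q->tail, encode_cpu_sketch(cpu));
}

/* Uncontended release: succeeds only if this CPU is still the tail.
 * No per-CPU node has to be looked up on this path. */
static int queue_try_clear(struct spin_queue_sketch *q, int cpu)
{
    int expected = encode_cpu_sketch(cpu);

    return atomic_compare_exchange_strong(&q->tail, &expected, QUEUE_EMPTY);
}
```

A caller that gets a non-empty previous tail from `queue_push_tail()` would decode it to find the predecessor's per-CPU node, which is where the pointer used to be stored directly.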
2014-07-16 | locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node() | Jason Low
Currently, the per-cpu nodes structure for the cancellable MCS spinlock is named "optimistic_spin_queue". However, in a follow up patch in the series we will be introducing a new structure that serves as the new "handle" for the lock. It would make more sense if that structure is named "optimistic_spin_queue". Additionally, since the current use of the "optimistic_spin_queue" structure are "nodes", it might be better if we rename them to "node" anyway. This preparatory patch renames all current "optimistic_spin_queue" to "optimistic_spin_node". Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Scott Norton <scott.norton@hp.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Chris Mason <clm@fb.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Josef Bacik <jbacik@fusionio.com> Link: http://lkml.kernel.org/r/1405358872-3732-2-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 | locking/rwsem: Allow conservative optimistic spinning when readers have lock | Jason Low
Commit 4fc828e24cd9 ("locking/rwsem: Support optimistic spinning") introduced a major performance regression for workloads such as xfs_repair which mix read and write locking of the mmap_sem across many threads. The result was xfs_repair ran 5x slower on 3.16-rc2 than on 3.15 and using 20x more system CPU time. Perf profiles indicate in some workloads that significant time can be spent spinning on !owner. This is because we don't set the lock owner when reader(s) obtain the rwsem. In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll return false if there is no lock owner. The rationale is that if we just entered the slowpath, yet there is no lock owner, then there is a possibility that a reader has the lock. To be conservative, we'll avoid spinning in these situations. This patch reduced the total run time of the xfs_repair workload from about 4 minutes 24 seconds down to approximately 1 minute 26 seconds, back to close to the same performance as on 3.15. Retesting of AIM7, which were some of the workloads used to test the original optimistic spinning code, confirmed that we still get big performance gains with optimistic spinning, even with this additional regression fix. Davidlohr found that while the 'custom' workload took a performance hit of ~-14% to throughput for >300 users with this additional patch, the overall gain with optimistic spinning is still ~+45%. The 'disk' workload even improved by ~+15% at >1000 users. Tested-by: Dave Chinner <dchinner@redhat.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1404532172.2572.30.camel@j-VirtualBox Signed-off-by: Ingo Molnar <mingo@kernel.org>
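The conservative rule described here, "no recorded owner means do not spin", can be sketched as follows. The types are hypothetical miniatures rather than the kernel's structures, and the `on_cpu` check for the case where an owner exists is an assumption added to make the example complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal task and semaphore types for illustration. */
struct task_sketch {
    bool on_cpu;                  /* is the task currently running? */
};

struct rwsem_sketch {
    struct task_sketch *owner;    /* set by writers only; NULL otherwise */
};

/* Decide whether optimistic spinning is worthwhile. */
static bool can_spin_on_owner_sketch(const struct rwsem_sketch *sem)
{
    assert(sem != NULL);

    /* No owner recorded: a reader may hold the lock, and readers do
     * not set the owner field, so be conservative and do not spin. */
    if (sem->owner == NULL)
        return false;

    /* Assumed rule: spin only while the owning writer is running. */
    return sem->owner->on_cpu;
}
```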
2014-07-16 | perf: Fix lockdep warning on process exit | Peter Zijlstra
Sasha Levin reported:
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel I've stumbled on the following spew:
>
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.15.0-next-20140613-sasha-00026-g6dd125d-dirty #654 Not tainted
> -------------------------------------------------------
> trinity-c578/9725 is trying to acquire lock:
> (&(&pool->lock)->rlock){-.-...}, at: __queue_work (kernel/workqueue.c:1346)
>
> but task is already holding lock:
> (&ctx->lock){-.....}, at: perf_event_exit_task (kernel/events/core.c:7471 kernel/events/core.c:7533)
>
> which lock already depends on the new lock.
> 1 lock held by trinity-c578/9725:
> #0: (&ctx->lock){-.....}, at: perf_event_exit_task (kernel/events/core.c:7471 kernel/events/core.c:7533)
>
> Call Trace:
> dump_stack (lib/dump_stack.c:52)
> print_circular_bug (kernel/locking/lockdep.c:1216)
> __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
> lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
> _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
> __queue_work (kernel/workqueue.c:1346)
> queue_work_on (kernel/workqueue.c:1424)
> free_object (lib/debugobjects.c:209)
> __debug_check_no_obj_freed (lib/debugobjects.c:715)
> debug_check_no_obj_freed (lib/debugobjects.c:727)
> kmem_cache_free (mm/slub.c:2683 mm/slub.c:2711)
> free_task (kernel/fork.c:221)
> __put_task_struct (kernel/fork.c:250)
> put_ctx (include/linux/sched.h:1855 kernel/events/core.c:898)
> perf_event_exit_task (kernel/events/core.c:907 kernel/events/core.c:7478 kernel/events/core.c:7533)
> do_exit (kernel/exit.c:766)
> do_group_exit (kernel/exit.c:884)
> get_signal_to_deliver (kernel/signal.c:2347)
> do_signal (arch/x86/kernel/signal.c:698)
> do_notify_resume (arch/x86/kernel/signal.c:751)
> int_signal (arch/x86/kernel/entry_64.S:600)

Urgh.. so the only way I can make that happen is through:

    perf_event_exit_task_context()
      raw_spin_lock(&child_ctx->lock);
      unclone_ctx(child_ctx)
        put_ctx(ctx->parent_ctx);
      raw_spin_unlock_irqrestore(&child_ctx->lock);

And we can avoid this by doing the change below. I can't immediately see how this changed recently, but given that you say it's easy to reproduce, let's fix this. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Dave Jones <davej@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140623141242.GB19860@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 | perf: Revert ("perf: Always destroy groups on exit") | Peter Zijlstra
Vince reported that commit 15a2d4de0eab5 ("perf: Always destroy groups on exit") causes a regression with grouped events. In particular his read_group_attached.c test fails. https://github.com/deater/perf_event_tests/blob/master/tests/bugs/read_group_attached.c Because of the context switch optimization in perf_event_context_sched_out() the 'original' event may end up in the child process and when that exits the change in the patch in question destroys the actual grouping. Therefore revert that change and only destroy inherited groups. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-zedy3uktcp753q8fw8dagx7a@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-15 | ring-buffer: Fix polling on trace_pipe | Martin Lau
ring_buffer_poll_wait() should always put the poll_table on its wait_queue, even if there is immediate data available. Otherwise, the following epoll and read sequence will eventually hang forever:
 1. Put some data to make the trace_pipe ring_buffer read ready first
 2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
 3. epoll_wait()
 4. read(trace_pipe_fd) till EAGAIN
 5. Add some more data to the trace_pipe ring_buffer
 6. epoll_wait() -> this epoll_wait() will block forever
 ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2, ring_buffer_poll_wait() returns immediately without adding poll_table, which has poll_table->_qproc pointing to ep_poll_callback(), to its wait_queue.
 ~ During the epoll_wait() call in step 3 and step 6, ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue because the poll_table->_qproc is NULL, which is how epoll works.
 ~ When there is new data available in step 6, ring_buffer does not know it has to call ep_poll_callback() because it is not in its wait queue. Hence, it blocks forever.
Other poll implementations seem to call poll_wait() unconditionally as the very first thing they do. For example, tcp_poll() in tcp.c. Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com Cc: stable@vger.kernel.org # 2.6.27 Fixes: 2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled" Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Martin Lau <kafai@fb.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
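The difference between returning early and registering first can be shown with a toy user-space model. None of this is kernel API: the `_sketch` types and functions are hypothetical, and the `can_queue` flag stands in for whether the caller supplied a queueing callback, which in the sequence above happens only during EPOLL_CTL_ADD.

```c
#include <assert.h>
#include <stdbool.h>

struct wait_queue_sketch {
    bool has_waiter;              /* was a wakeup callback registered? */
};

struct ring_buffer_sketch {
    bool data_ready;
    struct wait_queue_sketch wq;
};

/* Stand-in for poll_wait(): registers a waiter only when the caller
 * passed a queueing callback (modeled by can_queue). */
static void poll_wait_sketch(struct wait_queue_sketch *wq, bool can_queue)
{
    assert(wq != 0);
    if (can_queue)
        wq->has_waiter = true;
}

/* Problematic ordering: if data is already available, return before
 * the waiter is ever registered. */
static bool poll_early_return(struct ring_buffer_sketch *rb, bool can_queue)
{
    if (rb->data_ready)
        return true;
    poll_wait_sketch(&rb->wq, can_queue);
    return rb->data_ready;
}

/* Ordering described as the usual practice: register unconditionally
 * first, then report readiness. */
static bool poll_register_first(struct ring_buffer_sketch *rb, bool can_queue)
{
    poll_wait_sketch(&rb->wq, can_queue);
    return rb->data_ready;
}
```

In this model, a buffer that already has data when it is first polled with `can_queue` set never gets a waiter under the early-return ordering, and later polls without `can_queue` cannot add one, so new data has nobody to wake.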
2014-07-15 | tracing: Add TRACE_ITER_PRINTK flag check in __trace_puts/__trace_bputs | zhangwei(Jovi)
The TRACE_ITER_PRINTK check in __trace_puts/__trace_bputs is missing, so add it, to be consistent with __trace_printk/__trace_bprintk. Those functions are all called by the same function: trace_printk(). Link: http://lkml.kernel.org/p/51E7A7D6.8090900@huawei.com Cc: stable@vger.kernel.org # 3.11+ Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-15 | tracing: Fix graph tracer with stack tracer on other archs | Steven Rostedt (Red Hat)
When I ran my ftrace tests on PowerPC, the test that checks whether the function_graph tracer is affected by the stack tracer failed. It was affected. Looking into this, I found that update_function_graph_func() must be called even if the trampoline function is not changed. This is because archs like PowerPC do not support ftrace_ops being passed by assembly and instead use a helper function (what the trampoline function points to). Since this function is not changed even when multiple ftrace_ops are added to the code, the test that falls out before calling update_function_graph_func() will miss that the update must still be done. Call update_function_graph_func() for all calls to update_ftrace_function(). Cc: stable@vger.kernel.org # 3.3+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-15 | tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs | zhangwei(Jovi)
Currently the stacktrace trace option does not apply to trace_printk() with a constant string argument, because ftrace_trace_stack is missing in __trace_puts/__trace_bputs. In contrast, when trace_printk() is used with a non-constant string argument (which calls into __trace_printk/__trace_bprintk), the stacktrace option works. This inconsistent result confuses users a lot. Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-15PM / sleep: fix freeze_ops NULL pointer dereferencesZhang Rui
This patch fixes a NULL pointer dereference issue introduced by commit 1f0b63866fc1 (ACPI / PM: Hold ACPI scan lock over the "freeze" sleep state). Fixes: 1f0b63866fc1 (ACPI / PM: Hold ACPI scan lock over the "freeze" sleep state) Link: http://marc.info/?l=linux-pm&m=140541182017443&w=2 Reported-and-tested-by: Alexander Stein <alexander.stein@systec-electronic.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-07-15PM / sleep: Fix request_firmware() error at resumeTakashi Iwai
Commit 247bc037 (PM / Sleep: Mitigate race between the freezer and request_firmware()) introduced finer state control, but it also led to a new bug; for example, this bug report about firmware loading for an Intel BT device at suspend/resume: https://bugzilla.novell.com/show_bug.cgi?id=873790 The root cause seems to be a small window between process resume and the clearing of the usermodehelper lock. The request_firmware() function checks the UMH lock and gives up when it is in the UMH_DISABLED state. This avoids invalid f/w loading during the suspend/resume phase. The problem, however, is that usermodehelper_enable() is called at the end of thaw_processes(). Thus, a process thawed in between can kick off the f/w loader code path (in this case, via btusb_setup_intel()) even before the call to usermodehelper_enable(). Then usermodehelper_read_trylock() returns an error and request_firmware() ends up spewing a WARN_ON(). This one-liner patch fixes the issue by setting the state to UMH_FREEZING again before restarting tasks, so that a call to request_firmware() blocks until the end of this function instead of returning an error. Fixes: 247bc0374254 (PM / Sleep: Mitigate race between the freezer and request_firmware()) Link: https://bugzilla.novell.com/show_bug.cgi?id=873790 Cc: 3.4+ <stable@vger.kernel.org> # 3.4+ Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
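The window described in the commit above can be sketched as a tiny user-space state model. This is only an illustration: every `model_` name is hypothetical, and the three states merely mirror the enabled / freezing / disabled distinction the message describes, not the kernel's actual code.

```c
/* Hypothetical model of the usermodehelper states named in the commit. */
enum model_umh_state {
	MODEL_UMH_ENABLED,	/* helpers may run */
	MODEL_UMH_FREEZING,	/* callers wait until the state changes */
	MODEL_UMH_DISABLED	/* callers give up with an error */
};

/* What a firmware request does when it sees a given state. */
enum model_fw_result {
	MODEL_FW_PROCEED,
	MODEL_FW_WAIT,
	MODEL_FW_ERROR
};

static enum model_fw_result model_request_firmware(enum model_umh_state s)
{
	switch (s) {
	case MODEL_UMH_ENABLED:
		return MODEL_FW_PROCEED;
	case MODEL_UMH_FREEZING:
		return MODEL_FW_WAIT;	/* blocked instead of failing */
	default:
		return MODEL_FW_ERROR;	/* this path leads to the warning */
	}
}

/*
 * State observed by a task that is thawed before the helper lock is
 * re-enabled. Without the patch the state is still "disabled"; with
 * the patch it is set back to "freezing" before tasks restart.
 */
static enum model_umh_state model_state_during_thaw(int patched)
{
	return patched ? MODEL_UMH_FREEZING : MODEL_UMH_DISABLED;
}
```

In this model an early firmware request fails outright without the patch and simply waits with it, which is the behavior change the message describes.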
2014-07-14tracing: instance_rmdir() leaks ftrace_event_file->filterOleg Nesterov
The instance_rmdir() path destroys the event files but forgets to free file->filter. Change remove_event_file_dir() to call free_event_filter(). Link: http://lkml.kernel.org/p/20140711190638.GA19517@redhat.com Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com> Cc: stable@vger.kernel.org # 3.11+ Fixes: f6a84bdc75b5 "tracing: Introduce remove_event_file_dir()" Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-10Merge branch 'for-3.16-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Mostly fixes for the fallouts from the recent cgroup core changes. The decoupled nature of cgroup dynamic hierarchy management (hierarchies are created dynamically on mount but may or may not be reused once unmounted depending on remaining usages) led to more ugliness being added to kernfs. Hopefully, this is the last of it" * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: break kernfs active protection in cpuset_write_resmask() cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() kernfs: introduce kernfs_pin_sb() cgroup: fix mount failure in a corner case cpuset,mempolicy: fix sleeping function called from invalid context cgroup: fix broken css_has_online_children()
2014-07-10Merge branch 'for-3.16-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Two workqueue fixes. Both are one liners. One fixes missing uevent for workqueue files on sysfs. The other one fixes missing zeroing of NUMA cpu masks which can lead to oopses among other things" * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: zero cpumask of wq_numa_possible_cpumask on init workqueue: fix dev_set_uevent_suppress() imbalance
2014-07-08alarmtimer: Fix bug where relative alarm timers were treated as absoluteJohn Stultz
Sharvil noticed that with the POSIX timer_settime() interface, when using the CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, a timer specified with a relative time was incorrectly treated as absolute regardless of the state of the flags argument. This patch corrects this by properly checking the absolute/relative flag, and adds further error checking that no invalid flag bits are set. Reported-by: Sharvil Nanavati <sharvil@google.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Cc: stable <stable@vger.kernel.org> #3.0+ Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
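The flag handling described above can be sketched in user space as follows. This is a minimal model, not the kernel code: `model_alarm_expiry` and `MODEL_TIMER_ABSTIME` are hypothetical names, and times are plain nanosecond counts assumed to be non-negative.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the "absolute time" flag bit. */
#define MODEL_TIMER_ABSTIME 0x01

/*
 * Returns the expiry time in nanoseconds, or a negative errno value.
 * Assumes now_ns and value_ns are non-negative so the two cases
 * cannot be confused.
 */
static int64_t model_alarm_expiry(int flags, int64_t now_ns, int64_t value_ns)
{
	/* Reject any flag bit other than the one we understand. */
	if (flags & ~MODEL_TIMER_ABSTIME)
		return -EINVAL;

	/* Absolute timers use the value as-is. */
	if (flags & MODEL_TIMER_ABSTIME)
		return value_ns;

	/* Relative timers are offset from the current time. */
	return now_ns + value_ns;
}
```

The bug described in the commit corresponds to always taking the absolute branch; the sketch shows the relative branch and the invalid-flag check side by side.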
2014-07-07workqueue: zero cpumask of wq_numa_possible_cpumask on initYasuaki Ishimatsu
When hot-adding and onlining a CPU, a kernel panic occurs, showing the following call trace.

  BUG: unable to handle kernel paging request at 0000000000001d08
  IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
  PGD 0
  Oops: 0000 [#1] SMP
  ...
  Call Trace:
   [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
   [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
   [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
   [<ffffffff811926f1>] new_slab+0x91/0x300
   [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
   [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
   [<ffffffff810a3c78>] ? load_balance+0x218/0x890
   [<ffffffff8101a679>] ? sched_clock+0x9/0x10
   [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
   [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
   [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
   [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
   [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
   [<ffffffff8105d0ec>] do_fork+0xbc/0x360
   [<ffffffff8105d3b6>] kernel_thread+0x26/0x30
   [<ffffffff81086652>] kthreadd+0x2c2/0x300
   [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
   [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
   [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60

In my investigation, I found the root cause is wq_numa_possible_cpumask. All entries of wq_numa_possible_cpumask are allocated by alloc_cpumask_var_node(), and these entries are used without being initialized, so they hold wrong values.

When hot-adding and onlining a CPU, wq_update_unbound_numa() is called. wq_update_unbound_numa() calls alloc_unbound_pwq(), and alloc_unbound_pwq() calls get_unbound_pool().
In get_unbound_pool(), worker_pool->node is set as follows:

    /* if cpumask is contained inside a NUMA node, we belong to that node */
    if (wq_numa_enabled) {
            for_each_node(node) {
                    if (cpumask_subset(pool->attrs->cpumask,
                                       wq_numa_possible_cpumask[node])) {
                            pool->node = node;
                            break;
                    }
            }
    }

But wq_numa_possible_cpumask[node] does not hold a correct cpumask, so the wrong node is selected. As a result, the kernel panics.

With this patch, all entries of wq_numa_possible_cpumask are allocated by zalloc_cpumask_var_node() so that they are initialized, and the panic disappears.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
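Why a non-zeroed mask picks the wrong node can be shown with a small user-space model. This is a sketch under stated assumptions: CPU masks are plain 64-bit words, all `model_` names are hypothetical, and the step that ORs each node's CPUs into a pre-existing table entry is an assumption about how the table is filled. The `fill` argument stands in for whatever an uninitialized allocation happened to contain.

```c
#include <stdint.h>

#define MODEL_NO_NODE (-1)

/* True when every CPU in a is also in b. */
static int model_mask_subset(uint64_t a, uint64_t b)
{
	return (a & ~b) == 0;
}

/* First node whose CPU mask contains all of pool_cpus. */
static int model_pick_node(uint64_t pool_cpus, const uint64_t *node_cpus,
			   int nr_nodes)
{
	for (int node = 0; node < nr_nodes; node++)
		if (model_mask_subset(pool_cpus, node_cpus[node]))
			return node;
	return MODEL_NO_NODE;
}

/*
 * Two nodes: CPUs 0-3 on node 0, CPUs 4-7 on node 1. The per-node
 * bits are OR-ed into entries that start out as 'fill' (assumed
 * fill-in step; fill == 0 models a zeroed allocation).
 */
static int model_node_for_cpu4(uint64_t fill)
{
	uint64_t node_cpus[2] = { fill, fill };

	node_cpus[0] |= 0x0f;
	node_cpus[1] |= 0xf0;
	return model_pick_node(UINT64_C(1) << 4, node_cpus, 2);
}
```

With a zero fill, CPU 4 maps to node 1 as expected; with stale bits set, node 0 already appears to contain CPU 4 and is chosen first.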
2014-07-05Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A few minor fixlets in ARM SoC irq drivers and a fix for a memory leak which I introduced in the last round of cleanups :(" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Fix memory leak when calling irq_free_hwirqs() irqchip: spear_shirq: Fix interrupt offset irqchip: brcmstb-l2: Level-2 interrupts are edge sensitive irqchip: armada-370-xp: Mask all interrupts during initialization.
2014-07-05genirq: Fix memory leak when calling irq_free_hwirqs()Keith Busch
irq_free_hwirqs() always calls irq_free_descs() with cnt == 0, which makes it a no-op, because the function itself has already decremented the count of interrupts to free down to zero. Fixes: 7b6ef1262549f6afc5c881aaef80beb8fd15f908 Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1404167084-8070-1-git-send-email-keith.busch@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
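The pattern behind this leak can be sketched in user space. The loop shape below is an assumption inferred from the description (a counter consumed by a loop and then reused as a length), and all `model_` names are hypothetical; the "fixed" variant shows one possible repair, not necessarily the one the patch uses.

```c
/* Counts how many descriptors the model has released. */
static int model_descs_freed;

static void model_free_descs(unsigned int from, unsigned int cnt)
{
	(void)from;
	model_descs_freed += (int)cnt;
}

static void model_teardown(unsigned int irq)
{
	(void)irq;	/* per-interrupt teardown, nothing to model */
}

/* Buggy shape: the loop consumes cnt, so the final call frees nothing. */
static int model_free_hwirqs_buggy(unsigned int from, int cnt)
{
	model_descs_freed = 0;
	for (unsigned int i = from; cnt > 0; i++, cnt--)
		model_teardown(i);
	model_free_descs(from, (unsigned int)cnt);	/* cnt is 0 here */
	return model_descs_freed;
}

/* One possible fix: release each descriptor inside the loop. */
static int model_free_hwirqs_fixed(unsigned int from, int cnt)
{
	model_descs_freed = 0;
	for (unsigned int i = from; cnt > 0; i++, cnt--) {
		model_free_descs(i, 1);
		model_teardown(i);
	}
	return model_descs_freed;
}
```

Both variants tear down every interrupt; only the second one releases the descriptors as well.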
2014-07-03Merge tag 'trace-fixes-v3.16-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code where running: # perf probe -x /lib/libc.so.6 syscall # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable # perf record -e probe_libc:syscall whatever kills the uprobe. Along the way he found some other minor bugs and clean ups that he fixed up making it a total of 4 patches. Doing unrelated work, I found that the reading of the ftrace trace file disables all function tracer callbacks. This was fine when ftrace was the only user, but now that it's used by perf and kprobes, this is a bug where reading trace can disable kprobes and perf. A very unexpected side effect and should be fixed" * tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Remove ftrace_stop/start() from reading the trace file tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable() tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher() uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone tracing/uprobes: Revert "Support mix of ftrace and perf"
2014-07-03kernel/printk/printk.c: revert "printk: enable interrupts before calling ↵Andrew Morton
console_trylock_for_printk()" Revert commit 939f04bec1a4 ("printk: enable interrupts before calling console_trylock_for_printk()"). Andreas reported: : None of the post 3.15 kernel boot for me. They all hang at the GRUB : screen telling me it loaded and started the kernel, but the kernel : itself stops before it prints anything (or even replaces the GRUB : background graphics). 939f04bec1a4 is a modest latency reduction. Revert it until we understand the reason for these failures. Reported-by: Andreas Bombe <aeb@debian.org> Cc: Jan Kara <jack@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-02perf: Do not allow optimized switch for non-cloned eventsJiri Olsa
The context check in perf_event_context_sched_out() allows a non-cloned context to be part of the optimized schedule-out switch. This could move a non-cloned context into another workload child. Once this child exits, the context is closed and leaves all original (parent) events in closed state. Any other new cloned event will then have closed state and not measure anything, and this probably causes other odd bugs as well. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1403598026-2310-2-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-01cpuset: break kernfs active protection in cpuset_write_resmask()Tejun Heo
Writing to either "cpuset.cpus" or "cpuset.mems" file flushes cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up migrating tasks off a cpuset after new resources are added to it.

As cpuset_hotplug_work calls into cgroup core via cgroup_transfer_tasks(), this flushing adds a dependency on cgroup core locking from cpuset_write_resmask(). This used to be okay because cgroup interface files were protected by a different mutex; however, 8353da1f91f1 ("cgroup: remove cgroup_tree_mutex") simplified the cgroup core locking and this dependency became a deadlock hazard: cgroup file removal performed under the cgroup core lock tries to drain an on-going file operation which is trying to flush cpuset_hotplug_work, which in turn is blocked on the same cgroup core lock.

The locking simplification was done because kernfs added a much easier way to deal with circular dependencies involving kernfs active protection. Let's use the same strategy in cpuset and break active protection in cpuset_write_resmask(). While it isn't the prettiest, this is a very rare, likely unique, situation which also goes away on the unified hierarchy.

The commands to trigger the deadlock warning without the patch and the lockdep output follow.

 localhost:/ # mount -t cgroup -o cpuset xxx /cpuset
 localhost:/ # mkdir /cpuset/tmp
 localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus
 localhost:/ # echo 0 > cpuset/tmp/cpuset.mems
 localhost:/ # echo $$ > /cpuset/tmp/tasks
 localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.16.0-rc1-0.1-default+ #7 Not tainted
  -------------------------------------------------------
  kworker/1:0/32649 is trying to acquire lock:
   (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150

  but task is already holding lock:
   (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (cpuset_hotplug_work){+.+...}:
  ...
  -> #1 (s_active#175){++++.+}:
  ...
  -> #0 (cgroup_mutex){+.+.+.}:
  ...

  other info that might help us debug this:

  Chain exists of:
    cgroup_mutex --> s_active#175 --> cpuset_hotplug_work

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(cpuset_hotplug_work);
                                 lock(s_active#175);
                                 lock(cpuset_hotplug_work);
    lock(cgroup_mutex);

   *** DEADLOCK ***

  2 locks held by kworker/1:0/32649:
   #0:  ("events"){.+.+.+}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
   #1:  (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520

  stack backtrace:
  CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7
  ...
  Call Trace:
   [<ffffffff815a5f78>] dump_stack+0x72/0x8a
   [<ffffffff810c263f>] print_circular_bug+0x10f/0x120
   [<ffffffff810c481e>] check_prev_add+0x43e/0x4b0
   [<ffffffff810c4ee6>] validate_chain+0x656/0x7c0
   [<ffffffff810c53d2>] __lock_acquire+0x382/0x660
   [<ffffffff810c57a9>] lock_acquire+0xf9/0x170
   [<ffffffff815aa13f>] mutex_lock_nested+0x6f/0x380
   [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
   [<ffffffff811129c0>] hotplug_update_tasks_insane+0x110/0x1d0
   [<ffffffff81112bbd>] cpuset_hotplug_update_tasks+0x13d/0x180
   [<ffffffff811148ec>] cpuset_hotplug_workfn+0x18c/0x630
   [<ffffffff810854d4>] process_one_work+0x254/0x520
   [<ffffffff810875dd>] worker_thread+0x13d/0x3d0
   [<ffffffff8108e0c8>] kthread+0xf8/0x100
   [<ffffffff815acaec>] ret_from_fork+0x7c/0xb0

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Li Zefan <lizefan@huawei.com>
Tested-by: Li Zefan <lizefan@huawei.com>
2014-07-01tracing: Remove ftrace_stop/start() from reading the trace fileSteven Rostedt (Red Hat)
Reading from and writing to the trace file should not be able to disable all function tracing callbacks. There are other users today (like kprobes and perf). Reading a trace file should not stop those from happening. Cc: stable@vger.kernel.org # 3.0+ Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()Oleg Nesterov
The usage of uprobe_buffer_enable() added by dcad1a20 is very wrong,

1. uprobe_buffer_enable() and uprobe_buffer_disable() are not balanced, _enable() should be called only if !enabled.

2. If uprobe_buffer_enable() fails probe_event_enable() should clear tp.flags and free event_file_link.

3. If uprobe_register() fails it should do uprobe_buffer_disable().

Link: http://lkml.kernel.org/p/20140627170146.GA18332@redhat.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Fixes: dcad1a204f72 "tracing/uprobes: Fetch args before reserving a ring buffer"
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
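The three rules listed in the commit above amount to a balanced enable path with error unwinding. The following user-space sketch illustrates that shape only; every `model_` name is hypothetical, the error codes are arbitrary, and the real function's bookkeeping is more involved.

```c
#include <errno.h>

static int model_buffer_refs;		/* enable/disable balance */
static int model_flags;			/* stands in for tp.flags */
static int model_link_allocated;	/* stands in for event_file_link */

static int model_buffer_enable(int fail)
{
	if (fail)
		return -ENOMEM;
	model_buffer_refs++;
	return 0;
}

static void model_buffer_disable(void)
{
	model_buffer_refs--;
}

/* was_enabled: the probe already had a user before this call. */
static int model_probe_enable(int was_enabled, int buffer_fails,
			      int register_fails)
{
	int ret;

	model_flags = 1;
	model_link_allocated = 1;

	if (was_enabled)
		return 0;		/* rule 1: enable only if !enabled */

	ret = model_buffer_enable(buffer_fails);
	if (ret)
		goto err_flags;		/* rule 2: undo flags and link */

	if (register_fails) {
		ret = -EIO;
		goto err_buffer;	/* rule 3: undo the buffer enable */
	}
	return 0;

err_buffer:
	model_buffer_disable();
err_flags:
	model_flags = 0;
	model_link_allocated = 0;
	return ret;
}
```

Each failure label undoes exactly what was set up before it, so the buffer reference count returns to its starting value on every error path.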
2014-06-30tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in ↵Oleg Nesterov
uprobe_dispatcher() I do not know why dd9fa555d7bb "tracing/uprobes: Move argument fetching to uprobe_dispatcher()" added the UPROBE_HANDLER_REMOVE, but it looks wrong. OK, perhaps it makes sense to avoid store_trace_args() if the tracee is nacked by uprobe_perf_filter(). But then we should kill the same code in uprobe_perf_func() and unify the TRACE/PROFILE filtering (we need to do this anyway to mix perf/ftrace). Until then this code actually adds the pessimization because uprobe_perf_filter() will be called twice and return T in likely case. Link: http://lkml.kernel.org/p/20140627170143.GA18329@redhat.com Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30uprobes: Change unregister/apply to WARN() if uprobe/consumer is goneOleg Nesterov
Add WARN_ON's into uprobe_unregister() and uprobe_apply() to ensure that nobody tries to play with the dead uprobe/consumer. This helps to catch the bugs like the one fixed by the previous patch. In the longer term we should fix this poorly designed interface. uprobe_register() should return "struct uprobe *" which should be passed to apply/unregister. Plus other semantic changes, see the changelog in commit 41ccba029e94. Link: http://lkml.kernel.org/p/20140627170140.GA18322@redhat.com Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30tracing/uprobes: Revert "Support mix of ftrace and perf"Oleg Nesterov
This reverts commit 43fe98913c9f67e3b523615ee3316f9520a623e0. This patch is very wrong. Firstly, this change leads to unbalanced uprobe_unregister(). Just for example, # perf probe -x /lib/libc.so.6 syscall # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable # perf record -e probe_libc:syscall whatever after that uprobe is dead (unregistered) but the user of ftrace/perf can't know this, and it looks as if nobody hits this probe. This would be easy to fix, but there are other reasons why it is not simple to mix ftrace and perf. If nothing else, they can't share the same ->consumer.filter. This is fixable too, but probably we need to fix the poorly designed uprobe_register() interface first. At least "register" and "apply" should be clearly separated. Link: http://lkml.kernel.org/p/20140627170136.GA18319@redhat.com Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com> Cc: stable@vger.kernel.org # v3.14 Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()Li Zefan
We've converted cgroup to kernfs so cgroup won't be intertwined with vfs objects and locking, but there are dark areas.

Run two instances of this script concurrently:

    for ((; ;))
    {
        mount -t cgroup -o cpuacct xxx /cgroup
        umount /cgroup
    }

After a while, I saw two mount processes were stuck at retrying, because they were waiting for a subsystem to become free, but the root associated with this subsystem never got freed.

This can happen if thread A is in the process of killing the superblock but hasn't called percpu_ref_kill(), and at this time thread B is mounting the same cgroup root, finds the root in the root list and performs percpu_ref_try_get().

To fix this, we try to increase both the refcnt of the superblock and the percpu refcnt of the cgroup root.

v2:
- we should try to get both the superblock refcnt and cgroup_root refcnt, because cgroup_root may have no superblock associated with it.
- adjust/add comments.

tj: Updated comments. Renamed @sb to @pinned_sb.

Cc: <stable@vger.kernel.org> # 3.15
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-30cgroup: fix mount failure in a corner caseLi Zefan
    # cat test.sh
    #! /bin/bash

    mount -t cgroup -o cpu xxx /cgroup
    umount /cgroup

    mount -t cgroup -o cpu,cpuacct xxx /cgroup
    umount /cgroup
    # ./test.sh
    mount: xxx already mounted or /cgroup busy
    mount: according to mtab, xxx is already mounted on /cgroup

It's because the cgroupfs_root of the first mount was under destruction asynchronously.

Fix this by delaying and then retrying mount for this case.

v3:
- put the refcnt immediately after getting it. (Tejun)

v2:
- use percpu_ref_tryget_live() rather than introducing percpu_ref_alive(). (Tejun)
- adjust comment.

tj: Updated the comment a bit.

Cc: <stable@vger.kernel.org> # 3.15
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-25cpuset,mempolicy: fix sleeping function called from invalid contextGu Zheng
When running the kernel (3.15-rc7+), the following bug occurs:

[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66
[ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130
[ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
[ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
[ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
[ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380
[ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
[ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90
[ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b

The cause is that cpuset_mems_allowed() tries to take mutex_lock(&callback_mutex) under the rcu_read_lock (which is held in __mpol_dup()).
And in cpuset_mems_allowed(), the access to cpuset is under rcu_read_lock, so in __mpol_dup() we can shrink the rcu_read_lock-protected region to cover only the access to cpuset in current_cpuset_is_being_rebound(), which avoids this bug. This patch is a temporary solution that only addresses the bug mentioned above; it does not fix the long-standing issue with cpuset.mems rebinding on fork(): "When the forker's task_struct is duplicated (which includes ->mems_allowed) and it races with an update to cpuset_being_rebound in update_tasks_nodemask() then the task's mems_allowed doesn't get updated. And the child task's mems_allowed can be wrong if the cpuset's nodemask changes before the child has been added to the cgroup's tasklist." Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable <stable@vger.kernel.org>
2014-06-25Merge tag 'trace-fixes-v3.16-rc1-v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing cleanups and fixes from Steven Rostedt: "This includes three patches from Oleg Nesterov. The first is a fix to a race condition that happens between enabling/disabling syscall tracepoints and new process creations (the check to go into the ptrace path for a process can be set when it shouldn't, or not set when it should). Not a major bug but one that should be fixed and even applied to stable. The other two patches are cleanup/fixes that are not that critical, but for an -rc1 release would be nice to have. They both deal with syscall tracepoints. It also includes a patch to introduce a new macro for the TRACE_EVENT() format called __field_struct(). Originally, __field() was used to record any variable into a trace event, but with the addition of setting the "is signed" attribute, the check causes anything but a primitive variable to fail to compile. That is, structs and unions can't be used as they once were. When the "is signed" check was introduced there were only primitive variables being recorded. But that will change soon and it was reported that __field() causes build failures. To solve the __field() issue, __field_struct() is introduced to allow trace_events to be able to record complex types too" * tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Add __field_struct macro for TRACE_EVENT() tracing: syscall_regfunc() should not skip kernel threads tracing: Change syscall_*regfunc() to check PF_KTHREAD and use for_each_process_thread() tracing: Fix syscall_*regfunc() vs copy_process() race
2014-06-25Merge branch 'urgent.2014.06.23a' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent Pull RCU fixes from Paul E. McKenney: " This series includes the following: 1. Export a pair of debug-object interfaces for RCU that will allow the slab allocators to avoid a recursion bug located by Sasha Levin. Strictly speaking, this is not a regression, but it would be good to enable the fix. 2. Address a serious performance regression on an open/close micro-benchmark located by Dave Hansen. The offending commit is ac1bea85781e (Make cond_resched() report RCU quiescent states). " Signed-off-by: Ingo Molnar <mingo@kernel.org>