summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2009-04-03Merge branch 'core-cleanups-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ptrace: remove a useless goto
2009-04-03Merge branch 'stacktrace-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: symbols, stacktrace: look up init symbols after module symbols
2009-04-03Merge branch 'rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: rcu_barrier VS cpu_hotplug: Ensure callbacks in dead cpu are migrated to online cpu
2009-04-03Merge branch 'ipi-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: s390: remove arch specific smp_send_stop() panic: clean up kernel/panic.c panic, smp: provide smp_send_stop() wrapper on UP too panic: decrease oops_in_progress only after having done the panic generic-ipi: eliminate WARN_ON()s during oops/panic generic-ipi: cleanups generic-ipi: remove CSD_FLAG_WAIT generic-ipi: remove kmalloc() generic IPI: simplify barriers and locking
2009-04-03Merge branch 'locking-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit] lockdep: remove duplicate CONFIG_DEBUG_LOCKDEP definitions lockdep: require framepointers for x86 lockdep: remove extra "irq" string lockdep: fix incorrect state name
2009-04-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits) trivial: Update my email address trivial: NULL noise: drivers/mtd/tests/mtd_*test.c trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h trivial: Fix misspelling of "Celsius". trivial: remove unused variable 'path' in alloc_file() trivial: fix a pdlfush -> pdflush typo in comment trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL trivial: wusb: Storage class should be before const qualifier trivial: drivers/char/bsr.c: Storage class should be before const qualifier trivial: h8300: Storage class should be before const qualifier trivial: fix where cgroup documentation is not correctly referred to trivial: Give the right path in Documentation example trivial: MTD: remove EOL from MODULE_DESCRIPTION trivial: Fix typo in bio_split()'s documentation trivial: PWM: fix of #endif comment trivial: fix typos/grammar errors in Kconfig texts trivial: Fix misspelling of firmware trivial: cgroups: documentation typo and spelling corrections trivial: Update contact info for Jochen Hein trivial: fix typo "resgister" -> "register" ...
2009-04-03irq: fix cpumask memory leak on offstack cpumask kernelsYinghai Lu
Need to free the old cpumask for affinity and pending_mask. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <49D18FF0.50707@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03Document the slow work thread poolDavid Howells
Document the slow work thread pool. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03Make the slow work pool configurableDavid Howells
Make the slow work pool configurable through /proc/sys/kernel/slow-work. (*) /proc/sys/kernel/slow-work/min-threads The minimum number of threads that should be in the pool as long as it is in use. This may be anywhere between 2 and max-threads. (*) /proc/sys/kernel/slow-work/max-threads The maximum number of threads that should in the pool. This may be anywhere between min-threads and 255 or NR_CPUS * 2, whichever is greater. (*) /proc/sys/kernel/slow-work/vslow-percentage The percentage of active threads in the pool that may be used to execute very slow work items. This may be between 1 and 99. The resultant number is bounded to between 1 and one fewer than the number of active threads. This ensures there is always at least one thread that can process very slow work items, and always at least one thread that won't. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03Make slow-work thread pool actually dynamicDavid Howells
Make the slow-work thread pool actually dynamic in the number of threads it contains. With this patch, it will both create additional threads when it has extra work to do, and cull excess threads that aren't doing anything. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03Create a dynamically sized pool of threads for doing very slow work itemsDavid Howells
Create a dynamically sized pool of threads for doing very slow work items, such as invoking mkdir() or rmdir() - things that may take a long time and may sleep, holding mutexes/semaphores and hogging a thread, and are thus unsuitable for workqueues. The number of threads is always at least a settable minimum, but more are started when there's more work to do, up to a limit. Because of the nature of the load, it's not suitable for a 1-thread-per-CPU type pool. A system with one CPU may well want several threads. This is used by FS-Cache to do slow caching operations in the background, such as looking up, creating or deleting cache objects. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03blktrace: fix pdu_len when tracing packet command requestsLi Zefan
Impact: output all of packet commands - not just the first 4 / 8 bytes Since commit d7e3c3249ef23b4617393c69fe464765b4ff1645 ("block: add large command support"), struct request->cmd has been changed from unsinged char cmd[BLK_MAX_CDB] to unsigned char *cmd. v1 -> v2: by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> - make sure rq->cmd_len is always intialized, and then we can use rq->cmd_len instead of BLK_MAX_CDB. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> LKML-Reference: <49D4507E.2060602@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03blktrace: small cleanup in blk_msg_write()Li Zefan
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "Alan D. Brunelle" <alan.brunelle@hp.com> Cc: Jens Axboe <jens.axboe@oracle.com> LKML-Reference: <49D5BB56.7000807@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03blktrace: NUL-terminate user space messagesCarl Henrik Lunde
Impact: fix corrupted blkparse output Make sure messages from user space are NUL-terminated strings, otherwise we could dump random memory to the block trace file. Additionally, I've limited the message to BLK_TN_MAX_MSG-1 characters, because the last character would be stripped by vscnprintf anyway. Signed-off-by: Carl Henrik Lunde <chlunde@ping.uio.no> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "Alan D. Brunelle" <alan.brunelle@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20090403122714.GT5178@kernel.dk> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03kmemtrace: small cleanupsIngo Molnar
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <161be9ca8a27b432c4a6ab79f47788c4521652ae.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03kmemtrace: restore original tracing data binary format, improve ABIEduard - Gabriel Munteanu
When kmemtrace was ported to ftrace, the marker strings were taken as an indication of how the traced data was being exposed to the userspace. However, the actual format had been binary, not text. This restores the original binary format, while also adding an origin CPU field (since ftrace doesn't expose the data per-CPU to userspace), and re-adding the timestamp field. It also drops arch-independent field sizing where it didn't make sense, so pointers won't always be 64 bits wide like they used to. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <161be9ca8a27b432c4a6ab79f47788c4521652ae.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03kmemtrace: kmemtrace_alloc() must fill type_idEduard - Gabriel Munteanu
Impact: fix trace output kmemtrace_alloc() was not filling type_id, which allowed garbage to make it into tracing data. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <284dba2732a144849d5aa82258fe0de2ad8dcb0b.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03kmemtrace: use tracepointsEduard - Gabriel Munteanu
kmemtrace now uses tracepoints instead of markers. We no longer need to use format specifiers to pass arguments. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> [ folded: Use the new TP_PROTO and TP_ARGS to fix the build. ] [ folded: fix build when CONFIG_KMEMTRACE is disabled. ] [ folded: define tracepoints when CONFIG_TRACEPOINTS is enabled. ] Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> LKML-Reference: <ae61c0f37156db8ec8dc0d5778018edde60a92e3.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03kmemtrace, rcu: fix rcupreempt.c data structure dependenciesIngo Molnar
Impact: cleanup We want to remove percpu.h from rcupreempt.h, but if we do that the percpu primitives there wont build anymore. Move them to the .c file instead. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1237898630.25315.83.camel@penberg-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03kmemtrace, rcu: fix rcu_tree_trace.c data structure dependenciesIngo Molnar
Impact: cleanup We want to remove rcutree internals from the public rcutree.h file for upcoming kmemtrace changes - but kernel/rcutree_trace.c depends on them. Introduce kernel/rcutree.h for internal definitions. (Probably all the other data types from include/linux/rcutree.h could be moved here too - except rcu_data.) Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1237898630.25315.83.camel@penberg-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependenciesIngo Molnar
Impact: build fix for all non-x86 architectures We want to remove percpu.h from rcuclassic.h/rcutree.h (for upcoming kmemtrace changes) but that would break the DECLARE_PER_CPU based declarations in these files. Move the quiescent counter management functions to their respective RCU implementation .c files - they were slightly above the inlining limit anyway. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1237898630.25315.83.camel@penberg-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: Remove two unneeded exports and make two symbols static in fs/mpage.c Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225 Trim includes of fdtable.h Don't crap into descriptor table in binfmt_som Trim includes in binfmt_elf Don't mess with descriptor table in load_elf_binary() Get rid of indirect include of fs_struct.h New helper - current_umask() check_unsafe_exec() doesn't care about signal handlers sharing New locking/refcounting for fs_struct Take fs_struct handling to new file (fs/fs_struct.c) Get rid of bumping fs_struct refcount in pivot_root(2) Kill unsharing fs_struct in __set_personality()
2009-04-02Allow rwlocks to re-enable interruptsRobin Holt
Pass the original flags to rwlock arch-code, so that it can re-enable interrupts if implemented for that architecture. Initially, make __raw_read_lock_flags and __raw_write_lock_flags stubs which just do the same thing as non-flags variants. Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Robin Holt <holt@sgi.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <linux-arch@vger.kernel.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02Factor out #ifdefs from kernel/spinlock.c to LOCK_CONTENDED_FLAGSRobin Holt
SGI has observed that on large systems, interrupts are not serviced for a long period of time when waiting for a rwlock. The following patch series re-enables irqs while waiting for the lock, resembling the code which is already there for spinlocks. I only made the ia64 version, because the patch adds some overhead to the fast path. I assume there is currently no demand to have this for other architectures, because the systems are not so large. Of course, the possibility to implement raw_{read|write}_lock_flags for any architecture is still there. This patch: The new macro LOCK_CONTENDED_FLAGS expands to the correct implementation depending on the config options, so that IRQ's are re-enabled when possible, but they remain disabled if CONFIG_LOCKDEP is set. Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Robin Holt <holt@sgi.com> Cc: <linux-arch@vger.kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02relay: fix for possible loss/corruption of produced subbufsAravind Srinivasan
Fix possible loss/corruption of produced subbufs in relay_subbufs_consumed(). When buf->subbufs_produced wraps around after UINT_MAX and buf->subbufs_consumed is still < UINT_MAX, the condition if (buf->subbufs_consumed > buf->subbufs_produced) will be true even for certain valid values of subbufs_consumed. This may lead to loss or corruption of produced subbufs. Signed-off-by: Aravind Srinivasan <raa.aars@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02kexec: vmcoreinfo_data[] can become staticDmitri Vorobiev
The vmcoreinfo_data[] array is not used outside of kernel/kexec.c, and can therefore become static. This patch adds the relevant keyword to the definition of the array. Noticed by sparse. Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02kexec: add dmesg log symbols to /proc/vmcoreinfo listsNeil Horman
It would be nice to be able to extract the dmesg log from a vmcore file without needing to keep the debug symbols for the running kernel handy all the time. We have a facility to do this in /proc/vmcore. This patch adds the log_buf and log_end symbols to the vmcoreinfo area so that tools (like makedumpfile) can easily extract the dmesg logs from a vmcore image. [akpm@linux-foundation.org: several fixes and cleanups] [akpm@linux-foundation.org: fix unused log_buf_kexec_setup()] [akpm@linux-foundation.org: build fix] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Cc: Simon Horman <horms@verge.net.au> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Simon Horman <horms@verge.net.au> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02pids: kill signal_struct-> __pgrp/__session and friendsOleg Nesterov
We are wasting 2 words in signal_struct without any reason to implement task_pgrp_nr() and task_session_nr(). task_session_nr() has no callers since 2e2ba22ea4fd4bb85f0fa37c521066db6775cbef, we can remove it. task_pgrp_nr() is still (I believe wrongly) used in fs/autofsX and fs/coda. This patch reimplements task_pgrp_nr() via task_pgrp_nr_ns(), and kills __pgrp/__session and the related helpers. The change in drivers/char/tty_io.c is cosmetic, but hopefully makes sense anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alan Cox <number6@the-village.bc.nu> [tty parts] Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02pids: refactor vnr/nr_ns helpers to make them safeOleg Nesterov
Inho, the safety rules for vnr/nr_ns helpers are horrible and buggy. task_pid_nr_ns(task) needs rcu/tasklist depending on task == current. As for "special" pids, vnr/nr_ns helpers always need rcu. However, if task != current, they are unsafe even under rcu lock, we can't trust task->group_leader without the special checks. And almost every helper has a callsite which needs a fix. Also, it is a bit annoying that the implementations of, say, task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical". This patch introduces the new helper, __task_pid_nr_ns(), which is always safe to use, and turns all other helpers into the trivial wrappers. After this I'll send another patch which converts task_tgid_xxx() as well, they're are a bit special. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Louis Rilling <Louis.Rilling@kerlabs.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02pids: improve get_task_pid() to fix the unsafe sys_wait4()->task_pgrp()Oleg Nesterov
sys_wait4() does get_pid(task_pgrp(current)), this is not safe. We can add rcu lock/unlock around, but we already have get_task_pid() which can be improved to handle the special pids in more reliable manner. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Louis Rilling <Louis.Rilling@kerlabs.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02sysctl: fix suid_dumpable and lease-break-time sysctlsMatthew Wilcox
Arne de Bruijn points out that commit 76fdbb25f963de5dc1e308325f0578a2f92b1c2d ("coredump masking: bound suid_dumpable sysctl") mistakenly limits lease-break-time instead of suid_dumpable. Signed-off-by: Matthew Wilcox <matthew@wil.cx> Reported-by: Arne de Bruijn <kernelbt@arbruijn.dds.nl> Cc: Kawai, Hidehiro <hidehiro.kawai.ez@hitachi.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02proc_sysctl: use CONFIG_PROC_SYSCTL around ipc and utsname proc_handlersSerge E. Hallyn
As pointed out by Cedric Le Goater (in response to Alexey's original comment wrt mqns), ipc_sysctl.c and utsname_sysctl.c are using CONFIG_PROC_FS, not CONFIG_PROC_SYSCTL, to determine whether to define the proc_handlers. Change that. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02workqueue: avoid recursion in run_workqueue()Lai Jiangshan
1) lockdep will complain when run_workqueue() performs recursion. 2) The recursive implementation of run_workqueue() means that flush_workqueue() and its documentation are inconsistent. This may hide deadlocks and other bugs. 3) The recursion in run_workqueue() will poison cwq->current_work, but flush_work() and __cancel_work_timer(), etcetera need a reliable cwq->current_work. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace_untrace: fix the SIGNAL_STOP_STOPPED checkOleg Nesterov
This bug is ancient too. ptrace_untrace() must not resume the task if the group stop in progress, we should set TASK_STOPPED instead. Unfortunately, we still have problems here: - if the process/thread was traced, SIGNAL_STOP_STOPPED does not necessary means this thread group is stopped. - ptrace breaks the bookkeeping of ->group_stop_count. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace_detach: the wrong wakeup breaks the ERESTARTxxx logicOleg Nesterov
Another ancient bug. Consider this trivial test-case, int main(void) { int pid = fork(); if (pid) { ptrace(PTRACE_ATTACH, pid, NULL, NULL); wait(NULL); ptrace(PTRACE_DETACH, pid, NULL, NULL); } else { pause(); printf("WE HAVE A KERNEL BUG!!!\n"); } return 0; } the child must not "escape" for sys_pause(), but it can and this was seen in practice. This is because ptrace_detach does: if (!child->exit_state) wake_up_process(child); this wakeup can happen after this child has already restarted sys_pause(), because it gets another wakeup from ptrace_untrace(). With or without this patch, perhaps sys_pause() needs a fix. But this wakeup also breaks the SIGNAL_STOP_STOPPED logic in ptrace_untrace(). Remove this wakeup. The caller saw this task in TASK_TRACED state, and unless it was SIGKILL'ed in between __ptrace_unlink()->ptrace_untrace() should handle this case correctly. If it was SIGKILL'ed, we don't need to wakup the dying tracee too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02forget_original_parent: do not abuse child->ptrace_entryOleg Nesterov
By discussion with Roland. - Use ->sibling instead of ->ptrace_entry to chain the need to be release_task'd childs. Nobody else can use ->sibling, this task is EXIT_DEAD and nobody can find it on its own list. - rename ptrace_dead to dead_childs. - Now that we don't have the "parallel" untrace code, change back reparent_thread() to return void, pass dead_childs as an argument. Actually, I don't understand why do we notify /sbin/init when we reparent a zombie, probably it is better to reap it unconditionally. [akpm@linux-foundation.org: s/childs/children/] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Metzger, Markus T" <markus.t.metzger@intel.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02forget_original_parent: split out the un-ptrace partOleg Nesterov
By discussion with Roland. - Rename ptrace_exit() to exit_ptrace(), and change it to do all the necessary work with ->ptraced list by its own. - Move this code from exit.c to ptrace.c - Update the comment in ptrace_detach() to explain the rechecking of the child->ptrace. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Metzger, Markus T" <markus.t.metzger@intel.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02reparent_thread: fix a zombie leak if /sbin/init ignores SIGCHLDOleg Nesterov
If /sbin/init ignores SIGCHLD and we re-parent a zombie, it is leaked. reparent_thread() does do_notify_parent() which sets ->exit_signal = -1 in this case. This means that nobody except us can reap it, the detached task is not visible to do_wait(). Change reparent_thread() to return a boolean (like __pthread_detach) to indicate that the thread is dead and must be released. Also change forget_original_parent() to add the child to ptrace_dead list in this case. The naming becomes insane, the next patch does the cleanup. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02reparent_thread: fix the "is it traced" checkOleg Nesterov
reparent_thread() uses ptrace_reparented() to check whether this thread is ptraced, in that case we should not notify the new parent. But ptrace_reparented() is not exactly correct when the reparented thread is traced by /sbin/init, because forget_original_parent() has already changed ->real_parent. Currently, the only problem is the false notification. But with the next patch the kernel crash in this (yes, pathological) case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02reparent_thread: don't call kill_orphaned_pgrp() if task_detached()Oleg Nesterov
If task_detached(p) == T, then either a) p is not the main thread, we will find the group leader on the ->children list. or b) p is the group leader but its ->exit_state = EXIT_DEAD. This can only happen when the last sub-thread has died, but in that case that thread has already called kill_orphaned_pgrp() from exit_notify(). In both cases kill_orphaned_pgrp() looks bogus. Move the task_detached() check up and simplify the code, this is also right from the "common sense" pov: we should do nothing with the detached childs, except move them to the new parent's ->children list. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace: fix possible zombie leak on PTRACE_DETACHOleg Nesterov
When ptrace_detach() takes tasklist, the tracee can be SIGKILL'ed. If it has already passed exit_notify() we can leak a zombie, because a) ptracing disables the auto-reaping logic, and b) ->real_parent was not notified about the child's death. ptrace_detach() should follow the ptrace_exit's logic, change the code accordingly. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Tested-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace: reintroduce __ptrace_detach() as a callee of ptrace_exit()Oleg Nesterov
No functional changes, preparation for the next patch. Move the "should we release this child" logic into the separate handler, __ptrace_detach(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace: simplify ptrace_exit()->ignoring_children() pathOleg Nesterov
ignoring_children() takes parent->sighand->siglock and checks k_sigaction[SIGCHLD] atomically. But this buys nothing, we can't get the "really" wrong result even if we race with sigaction(SIGCHLD). If we read the "stale" sa_handler/sa_flags we can pretend it was changed right after the check. Remove spin_lock(->siglock), and kill "int ign" which caches the result of ignoring_children() which becomes rather trivial. Perhaps it makes sense to export this helper, do_notify_parent() can use it too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ptrace: kill __ptrace_detach(), fix ->exit_state checkOleg Nesterov
Move the code from __ptrace_detach() to its single caller and kill this helper. Also, fix the ->exit_state check, we shouldn't wake up EXIT_DEAD tasks. Actually, I think task_is_stopped_or_traced() makes more sense, but this needs another patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: SI_USER: Masquerade si_pid when crossing pid ns boundarySukadev Bhattiprolu
When sending a signal to a descendant namespace, set ->si_pid to 0 since the sender does not have a pid in the receiver's namespace. Note: - If rt_sigqueueinfo() sets si_code to SI_USER when sending a signal across a pid namespace boundary, the value in ->si_pid will be cleared to 0. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: protect cinit from blocked fatal signalsSukadev Bhattiprolu
Normally SIG_DFL signals to global and container-init are dropped early. But if a signal is blocked when it is posted, we cannot drop the signal since the receiver may install a handler before unblocking the signal. Once this signal is queued however, the receiver container-init has no way of knowing if the signal was sent from an ancestor or descendant namespace. This patch ensures that contianer-init drops all SIG_DFL signals in get_signal_to_deliver() except SIGKILL/SIGSTOP. If SIGSTOP/SIGKILL originate from a descendant of container-init they are never queued (i.e dropped in sig_ignored() in an earler patch). If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued and container-init processes the signal. IOW, if get_signal_to_deliver() sees a sig_kernel_only() signal for global or container-init, the signal must have been generated internally or must have come from an ancestor ns and we process the signal. Further, the signal_group_exit() check was needed to cover the case of a multi-threaded init sending SIGKILL to other threads when doing an exit() or exec(). But since the new sig_kernel_only() check covers the SIGKILL, the signal_group_exit() check is no longer needed and can be removed. Finally, now that we have all pieces in place, set SIGNAL_UNKILLABLE for container-inits. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: zap_pid_ns_process() should use force_sig()Sukadev Bhattiprolu
send_signal() assumes that signals with SEND_SIG_PRIV are generated from within the same namespace. So any nested container-init processes become immune to the SIGKILL generated by kill_proc_info() in zap_pid_ns_processes(). Use force_sig() in zap_pid_ns_processes() instead - force_sig() clears the SIGNAL_UNKILLABLE flag ensuring the signal is processed by container-inits. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: protect cinit from unblocked SIG_DFL signalsSukadev Bhattiprolu
Drop early any SIG_DFL or SIG_IGN signals to container-init from within the same container. But queue SIGSTOP and SIGKILL to the container-init if they are from an ancestor container. Blocked, fatal signals (i.e when SIG_DFL is to terminate) from within the container can still terminate the container-init. That will be addressed in the next patch. Note: To be bisect-safe, SIGNAL_UNKILLABLE will be set for container-inits in a follow-on patch. Until then, this patch is just a preparatory step. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: add from_ancestor_ns parameter to send_signal()Sukadev Bhattiprolu
send_signal() (or its helper) needs to determine the pid namespace of the sender. But a signal sent via kill_pid_info_as_uid() comes from within the kernel and send_signal() does not need to determine the pid namespace of the sender. So define a helper for send_signal() which takes an additional parameter, 'from_ancestor_ns' and have kill_pid_info_as_uid() use that helper directly. The 'from_ancestor_ns' parameter will be used in a follow-on patch. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02signals: protect init from unwanted signals moreOleg Nesterov
(This is a modified version of the patch submitted by Oleg Nesterov http://lkml.org/lkml/2008/11/18/249 and tries to address comments that came up in that discussion) init ignores the SIG_DFL signals but we queue them anyway, including SIGKILL. This is mostly OK, the signal will be dropped silently when dequeued, but the pending SIGKILL has 2 bad implications: - it implies fatal_signal_pending(), so we confuse things like wait_for_completion_killable/lock_page_killable. - for the sub-namespace inits, the pending SIGKILL can mask (legacy_queue) the subsequent SIGKILL from the parent namespace which must kill cinit reliably. (preparation, cinits don't have SIGNAL_UNKILLABLE yet) The patch can't help when init is ptraced, but ptracing of init is not "safe" anyway. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>