2011-07-20Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgentIngo Molnar
2011-07-20security,rcu: Convert call_rcu(whitelist_item_free) to kfree_rcu()Lai Jiangshan
The rcu callback whitelist_item_free() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(whitelist_item_free). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: James Morris <jmorris@namei.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
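For reference, this family of conversions follows one pattern (a minimal sketch, not the actual hunk; it assumes the rcu_head field is named rcu, as in the device_cgroup code this commit touches):

  /* before: an RCU callback whose only job is kfree() */
  static void whitelist_item_free(struct rcu_head *rcu)
  {
      struct dev_whitelist_item *item =
          container_of(rcu, struct dev_whitelist_item, rcu);

      kfree(item);
  }

  call_rcu(&item->rcu, whitelist_item_free);

  /* after: the generic helper frees the enclosing object for us */
  kfree_rcu(item, rcu);    /* second argument names the rcu_head member */

kfree_rcu() computes the container_of() offset at compile time, so the hand-rolled callback can be deleted outright.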
2011-07-20md,rcu: Convert call_rcu(free_conf) to kfree_rcu()Lai Jiangshan
The rcu callback free_conf() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_conf). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: NeilBrown <neilb@suse.de> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20signal: align __lock_task_sighand() irq disabling and RCUPaul E. McKenney
The __lock_task_sighand() function calls rcu_read_lock() with interrupts and preemption enabled, but later calls rcu_read_unlock() with interrupts disabled. It is therefore possible that this RCU read-side critical section will be preempted and later RCU priority boosted, which means that rcu_read_unlock() will call rt_mutex_unlock() in order to deboost itself, but with interrupts disabled. This results in lockdep splats, so this commit nests the RCU read-side critical section within the interrupt-disabled region of code. This prevents the RCU read-side critical section from being preempted, and thus prevents the attempt to deboost with interrupts disabled. It is quite possible that a better long-term fix is to make rt_mutex_unlock() disable irqs when acquiring the rt_mutex structure's ->wait_lock. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
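A simplified sketch of the ordering change (the retry loop and error handling of __lock_task_sighand() are omitted):

  /* before: the RCU read-side section spans the irqs-on -> irqs-off transition */
  rcu_read_lock();
  local_irq_save(*flags);
  sighand = rcu_dereference(tsk->sighand);
  /* ... */
  rcu_read_unlock();            /* runs with irqs off; may try to deboost */

  /* after: irqs go off first, so the read-side section cannot be preempted */
  local_irq_save(*flags);
  rcu_read_lock();
  sighand = rcu_dereference(tsk->sighand);
  /* ... */
  rcu_read_unlock();            /* never preempted, never boosted, no splat */
  local_irq_restore(*flags);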
2011-07-20softirq,rcu: Inform RCU of irq_exit() activityPeter Zijlstra
The rcu_read_unlock_special() function relies on in_irq() to exclude scheduler activity from interrupt level. This fails because irq_exit() can invoke the scheduler after clearing the preempt_count() bits that in_irq() uses to determine that it is at interrupt level. This situation can result in failures as follows:

  $task                       IRQ                     SoftIRQ

  rcu_read_lock()

  /* do stuff */

  <preempt> |= UNLOCK_BLOCKED

  rcu_read_unlock()
    --t->rcu_read_lock_nesting

                              irq_enter();
                              /* do stuff, don't use RCU */
                              irq_exit();
                                sub_preempt_count(IRQ_EXIT_OFFSET);
                                invoke_softirq()

                                                      ttwu();
                                                        spin_lock_irq(&pi->lock)
                                                        rcu_read_lock();
                                                        /* do stuff */
                                                        rcu_read_unlock();
                                                          rcu_read_unlock_special()
                                                            rcu_report_exp_rnp()
                                                              ttwu()
                                                                spin_lock_irq(&pi->lock) /* deadlock */

    rcu_read_unlock_special(t);

Ed can trigger this easily because invoke_softirq() immediately does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff first, but even without that the above happens. Cure this by also excluding softirqs from the rcu_read_unlock_special() handler and ensuring the force_irqthreads ksoftirqd/# wakeup is done from full softirq context. [ Alternatively, delaying the ->rcu_read_lock_nesting decrement until after the special handling would make the thing more robust in the face of interrupts as well. And there is a separate patch for that. ] Cc: Thomas Gleixner <tglx@linutronix.de> Reported-and-tested-by: Ed Tomlinson <edt@aei.ca> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
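The exclusion change itself is small; its core is roughly (a sketch of the modified check in rcu_read_unlock_special()):

  /* defer the special processing if we are in hard-irq OR softirq context */
  if (in_irq() || in_serving_softirq()) {
      local_irq_restore(flags);
      return;                   /* the task-level unlock will finish the job */
  }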
2011-07-20sched: Add irq_{enter,exit}() to scheduler_ipi()Peter Zijlstra
Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual work. Traditionally we never did any actual work from the resched IPI and all magic happened in the return from interrupt path. Now that we do do some work, we need to ensure irq_{enter,exit} are called so that we don't confuse things. This affects things like timekeeping, NO_HZ and RCU, basically everything with a hook in irq_enter/exit. Explicit examples of things going wrong are: sched_clock_cpu() -- has a callback when leaving NO_HZ state to take a new reading from GTOD and TSC. Without this callback, time is stuck in the past. RCU -- needs in_irq() to work in order to avoid some nasty deadlocks Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
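In outline, the handler gains the bracketing like this (a sketch only; the "pending work" test and the drain helper are stand-ins for the scheduler internals of the day, not exact names):

  void scheduler_ipi(void)
  {
      if (list_empty(&this_rq()->wake_list))
          return;                       /* common case stays a cheap no-op */

      /*
       * This IPI now does real work and is not routed through the normal
       * interrupt entry path, so tell RCU, NO_HZ and timekeeping explicitly.
       */
      irq_enter();
      sched_ttwu_pending();             /* drain the queued remote wakeups */
      irq_exit();
  }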
2011-07-20rcu: protect __rcu_read_unlock() against scheduler-using irq handlersPaul E. McKenney
The addition of RCU read-side critical sections within runqueue and priority-inheritance lock critical sections introduced some deadlock cycles, for example, involving interrupts from __rcu_read_unlock() where the interrupt handlers call wake_up(). This situation can cause the instance of __rcu_read_unlock() invoked from interrupt to do some of the processing that would otherwise have been carried out by the task-level instance of __rcu_read_unlock(). When the interrupt-level instance of __rcu_read_unlock() is called with a scheduler lock held from interrupt-entry/exit situations where in_irq() returns false, deadlock can result. This commit resolves these deadlocks by using negative values of the per-task ->rcu_read_lock_nesting counter to indicate that an instance of __rcu_read_unlock() is in flight, which in turn prevents instances invoked from interrupt handlers from doing any special processing. This patch is inspired by Steven Rostedt's earlier patch that similarly made __rcu_read_unlock() guard against interrupt-mediated recursion (see https://lkml.org/lkml/2011/7/15/326), but this commit refines Steven's approach to avoid the need for preemption disabling on the __rcu_read_unlock() fastpath and to also avoid the need for manipulating a separate per-CPU variable. This patch avoids the need for preempt_disable() by instead using negative values of the per-task ->rcu_read_lock_nesting counter. Note that nested rcu_read_lock()/rcu_read_unlock() pairs are still permitted, but they will never see ->rcu_read_lock_nesting go to zero, and will therefore never invoke rcu_read_unlock_special(), thus preventing them from seeing the RCU_READ_UNLOCK_BLOCKED bit should it be set in ->rcu_read_unlock_special. This patch also adds a check for ->rcu_read_lock_nesting being negative in rcu_check_callbacks(), thus preventing the RCU_READ_UNLOCK_NEED_QS bit from being set should a scheduling-clock interrupt occur while __rcu_read_unlock() is exiting from an outermost RCU read-side critical section. Of course, __rcu_read_unlock() can be preempted during the time that ->rcu_read_lock_nesting is negative. This could result in the setting of the RCU_READ_UNLOCK_BLOCKED bit after __rcu_read_unlock() checks it, and would also result in this task being queued on the corresponding rcu_node structure's blkd_tasks list. Therefore, some later RCU read-side critical section would enter rcu_read_unlock_special() to clean up -- which could result in deadlock if that critical section happened to be in the scheduler where the runqueue or priority-inheritance locks were held. This situation is dealt with by making rcu_preempt_note_context_switch() check for negative ->rcu_read_lock_nesting, thus refraining from queuing the task (and from setting RCU_READ_UNLOCK_BLOCKED) if we are already exiting from the outermost RCU read-side critical section (in other words, we really are no longer actually in that RCU read-side critical section). In addition, rcu_preempt_note_context_switch() invokes rcu_read_unlock_special() to carry out the cleanup in this case, which clears out the ->rcu_read_unlock_special bits and dequeues the task (if necessary), in turn avoiding needless delay of the current RCU grace period and needless RCU priority boosting. It is still illegal to call rcu_read_unlock() while holding a scheduler lock if the prior RCU read-side critical section has ever had either preemption or irqs enabled. However, the common use case is legal, namely where the entire RCU read-side critical section executes with irqs disabled, for example, when the scheduler lock is held across the entire lifetime of the RCU read-side critical section. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
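A condensed sketch of the resulting fastpath (debug checks dropped; the negative marker value and the barriers mirror the description above):

  void __rcu_read_unlock(void)
  {
      struct task_struct *t = current;

      if (t->rcu_read_lock_nesting != 1) {
          --t->rcu_read_lock_nesting;           /* nested: never reaches zero */
      } else {
          t->rcu_read_lock_nesting = INT_MIN;   /* "unlock in flight" marker */
          barrier();                            /* marker visible before the test */
          if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
              rcu_read_unlock_special(t);
          barrier();                            /* finish special work first */
          t->rcu_read_lock_nesting = 0;         /* really outside the section */
      }
  }

Any interrupt handler that fires inside the else-branch sees a negative nesting count and therefore skips the special processing, which is exactly the property the deadlock avoidance relies on.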
2011-07-20slab: shrink sizeof(struct kmem_cache)Eric Dumazet
Reduce high order allocations for some setups. (NR_CPUS=4096 -> we need 64KB per kmem_cache struct.) We now allocate exactly the needed size (using nr_cpu_ids and nr_node_ids). This also makes the code a bit smaller on x86_64, since some field offsets are less than the 127 limit:

  Before patch :

  # size mm/slab.o
     text    data     bss     dec     hex  filename
    22605  361665      32  384302   5dd2e  mm/slab.o

  After patch :

  # size mm/slab.o
     text    data     bss     dec     hex  filename
    22349  353473    8224  384046   5dc2e  mm/slab.o

CC: Andrew Morton <akpm@linux-foundation.org> Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-20sched: Avoid creating superfluous NUMA domains on non-NUMA systemsPeter Zijlstra
When creating sched_domains, stop when we've covered the entire target span instead of continuing to create domains, only to later find they're redundant and throw them away again. This keeps single node systems from touching the funny NUMA sched_domain creation code and reduces the risks of the new SD_OVERLAP code. Requested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Anton Blanchard <anton@samba.org> Cc: mahesh@linux.vnet.ibm.com Cc: benh@kernel.crashing.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1311180177.29152.57.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-20sched: Allow for overlapping sched_domain spansPeter Zijlstra
Allow for sched_domain spans that overlap by giving such domains their own sched_group list instead of sharing the sched_groups amongst each other. This is needed for machines with more than 16 nodes, because sched_domain_node_span() will generate a node mask from the 16 nearest nodes without regard to whether these masks overlap. Currently sched_domains have a sched_group that maps to their child sched_domain span, and since there is no overlap we share the sched_group between the sched_domains of the various CPUs. If, however, there is overlap, we would need to link the sched_group list in different ways for each cpu, and hence sharing isn't possible. In order to solve this, allocate private sched_groups for each CPU's sched_domain but have the sched_groups share a sched_group_power structure such that we can uniquely track the power. Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-20sched: Break out cpu_power from the sched_group structurePeter Zijlstra
In order to prepare for non-unique sched_groups per domain, we need to carry the cpu_power elsewhere, so put a level of indirection in. Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
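The indirection described in this and the previous entry amounts to something like (a sketch of the data-structure split, not the exact declarations):

  struct sched_group_power {
      atomic_t       ref;       /* shared by all overlapping copies of a group */
      unsigned int   power;     /* CPU power of the group */
  };

  struct sched_group {
      struct sched_group        *next;   /* can now be a per-CPU private list */
      struct sched_group_power  *sgp;    /* power accounting stays shared */
      unsigned long             cpumask[0];
  };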
2011-07-20HID: emsff: properly handle emsff_init failureAxel Lin
emsff_init() may fail, let's properly handle the failure. Signed-off-by: Axel Lin <axel.lin@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-07-20superblock: move pin_sb_for_writeback() to fs/super.cDave Chinner
The per-sb shrinker has the same requirement as the writeback threads of ensuring that the superblock is usable and pinned for the time it takes to run the work. Both need to take a passive reference to the sb, take a read lock on the s_umount lock and then only continue if an unmount is not in progress. pin_sb_for_writeback() does this exactly, so move it to fs/super.c, rename it to grab_super_passive() and export it via fs/internal.h so that all the VFS code can use it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
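Reconstructed from the description, the moved helper is roughly (a sketch; put_super() is the existing fs/super.c reference drop):

  bool grab_super_passive(struct super_block *sb)
  {
      spin_lock(&sb_lock);
      sb->s_count++;                    /* passive reference */
      spin_unlock(&sb_lock);

      if (down_read_trylock(&sb->s_umount)) {
          if (sb->s_root)               /* no unmount in progress */
              return true;
          up_read(&sb->s_umount);
      }

      put_super(sb);                    /* drop the passive reference */
      return false;
  }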
2011-07-20inode: move to per-sb LRU locksDave Chinner
With the inode LRUs moving to per-sb structures, there is no longer a need for a global inode_lru_lock. The locking can be made more fine-grained by moving to a per-sb LRU lock, isolating the LRU operations of different filesystems completely from each other. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20inode: Make unused inode LRU per superblockDave Chinner
The inode unused list is currently a global LRU. This does not match the other global filesystem cache - the dentry cache - which uses per-superblock LRU lists. Hence we have related filesystem object types using different LRU reclamation schemes. To enable a per-superblock filesystem cache shrinker, both of these caches need to have per-sb unused object LRU lists. Hence this patch converts the global inode LRU to per-sb LRUs. The patch only does rudimentary per-sb proportioning in the shrinker infrastructure, as this gets removed when the per-sb shrinker callouts are introduced later on. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
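Together with the per-sb lock from the entry above, the superblock ends up carrying fields along these lines (a sketch of the additions, not the full struct):

  struct super_block {
      /* ... */
      spinlock_t        s_inode_lru_lock ____cacheline_aligned_in_smp;
      struct list_head  s_inode_lru;           /* unused inodes of this sb */
      int               s_nr_inodes_unused;    /* protected by the LRU lock */
      /* ... */
  };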
2011-07-20inode: convert inode_stat.nr_unused to per-cpu countersDave Chinner
Before we split up the inode_lru_lock, the unused inode counter needs to be made independent of the global inode_lru_lock. Convert it to per-cpu counters to do this. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
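The pattern is the usual one: bump a per-cpu variable on the hot path and only sum it when somebody actually reads the statistic (a sketch; the helper names are illustrative rather than exact):

  static DEFINE_PER_CPU(unsigned int, nr_unused);

  static inline void inode_lru_count_inc(void) { this_cpu_inc(nr_unused); }
  static inline void inode_lru_count_dec(void) { this_cpu_dec(nr_unused); }

  static int get_nr_inodes_unused(void)
  {
      int i, sum = 0;

      for_each_possible_cpu(i)
          sum += per_cpu(nr_unused, i);
      return sum < 0 ? 0 : sum;   /* per-cpu skew can make the sum dip below 0 */
  }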
2011-07-20vmscan: add customisable shrinker batch sizeDave Chinner
For shrinkers that have their own cond_resched* calls, having shrink_slab break the work down into small batches is not particularly efficient. Add a custom batchsize field to the struct shrinker so that shrinkers can use a larger batch size if they desire. A value of zero (uninitialised) means "use the default", so behaviour is unchanged by this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
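The consumer side in shrink_slab() then picks the batch size per shrinker (a sketch; SHRINK_BATCH is the existing default of 128, and sc stands in for the shrink_control passed down):

  struct shrinker {
      int (*shrink)(struct shrinker *, struct shrink_control *sc);
      int  seeks;
      long batch;        /* 0 (unset) means "use SHRINK_BATCH" */
      /* internal use: */
      struct list_head list;
      long nr;           /* deferred scan count */
  };

  /* in shrink_slab(): */
  long batch_size = shrinker->batch ? shrinker->batch : SHRINK_BATCH;

  while (total_scan >= batch_size) {
      sc->nr_to_scan = batch_size;
      shrinker->shrink(shrinker, sc);
      total_scan -= batch_size;
      cond_resched();
  }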
2011-07-20vmscan: reduce wind up shrinker->nr when shrinker can't do workDave Chinner
When a shrinker returns -1 to shrink_slab() to indicate it cannot do any work given the current memory reclaim requirements, it adds the entire total_scan count to shrinker->nr. The idea behind this is that when the shrinker is next called and can do work, it will do the work of the previously aborted shrinker call as well. However, if a filesystem is doing lots of allocation with GFP_NOFS set, then we get many, many more aborts from the shrinkers than we do successful calls. The result is that shrinker->nr winds up to its maximum permissible value (twice the current cache size) and then when the next shrinker call that can do work is issued, it has enough scan count built up to free the entire cache twice over. This manifests itself in the cache going from full to empty in a matter of seconds, even when only a small part of the cache needs to be emptied to free sufficient memory. Under metadata intensive workloads on ext4 and XFS, I'm seeing the VFS caches increase memory consumption up to 75% of memory (no page cache pressure) over a period of 30-60s, and then the shrinker empties them down to zero in the space of 2-3s. This cycle repeats over and over again, with the shrinker completely trashing the inode and dentry caches every minute or so for as long as the workload continues. This behaviour was made obvious by the shrink_slab tracepoints added earlier in the series, and made worse by the patch that corrected the concurrent accounting of shrinker->nr. To avoid this problem, stop repeated small increments of the total scan value from winding shrinker->nr up to a value that can cause the entire cache to be freed. We still need to allow it to wind up, so use the delta as the "large scan" threshold check - if the delta is more than a quarter of the entire cache size, then it is a large scan and allowed to cause lots of windup because we clearly need to free lots of memory. If it isn't a large scan then limit the total scan to half the size of the cache so that windup never increases to consume the whole cache. Reducing the total scan limit further does not allow enough wind-up to maintain the current levels of performance, whilst a higher threshold does not prevent the windup from freeing the entire cache under sustained workloads. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
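The clamp boils down to a few lines in shrink_slab() (a sketch; max_pass is the shrinker's reported cache size, delta the scan amount computed for this call):

  /*
   * A small delta must not keep winding total_scan up: cap it at half the
   * cache.  A delta above a quarter of the cache is a genuine "large scan"
   * and is allowed to wind up further, since we clearly need the memory.
   */
  if (delta < max_pass / 4)
      total_scan = min(total_scan, max_pass / 2);

  /* and never let one call scan more than twice the cache size */
  if (total_scan > max_pass * 2)
      total_scan = max_pass * 2;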
2011-07-20vmscan: shrinker->nr updates race and go wrongDave Chinner
shrink_slab() allows shrinkers to be called in parallel so the struct shrinker can be updated concurrently. It does not provide any exclusion for such updates, so we can get the shrinker->nr value increasing or decreasing incorrectly. As a result, when a shrinker repeatedly returns a value of -1 (e.g. a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire, sometimes updating with the scan count that wasn't used, sometimes losing it altogether. Worse is when a shrinker does work and that update is lost due to racy updates, which means the shrinker will do the work again! Fix this by making the total_scan calculations independent of shrinker->nr, and making the shrinker->nr updates atomic w.r.t. other updates via cmpxchg loops. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
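The atomic part is a pair of cmpxchg() loops at either end of the scan (a sketch of the pattern in shrink_slab()):

  /* atomically take ownership of the deferred count ... */
  do {
      nr = shrinker->nr;
  } while (cmpxchg(&shrinker->nr, nr, 0) != nr);

  total_scan = nr + delta;          /* all further math on private copies */

  /* ... scan ...; then atomically give back whatever was not scanned */
  do {
      nr = shrinker->nr;
      new_nr = total_scan + nr;
  } while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);

Concurrent shrink_slab() calls can still interleave, but no unused scan count is silently lost or double-counted any more.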
2011-07-20vmscan: add shrink_slab tracepointsDave Chinner
It is impossible to understand what the shrinkers are actually doing without instrumenting the code, so add some tracepoints to allow insight to be gained. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)Al Viro
... and simplify the living hell out of callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
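The caller simplification this buys looks like (a sketch of a typical ->lookup() tail; foo_lookup() and foo_iget() are placeholders for a filesystem's own lookup and whatever returns the inode or an ERR_PTR):

  static struct dentry *foo_lookup(struct inode *dir, struct dentry *dentry,
                                   struct nameidata *nd)
  {
      struct inode *inode = foo_iget(dir, &dentry->d_name);

      /* inode may be NULL, a real inode, or ERR_PTR(err);
       * d_splice_alias() now copes with all three */
      return d_splice_alias(inode, dentry);
  }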
2011-07-20deuglify squashfs_lookup()Al Viro
d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL) followed by returning NULL, so there's no need for that if (inode) ... in there (or for the ERR_PTR(0), for that matter) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20nfsd4_list_rec_dir(): don't bother with reopening rec_fileAl Viro
just rewind it to the beginning before vfs_readdir() and be done with that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20kill useless checks for sb->s_op == NULLAl Viro
never is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20btrfs: kill magical embedded struct superblockAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20get rid of pointless checks for dentry->sb == NULLAl Viro
it never is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20Make ->d_sb assign-once and always non-NULLAl Viro
New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name). Allocates dentry, sets its ->d_sb to the given superblock and sets ->d_op accordingly. Old d_alloc(NULL, name) callers are converted to that (all of them know what superblock they want). d_alloc() itself is left only for the parent != NULL case; uses __d_alloc(), inserts result into the list of parent's children. Note that now ->d_sb is assign-once and never NULL *and* ->d_parent is never NULL either. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
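The resulting split is roughly (a sketch; refcounting, name copying and locking are trimmed):

  /* fs/internal.h only */
  struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
  {
      struct dentry *dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);

      if (!dentry)
          return NULL;
      /* ... initialise name, refcount, lists ... */
      dentry->d_parent = dentry;               /* self until someone adopts it */
      dentry->d_sb = sb;                       /* assign-once, never NULL */
      d_set_d_op(dentry, dentry->d_sb->s_d_op);
      return dentry;
  }

  struct dentry *d_alloc(struct dentry *parent, const struct qstr *name)
  {
      struct dentry *dentry = __d_alloc(parent->d_sb, name);

      if (!dentry)
          return NULL;
      /* ... take d_lock, set d_parent, hook into parent's d_subdirs ... */
      return dentry;
  }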
2011-07-20unexport kern_path_parent()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20switch vfs_path_lookup() to struct pathAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20kill lookup_create()Al Viro
folded into the only caller (kern_path_create()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20devtmpfs: get rid of bogus mkdir in create_path()Al Viro
We do _NOT_ want to mkdir the path itself - we are preparing to mknod it, after all. Normally it'll fail with -ENOENT and just do nothing, but if somebody has created the parent in the meanwhile, we'll get buggered... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20switch devtmpfs to kern_path_create()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20switch devtmpfs object creation/removal to separate kernel threadAl Viro
... and give it a namespace where devtmpfs would be mounted on root, thus avoiding abuses of vfs_path_lookup() (it was never intended to be used with LOOKUP_PARENT). Games with credentials are also gone. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20make sure that nsproxy_cache is initialized early enoughAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20switch do_spufs_create() to user_path_create(), fix double-unlockAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20new helpers: kern_path_create/user_path_createAl Viro
combination of kern_path_parent() and lookup_create(). Does *not* expose struct nameidata to caller. Syscalls converted to that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
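For reference, the helpers have roughly this shape (signatures as I read this series, so treat them as an approximation; the parent directory comes back locked and the caller finishes with the usual dput()/unlock/path_put() sequence):

  /* resolve everything but the last component, lock the parent directory,
   * and return the (negative) dentry that is about to be created */
  struct dentry *kern_path_create(int dfd, const char *pathname,
                                  struct path *path, int is_dir);

  struct dentry *user_path_create(int dfd, const char __user *pathname,
                                  struct path *path, int is_dir);

Syscalls such as mknodat()/mkdirat() then operate on the returned dentry with path->dentry as the locked parent, never seeing a struct nameidata.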
2011-07-20kill LOOKUP_CONTINUEAl Viro
LOOKUP_PARENT is equivalent to it now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20nfs: LOOKUP_{OPEN,CREATE,EXCL} is set only on the last stepAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20cifs_lookup(): LOOKUP_OPEN is set only on the last componentAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20ceph: LOOKUP_OPEN is set only when it's the last componentAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20jfs_ci_revalidate() is safe from RCU modeAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20LOOKUP_CREATE and LOOKUP_RENAME_TARGET can be set only on the last stepAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20no need to check for LOOKUP_OPEN in ->create() instancesAl Viro
... it will be set in nd->flags for all cases with non-NULL nd (i.e. when called from do_last()). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20don't pass nameidata to vfs_create() from ecryptfs_create()Al Viro
Instead of playing with removal of LOOKUP_OPEN and mangling (and restoring) nd->path, just pass NULL to vfs_create(). The whole point of what's being done there is to suppress any attempt by the underlying fs to open the file, which is exactly what nd == NULL indicates. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20don't transliterate lower bits of ->intent.open.flags to FMODE_...Al Viro
->create() instances are much happier that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20Don't pass nameidata when calling vfs_create() from mknod()Al Viro
All instances can cope with that now (and ceph one actually starts working properly). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20fix mknod() on nfs4 (hopefully)Al Viro
a) check the right flags in ->create() (LOOKUP_OPEN, not LOOKUP_CREATE) b) default (!LOOKUP_OPEN) open_flags is O_CREAT|O_EXCL|FMODE_READ, not 0 c) lookup_instantiate_filp() should be done only with LOOKUP_OPEN; otherwise we need to issue CLOSE, lest we leak stateid on server. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20nameidata_to_nfs_open_context() doesn't need nameidata, actually...Al Viro
just open flags; switched to passing just those and renamed to create_nfs_open_context() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20nfs_open_context doesn't need struct path eitherAl Viro
just dentry, please... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20nfs4_opendata doesn't need struct path eitherAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>