summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2010-12-20xfs: Pull EFI/EFD handling out from under the AIL lockDave Chinner
EFI/EFD interactions are protected from races by the AIL lock. They are the only type of log items that require the the AIL lock to serialise internal state, so they need to be separated from the AIL lock before we can do bulk insert operations on the AIL. To acheive this, convert the counter of the number of extents in the EFI to an atomic so it can be safely manipulated by EFD processing without locks. Also, convert the EFI state flag manipulations to use atomic bit operations so no locks are needed to record state changes. Finally, use the state bits to determine when it is safe to free the EFI and clean up the code to do this neatly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-20xfs: fix EFI transaction cancellation.Dave Chinner
XFS_EFI_CANCELED has not been set in the code base since xfs_efi_cancel() was removed back in 2006 by commit 065d312e15902976d256ddaf396a7950ec0350a8 ("[XFS] Remove unused iop_abort log item operation), and even then xfs_efi_cancel() was never called. I haven't tracked it back further than that (beyond git history), but it indicates that the handling of EFIs in cancelled transactions has been broken for a long time. Basically, when we get an IOP_UNPIN(lip, 1); call from xfs_trans_uncommit() (i.e. remove == 1), if we don't free the log item descriptor we leak it. Fix the behviour to be correct and kill the XFS_EFI_CANCELED flag. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-03Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6Steve French
2010-12-02reiserfs: don't acquire lock recursively in reiserfs_acl_chmodFrederic Weisbecker
reiserfs_acl_chmod() can be called by reiserfs_set_attr() and then take the reiserfs lock a second time. Thereafter it may call journal_begin() that definitely requires the lock not to be nested in order to release it before taking the journal mutex because the reiserfs lock depends on the journal mutex already. So, aviod nesting the lock in reiserfs_acl_chmod(). Reported-by: Pawel Zawora <pzawora@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Tested-by: Pawel Zawora <pzawora@gmail.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: <stable@kernel.org> [2.6.32.x+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-02cifs: add attribute cache timeout (actimeo) tunableSuresh Jayaraman
Currently, the attribute cache timeout for CIFS is hardcoded to 1 second. This means that the client might have to issue a QPATHINFO/QFILEINFO call every 1 second to verify if something has changes, which seems too expensive. On the other hand, if the timeout is hardcoded to a higher value, workloads that expect strict cache coherency might see unexpected results. Making attribute cache timeout as a tunable will allow us to make a tradeoff between performance and cache metadata correctness depending on the application/workload needs. Add 'actimeo' tunable that can be used to tune the attribute cache timeout. The default timeout is set to 1 second. Also, display actimeo option value in /proc/mounts. It appears to me that 'actimeo' and the proposed (but not yet merged) 'strictcache' option cannot coexist, so care must be taken that we reset the other option if one of them is set. Changes since last post: - fix option parsing and handle possible values correcly Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-02Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: only run xfs_error_test if error injection is active xfs: avoid moving stale inodes in the AIL xfs: delayed alloc blocks beyond EOF are valid after writeback xfs: push stale, pinned buffers on trylock failures xfs: fix failed write truncation handling.
2010-12-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: fix parsing of hostname in dfs referrals cifs: display fsc in /proc/mounts cifs: enable fscache iff fsc mount option is used explicitly cifs: allow fsc mount option only if CONFIG_CIFS_FSCACHE is set cifs: Handle extended attribute name cifs_acl to generate cifs acl blob (try #4) cifs: Misc. cleanup in cifsacl handling [try #4] cifs: trivial comment fix for cifs_invalidate_mapping [CIFS] fs/cifs/Kconfig: CIFS depends on CRYPTO_HMAC cifs: don't take extra tlink reference in initiate_cifs_search cifs: Percolate error up to the caller during get/set acls [try #4] cifs: fix another memleak, in cifs_root_iget cifs: fix potential use-after-free in cifs_oplock_break_put
2010-12-02NFS: Fix a memory leak in nfs_readdirTrond Myklebust
We need to ensure that the entries in the nfs_cache_array get cleared when the page is removed from the page cache. To do so, we use the freepage address_space operation. Change nfs_readdir_clear_array to use kmap_atomic(), so that the function can be safely called from all contexts. Finally, modify the cache_page_release helper to call nfs_readdir_clear_array directly, when dealing with an anonymous page from 'uncached_readdir'. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-02xfs: connect up buffer reclaim priority hooksDave Chinner
Now that the buffer reclaim infrastructure can handle different reclaim priorities for different types of buffers, reconnect the hooks in the XFS code that has been sitting dormant since it was ported to Linux. This should finally give use reclaim prioritisation that is on a par with the functionality that Irix provided XFS 15 years ago. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-02xfs: add a lru to the XFS buffer cacheDave Chinner
Introduce a per-buftarg LRU for memory reclaim to operate on. This is the last piece we need to put in place so that we can fully control the buffer lifecycle. This allows XFS to be responsibile for maintaining the working set of buffers under memory pressure instead of relying on the VM reclaim not to take pages we need out from underneath us. The implementation introduces a b_lru_ref counter into the buffer. This is currently set to 1 whenever the buffer is referenced and so is used to determine if the buffer should be added to the LRU or not when freed. Effectively it allows lazy LRU initialisation of the buffer so we do not need to touch the LRU list and locks in xfs_buf_find(). Instead, when the buffer is being released and we drop the last reference to it, we check the b_lru_ref count and if it is none zero we re-add the buffer reference and add the inode to the LRU. The b_lru_ref counter is decremented by the shrinker, and whenever the shrinker comes across a buffer with a zero b_lru_ref counter, if released the LRU reference on the buffer. In the absence of a lookup race, this will result in the buffer being freed. This counting mechanism is used instead of a reference flag so that it is simple to re-introduce buffer-type specific reclaim reference counts to prioritise reclaim more effectively. We still have all those hooks in the XFS code, so this will provide the infrastructure to re-implement that functionality. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01ceph: Behave better when handling file lock replies.Herb Shiu
Fill in the local lock with response data if appropriate, and don't call posix_lock_file when reading locks. Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com> Acked-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01ceph: pass lock information by struct file_lock instead of as individual params.Herb Shiu
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com> Acked-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01ceph: Handle file locks in replies from the MDS.Herb Shiu
Previously the kernel client incorrectly assumed everything was a directory. Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com> Acked-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01ceph: avoid possible null deref in readdir after dir llseekSage Weil
last may be NULL, but we dereference it in the else branch without checking. Normally it doesn't trigger because last == NULL when fpos == 2, but it could happen on a newly opened dir if the user seeks forward. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01xfs: only run xfs_error_test if error injection is activeDave Chinner
Recent tests writing lots of small files showed the flusher thread being CPU bound and taking a long time to do allocations on a debug kernel. perf showed this as the prime reason: samples pcnt function DSO _______ _____ ___________________________ _________________ 224648.00 36.8% xfs_error_test [kernel.kallsyms] 86045.00 14.1% xfs_btree_check_sblock [kernel.kallsyms] 39778.00 6.5% prandom32 [kernel.kallsyms] 37436.00 6.1% xfs_btree_increment [kernel.kallsyms] 29278.00 4.8% xfs_btree_get_rec [kernel.kallsyms] 27717.00 4.5% random32 [kernel.kallsyms] Walking btree blocks during allocation checking them requires each block (a cache hit, so no I/O) call xfs_error_test(), which then does a random32() call as the first operation. IOWs, ~50% of the CPU is being consumed just testing whether we need to inject an error, even though error injection is not active. Kill this overhead when error injection is not active by adding a global counter of active error traps and only calling into xfs_error_test when fault injection is active. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01xfs: avoid moving stale inodes in the AILDave Chinner
When an inode has been marked stale because the cluster is being freed, we don't want to (re-)insert this inode into the AIL. There is a race condition where the cluster buffer may be unpinned before the inode is inserted into the AIL during transaction committed processing. If the buffer is unpinned before the inode item has been committed and inserted, then it is possible for the buffer to be released and hence processthe stale inode callbacks before the inode is inserted into the AIL. In this case, we then insert a clean, stale inode into the AIL which will never get removed by an IO completion. It will, however, get reclaimed and that triggers an assert in xfs_inode_free() complaining about freeing an inode still in the AIL. This race can be avoided by not moving stale inodes forward in the AIL during transaction commit completion processing. This closes the race condition by ensuring we never insert clean stale inodes into the AIL. It is safe to do this because a dirty stale inode, by definition, must already be in the AIL. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01xfs: delayed alloc blocks beyond EOF are valid after writebackDave Chinner
There is an assumption in the parts of XFS that flushing a dirty file will make all the delayed allocation blocks disappear from an inode. That is, that after calling xfs_flush_pages() then ip->i_delayed_blks will be zero. This is an invalid assumption as we may have specualtive preallocation beyond EOF and they are recorded in ip->i_delayed_blks. A flush of the dirty pages of an inode will not change the state of these blocks beyond EOF, so a non-zero deeelalloc block count after a flush is valid. The bmap code has an invalid ASSERT() that needs to be removed, and the swapext code has a bug in that while it swaps the data forks around, it fails to swap the i_delayed_blks counter associated with the fork and hence can get the block accounting wrong. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01xfs: push stale, pinned buffers on trylock failuresDave Chinner
As reported by Nick Piggin, XFS is suffering from long pauses under highly concurrent workloads when hosted on ramdisks. The problem is that an inode buffer is stuck in the pinned state in memory and as a result either the inode buffer or one of the inodes within the buffer is stopping the tail of the log from being moved forward. The system remains in this state until a periodic log force issued by xfssyncd causes the buffer to be unpinned. The main problem is that these are stale buffers, and are hence held locked until the transaction/checkpoint that marked them state has been committed to disk. When the filesystem gets into this state, only the xfssyncd can cause the async transactions to be committed to disk and hence unpin the inode buffer. This problem was encountered when scaling the busy extent list, but only the blocking lock interface was fixed to solve the problem. Extend the same fix to the buffer trylock operations - if we fail to lock a pinned, stale buffer, then force the log immediately so that when the next attempt to lock it comes around, it will have been unpinned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01xfs: fix failed write truncation handling.Dave Chinner
Since the move to the new truncate sequence we call xfs_setattr to truncate down excessively instanciated blocks. As shown by the testcase in kernel.org BZ #22452 that doesn't work too well. Due to the confusion of the internal inode size, and the VFS inode i_size it zeroes data that it shouldn't. But full blown truncate seems like overkill here. We only instanciate delayed allocations in the write path, and given that we never released the iolock we can't have converted them to real allocations yet either. The only nasty case is pre-existing preallocation which we need to skip. We already do this for page discard during writeback, so make the delayed allocation block punching a generic function and call it from the failed write path as well as xfs_aops_discard_page. The callers are responsible for ensuring that partial blocks are not truncated away, and that they hold the ilock. Based on a fix originally from Christoph Hellwig. This version used filesystem blocks as the range unit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01NFS: Ensure we use the correct cookie in nfs_readdir_xdr_fillerTrond Myklebust
We need to use the cookie from the previous array entry, not the actual cookie that we are searching for (except for the case of uncached_readdir). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-30exec: copy-and-paste the fixes into compat_do_execve() pathsOleg Nesterov
Note: this patch targets 2.6.37 and tries to be as simple as possible. That is why it adds more copy-and-paste horror into fs/compat.c and uglifies fs/exec.c, this will be cleanuped later. compat_copy_strings() plays with bprm->vma/mm directly and thus has two problems: it lacks the RLIMIT_STACK check and argv/envp memory is not visible to oom killer. Export acct_arg_size() and get_arg_page(), change compat_copy_strings() to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0) as do_execve() does. Add the fatal_signal_pending/cond_resched checks into compat_count() and compat_copy_strings(), this matches the code in fs/exec.c and certainly makes sense. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-30exec: make argv/envp memory visible to oom-killerOleg Nesterov
Brad Spengler published a local memory-allocation DoS that evades the OOM-killer (though not the virtual memory RLIMIT): http://www.grsecurity.net/~spender/64bit_dos.c execve()->copy_strings() can allocate a lot of memory, but this is not visible to oom-killer, nobody can see the nascent bprm->mm and take it into account. With this patch get_arg_page() increments current's MM_ANONPAGES counter every time we allocate the new page for argv/envp. When do_execve() succeds or fails, we change this counter back. Technically this is not 100% correct, we can't know if the new page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but I don't think this really matters and everything becomes correct once exec changes ->mm or fails. Reported-by: Brad Spengler <spender@grsecurity.net> Reviewed-and-discussed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-30cifs: fix parsing of hostname in dfs referralsJeff Layton
The DFS referral parsing code does a memchr() call to find the '\\' delimiter that separates the hostname in the referral UNC from the sharename. It then uses that value to set the length of the hostname via pointer subtraction. Instead of subtracting the start of the hostname however, it subtracts the start of the UNC, which causes the code to pass in a hostname length that is 2 bytes too long. Regression introduced in commit 1a4240f4. Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl> Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Wang Lei <wang840925@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: stable@kernel.org Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30NFS: Fix a readdirplus bugTrond Myklebust
When comparing filehandles in the helper nfs_same_file(), we should not be using 'strncmp()': filehandles are not null terminated strings. Instead, we should just use the existing helper nfs_compare_fh(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-30GFS2: Merge glock state fields into a bitfieldSteven Whitehouse
We can only merge the fields into a bitfield if the locking rules for them are the same. In this case gl_spin covers all of the fields (write side) but a couple of them are used with GLF_LOCK as the read side lock, which should be ok since we know that the field in question won't be changing at the time. The gl_req setting has to be done earlier (in glock.c) in order to place it under gl_spin. The gl_reply setting also has to be brought under gl_spin in order to comply with the new rules. This saves 4*sizeof(unsigned int) per glock. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com>
2010-11-30GFS2: Fix uninitialised error value in previous patchSteven Whitehouse
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30GFS2: fix recursive locking during rindex truncatesBenjamin Marzinski
When you truncate the rindex file, you need to avoid calling gfs2_rindex_hold, since you already hold it. However, if you haven't already read in the resource groups, you need to do that. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30fuse: verify ioctl retriesMiklos Szeredi
Verify that the total length of the iovec returned in FUSE_IOCTL_RETRY doesn't overflow iov_length(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Tejun Heo <tj@kernel.org> CC: <stable@kernel.org> [2.6.31+]
2010-11-30fuse: fix ioctl when server is 32bitMiklos Szeredi
If a 32bit CUSE server is run on 64bit this results in EIO being returned to the caller. The reason is that FUSE_IOCTL_RETRY reply was defined to use 'struct iovec', which is different on 32bit and 64bit archs. Work around this by looking at the size of the reply to determine which struct was used. This is only needed if CONFIG_COMPAT is defined. A more permanent fix for the interface will be to use the same struct on both 32bit and 64bit. Reported-by: "ccmail111" <ccmail111@yahoo.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Tejun Heo <tj@kernel.org> CC: <stable@kernel.org> [2.6.31+]
2010-11-30GFS2: reread rindex when necessary to grow rindexBenjamin Marzinski
When GFS2 grew the filesystem, it was never rereading the rindex file during the grow. This is necessary for large grows when the filesystem is almost full, and GFS2 needs to use some of the space allocated earlier in the grow to complete it. Now, if GFS2 fails to reserve the necessary space and the rindex file is not uptodate, it rereads it. Also, the only difference between gfs2_ri_update() and gfs2_ri_update_special() was that gfs2_ri_update_special() didn't clear out the existing resource groups, since you knew that it was only called when there were no resource groups. Attempting to clear out the resource groups when there are none takes almost no time, and rarely happens, so I simply removed gfs2_ri_update_special(). Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30GFS2: Remove duplicate #defines from glock.hSteven Whitehouse
There are a number of duplicated #defines in glock.h plus one which is unused. This removes the extra definitions. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30sched: Add 'autogroup' scheduling feature: automated per session task groupsMike Galbraith
A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. Currently, only per session autogroups are implemented, but the patch leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls setsid(), a new task group is created, the process is moved into the new task group, and a reference to the preveious task group is dropped. Child processes inherit this task group thereafter, and increase it's refcount. When the last thread of a process exits, the process's reference is dropped, such that when the last process referencing an autogroup exits, the autogroup is destroyed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used. Autogroup bandwidth is controllable via setting it's nice level through the proc filesystem: cat /proc/<pid>/autogroup Displays the task's group and the group's nice level. echo <nice level> > /proc/<pid>/autogroup Sets the task group's shares to the weight of nice <level> task. Setting nice level is rate limited for !admin users due to the abuse risk of task group locking. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the boot option noautogroup, and can also be turned on/off on the fly via: echo [01] > /proc/sys/kernel/sched_autogroup_enabled ... which will automatically move tasks to/from the root task group. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30ARM: 6485/5: proc/vmcore - allow archs to override vmcore_elf_check_arch()Mika Westerberg
Allow architectures to redefine this macro if needed. This is useful for example in architectures where 64-bit ELF vmcores are not supported. Specifying zero vmcore_elf64_check_arch() allows compiler to optimize away unnecessary parts of parse_crash_elf64_headers(). We also rename the macro to vmcore_elf64_check_arch() to reflect that it is used for 64-bit vmcores only. Signed-off-by: Mika Westerberg <mika.westerberg@iki.fi> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-11-30GFS2: Clean up of gdlm_lock functionSteven Whitehouse
The DLM never returns -EAGAIN in response to dlm_lock(), and even if it did, the test in gdlm_lock() was wrong anyway. Once that test is removed, it is possible to greatly simplify this code by simply using a "normal" error return code (0 for success). We then no longer need the LM_OUT_ASYNC return code which can be removed. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30GFS2: Allow gfs2 to update quota usage values through the quotactl interfaceAbhijith Das
With this patch the gfs2_set_dqblk() function will be able to update the quota usage block count (FS_DQ_BCOUNT) in addition to the already supported FS_DQ_BHARD (limit) and FS_DQ_BSOFT (warn) fields of the dquot structure. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30GFS2: fs/gfs2/glock.h: Add __attribute__((format(printf,2,3)) to gfs2_print_dbgJoe Perches
Functions that use printf formatting, especially those that use %pV, should have their uses of printf format and arguments checked by the compiler. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30GFS2: fs/gfs2/glock.c: Use printf extension %pVJoe Perches
Using %pV reduces the number of printk calls and eliminates any possible message interleaving from other printk calls. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30GFS2: Clean up duplicated setattr codeSteven Whitehouse
While preparing the last patch I noticed that the gfs2_setattr_simple code had been duplicated into two other places. This patch updates those to call gfs2_setattr_simple rather than open coding it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30GFS2: Remove unreachable calls to vmtruncateSteven Whitehouse
Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30GFS2: fs/gfs2/glock.c: Convert sprintf_symbol to %pSJoe Perches
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30GFS2: Change two WQ_RESCUERs into WQ_MEM_RECLAIMSteven Whitehouse
The WQ_RESCUER flag should only be used internally to the workqueue implementation. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>
2010-11-30xfs: convert xfsbud shrinker to a per-buftarg shrinker.Dave Chinner
Before we introduce per-buftarg LRU lists, split the shrinker implementation into per-buftarg shrinker callbacks. At the moment we wake all the xfsbufds to run the delayed write queues to free the dirty buffers and make their pages available for reclaim. However, with an LRU, we want to be able to free clean, unused buffers as well, so we need to separate the xfsbufd from the shrinker callbacks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-12-16xfs: convert pag_ici_lock to a spin lockDave Chinner
now that we are using RCU protection for the inode cache lookups, the lock is only needed on the modification side. Hence it is not necessary for the lock to be a rwlock as there are no read side holders anymore. Convert it to a spin lock to reflect it's exclusive nature. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-17xfs: convert inode cache lookups to use RCU lockingDave Chinner
With delayed logging greatly increasing the sustained parallelism of inode operations, the inode cache locking is showing significant read vs write contention when inode reclaim runs at the same time as lookups. There is also a lot more write lock acquistions than there are read locks (4:1 ratio) so the read locking is not really buying us much in the way of parallelism. To avoid the read vs write contention, change the cache to use RCU locking on the read side. To avoid needing to RCU free every single inode, use the built in slab RCU freeing mechanism. This requires us to be able to detect lookups of freed inodes, so enѕure that ever freed inode has an inode number of zero and the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit lookup path, but also add a check for a zero inode number as well. We canthen convert all the read locking lockups to use RCU read side locking and hence remove all read side locking. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-12-16xfs: rcu free inodesDave Chinner
Introduce RCU freeing of XFS inodes so that we can convert lookup traversals to use rcu_read_lock() protection. This patch only introduces the RCU freeing to minimise the potential conflicts with mainline if this is merged into mainline via a VFS patchset. It abuses the i_dentry list for the RCU callback structure because the VFS patches make this a union so it is safe to use like this and simplifies and merge issues. This patch uses basic RCU freeing rather than SLAB_DESTROY_BY_RCU. The later lookup patches need the same "found free inode" protection regardless of the RCU freeing method used, so once again the RCU freeing method can be dealt with apprpriately at merge time without affecting any other code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-12-23xfs: don't truncate prealloc from frequently accessed inodesDave Chinner
A long standing problem for streaming writeѕ through the NFS server has been that the NFS server opens and closes file descriptors on an inode for every write. The result of this behaviour is that the ->release() function is called on every close and that results in XFS truncating speculative preallocation beyond the EOF. This has an adverse effect on file layout when multiple files are being written at the same time - they interleave their extents and can result in severe fragmentation. To avoid this problem, keep track of ->release calls made on a dirty inode. For most cases, an inode is only going to be opened once for writing and then closed again during it's lifetime in cache. Hence if there are multiple ->release calls when the inode is dirty, there is a good chance that the inode is being accessed by the NFS server. Hence set a flag the first time ->release is called while there are delalloc blocks still outstanding on the inode. If this flag is set when ->release is next called, then do no truncate away the speculative preallocation - leave it there so that subsequent writes do not need to reallocate the delalloc space. This will prevent interleaving of extents of different inodes written concurrently to the same AG. If we get this wrong, it is not a big deal as we truncate speculative allocation beyond EOF anyway in xfs_inactive() when the inode is thrown out of the cache. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-01-04xfs: dynamic speculative EOF preallocationDave Chinner
Currently the size of the speculative preallocation during delayed allocation is fixed by either the allocsize mount option of a default size. We are seeing a lot of cases where we need to recommend using the allocsize mount option to prevent fragmentation when buffered writes land in the same AG. Rather than using a fixed preallocation size by default (up to 64k), make it dynamic by basing it on the current inode size. That way the EOF preallocation will increase as the file size increases. Hence for streaming writes we are much more likely to get large preallocations exactly when we need it to reduce fragementation. For default settings, the size of the initial extents is determined by the number of parallel writers and the amount of memory in the machine. For 4GB RAM and 4 concurrent 32GB file writes: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576 1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576 2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152 3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304 4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608 5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208 6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088 7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088 and for 16 concurrent 16GB file writes: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144 1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144 2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288 3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576 4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152 5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304 6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608 7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208 Because it is hard to take back specualtive preallocation, cases where there are large slow growing log files on a nearly full filesystem may cause premature ENOSPC. Hence as the filesystem nears full, the maximum dynamic prealloc size іs reduced according to this table (based on 4k block size): freespace max prealloc size >5% full extent (8GB) 4-5% 2GB (8GB >> 2) 3-4% 1GB (8GB >> 3) 2-3% 512MB (8GB >> 4) 1-2% 256MB (8GB >> 5) <1% 128MB (8GB >> 6) This should reduce the amount of space held in speculative preallocation for such cases. The allocsize mount option turns off the dynamic behaviour and fixes the prealloc size to whatever the mount option specifies. i.e. the behaviour is unchanged. Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-12-23xfs: use KM_NOFS for allocations during attribute list operationsDave Chinner
When listing attributes, we are doiing memory allocations under the inode ilock using only KM_SLEEP. This allows memory allocation to recurse back into the filesystem and do writeback, which may the ilock we already hold on the current inode. THis will deadlock. Hence use KM_NOFS for such allocations outside of transaction context to ensure that reclaim recursion does not occur. Reported-by: Nick Piggin <npiggin@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-23xfs: provide a inode iolock lockdep classDave Chinner
The XFS iolock needs to be re-initialised to a new lock class before it enters reclaim to prevent lockdep false positives. Unfortunately, this is not sufficient protection as inodes in the XFS_IRECLAIMABLE state can be recycled and not re-initialised before being reused. We need to re-initialise the lock state when transfering out of XFS_IRECLAIMABLE state to XFS_INEW, but we need to keep the same class as if the inode was just allocated. Hence we need a specific lockdep class variable for the iolock so that both initialisations use the same class. While there, add a specific class for inodes in the reclaim state so that it is easy to tell from lockdep reports what state the inode was in that generated the report. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-16xfs: factor duplicate code in xfs_alloc_ag_vextent_near into a helperChristoph Hellwig
Add a new xfs_alloc_find_best_extent that does a forward/backward search in the allocation btree. That code previously was existed two times in xfs_alloc_ag_vextent_near, once for each search direction. Based on an earlier patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>