linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2009-04-28	NFSD: Stricter buffer size checking in write_versions()	Chuck Lever
	While it's not likely today that there are enough NFS versions to overflow the output buffer in write_versions(), we should be more careful about detecting the end of the buffer. The number of NFS versions will only increase as NFSv4 minor versions are added. Note that this API doesn't behave the same as portlist. Here we attempt to display as many versions as will fit in the buffer, and do not provide any indication that an overflow would have occurred. I don't have any good rationale for that. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	NFSD: Stricter buffer size checking in write_recoverydir()	Chuck Lever
	While it's not likely a pathname will be longer than SIMPLE_TRANSACTION_SIZE, we should be more careful about just plopping it into the output buffer without bounds checking. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	SUNRPC: pass buffer size to svc_sock_names()	Chuck Lever
	Adjust the synopsis of svc_sock_names() to pass in the size of the output buffer. Add a documenting comment. This is a cosmetic change for now. A subsequent patch will make sure the buffer length is passed to one_sock_name(), where the length will actually be useful. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	SUNRPC: pass buffer size to svc_addsock()	Chuck Lever
	Adjust the synopsis of svc_addsock() to pass in the size of the output buffer. Add a documenting comment. This is a cosmetic change for now. A subsequent patch will make sure the buffer length is passed to one_sock_name(), where the length will actually be useful. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	NFSD: Prevent a buffer overflow in svc_xprt_names()	Chuck Lever
	The svc_xprt_names() function can overflow its buffer if it's so near the end of the passed in buffer that the "name too long" string still doesn't fit. Of course, it could never tell if it was near the end of the passed in buffer, since its only caller passes in zero as the buffer length. Let's make this API a little safer. Change svc_xprt_names() so it always checks for a buffer overflow, and change its only caller to pass in the correct buffer length. If svc_xprt_names() does overflow its buffer, it now fails with an ENAMETOOLONG errno, instead of trying to write a message at the end of the buffer. I don't like this much, but I can't figure out a clean way that's always safe to return some of the names, and an indication that the buffer was not long enough. The displayed error when doing a 'cat /proc/fs/nfsd/portlist' is "File name too long". Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	NFSD: move lockd_up() before svc_addsock()	Chuck Lever
	Clean up. A couple of years ago, a series of commits, finishing with commit 5680c446, swapped the order of the lockd_up() and svc_addsock() calls in __write_ports(). At that time lockd_up() needed to know the transport protocol of the passed-in socket to start a listener on the same transport protocol. These days, lockd_up() doesn't take a protocol argument; it always starts both a UDP and TCP listener. It's now more straightforward to try the lockd_up() first, then do a lockd_down() if the svc_addsock() fails. Careful review of this code shows that the svc_sock_names() call is used only to close the just-opened socket in case lockd_up() fails. So it is no longer needed if lockd_up() is done first. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	NFSD: Finish refactoring __write_ports()	Chuck Lever
	Clean up: Refactor transport name listing out of __write_ports() to make it easier to understand and maintain. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	NFSD: Note an additional requirement when passing TCP sockets to portlist	Chuck Lever
	User space must call listen(3) on SOCK_STREAM sockets passed into /proc/fs/nfsd/portlist, otherwise that listener is ignored. Document this. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	NFSD: Refactor socket creation out of __write_ports()	Chuck Lever
	Clean up: Refactor the socket creation logic out of __write_ports() to make it easier to understand and maintain. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	NFSD: Refactor portlist socket closing into a helper	Chuck Lever
	Clean up: Refactor the socket closing logic out of __write_ports() to make it easier to understand and maintain. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	NFSD: Refactor transport addition out of __write_ports()	Chuck Lever
	Clean up: Refactor transport addition out of __write_ports() to make it easier to understand and maintain. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	NFSD: Refactor transport removal out of __write_ports()	Chuck Lever
	Clean up: Refactor transport removal out of __write_ports() to make it easier to understand and maintain. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28	fuse: export symbols to be used by CUSE	Tejun Heo
	Export the following symbols for CUSE. fuse_conn_put() fuse_conn_get() fuse_conn_kill() fuse_send_init() fuse_do_open() fuse_sync_release() fuse_direct_io() fuse_do_ioctl() fuse_file_poll() fuse_request_alloc() fuse_get_req() fuse_put_request() fuse_request_send() fuse_abort_conn() fuse_dev_release() fuse_dev_operations Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: update fuse_conn_init() and separate out fuse_conn_kill()	Tejun Heo
	Update fuse_conn_init() such that it doesn't take @sb and move bdi registration into a separate function. Also separate out fuse_conn_kill() from fuse_put_super(). These will be used to implement cuse. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: don't use inode in fuse_file_poll	Miklos Szeredi
	Use ff->fc and ff->nodeid instead of file->f_dentry->d_inode in the fuse_file_poll() implementation. This prepares this function for use by CUSE, where the inode is not owned by a fuse filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: don't use inode in fuse_do_ioctl() helper	Miklos Szeredi
	Create a helper for sending an IOCTL request that doesn't use a struct inode. This prepares this function for use by CUSE, where the inode is not owned by a fuse filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: don't use inode in fuse_sync_release()	Miklos Szeredi
	Make fuse_sync_release() a generic helper function that doesn't need a struct inode pointer. This makes it suitable for use by CUSE. Change return value of fuse_release_common() from int to void. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: create fuse_do_open() helper for CUSE	Miklos Szeredi
	Create a helper for sending an OPEN request that doesn't need a struct inode pointer. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: clean up args in fuse_finish_open() and fuse_release_fill()	Miklos Szeredi
	Move setting ff->fh, ff->nodeid and file->private_data outside fuse_finish_open(). Add ->open_flags member to struct fuse_file. This simplifies the argument passing to fuse_finish_open() and fuse_release_fill(), and paves the way for creating an open helper that doesn't need an inode pointer. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: don't use inode in helpers called by fuse_direct_io()	Miklos Szeredi
	Use ff->fc and ff->nodeid instead of passing down the inode. This prepares this function for use by CUSE, where the inode is not owned by a fuse filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: add members to struct fuse_file	Miklos Szeredi
	Add new members ->fc and ->nodeid to struct fuse_file. This will aid in converting functions for use by CUSE, where the inode is not owned by a fuse filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: prepare fuse_direct_io() for CUSE	Miklos Szeredi
	Move code operating on the inode out from fuse_direct_io(). This prepares this function for use by CUSE, where the inode is not owned by a fuse filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: clean up fuse_write_fill()	Miklos Szeredi
	Move out code from fuse_write_fill() which is not common to all callers. Remove two function arguments which become unnecessary. Also remove unnecessary memset(), the request is already initialized to zero. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: use struct path in release structure	Miklos Szeredi
	Use struct path instead of separate dentry and vfsmount in req->misc.release. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	fuse: destroy bdi on error	Miklos Szeredi
	Destroy bdi on error in fuse_fill_super(). This was an omission from commit 26c3679101dbccc054dcf370143941844ba70531 "fuse: destroy bdi on umount", which moved the bdi_destroy() call from fuse_conn_put() to fuse_put_super(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org
2009-04-28	fuse: misc cleanups	Tejun Heo
	* fuse_file_alloc() was structured in weird way. The success path was split between else block and code following the block. Restructure the code such that it's easier to read and modify. * Unindent success path of fuse_release_common() to ease future changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28	block: implement blkdev_readpages	Jeff Moyer
	Doing a proper block dev ->readpages() speeds up the crazy dump(8) approach of using interleaved process IO. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-27	ext4: avoid unnecessary spinlock in critical POSIX ACL path	Theodore Ts'o
	If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem to do POSIX ACL checks on any files not owned by the caller, and it does this for every single pathname component that it looks up. That obviously can be pretty expensive if the filesystem isn't careful about it, especially with locking. That's doubly sad, since the common case tends to be that there are no ACL's associated with the files in question. ext4 already caches the ACL data so that it doesn't have to look it up over and over again, but it does so by taking the inode->i_lock spinlock on every lookup. Which is a noticeable overhead even if it's a private lock, especially on CPU's where the serialization is expensive (eg Intel Netburst aka 'P4'). For the special case of not actually having any ACL's, all that locking is unnecessary. Even if somebody else were to be changing the ACL's on another CPU, we simply don't care - if we've seen a NULL ACL, we might as well use it. So just load the ACL speculatively without any locking, and if it was NULL, just use it. If it's non-NULL (either because we had a cached entry, or because the cache hasn't been filled in at all), it means that we'll need to get the lock and re-load it properly. (This commit was ported from a patch originally authored by Linus for ext3.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-27	ext3: avoid unnecessary spinlock in critical POSIX ACL path	Linus Torvalds
	If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem to do POSIX ACL checks on any files not owned by the caller, and it does this for every single pathname component that it looks up. That obviously can be pretty expensive if the filesystem isn't careful about it, especially with locking. That's doubly sad, since the common case tends to be that there are no ACL's associated with the files in question. ext3 already caches the ACL data so that it doesn't have to look it up over and over again, but it does so by taking the inode->i_lock spinlock on every lookup. Which is a noticeable overhead even if it's a private lock, especially on CPU's where the serialization is expensive (eg Intel Netburst aka 'P4'). For the special case of not actually having any ACL's, all that locking is unnecessary. Even if somebody else were to be changing the ACL's on another CPU, we simply don't care - if we've seen a NULL ACL, we might as well use it. So just load the ACL speculatively without any locking, and if it was NULL, just use it. If it's non-NULL (either because we had a cached entry, or because the cache hasn't been filled in at all), it means that we'll need to get the lock and re-load it properly. This is noticeable even on Nehalem, which does locking quite well (much better than P4). From lmbench: Processor, Processes - times in microseconds - smaller is better -------------------------------------------------------------------- Host OS Mhz null null open slct fork exec sh call I/O stat clos TCP proc proc proc --------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- - before: nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.45 2.18 69.1 273. 1141 nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.48 2.28 69.9 253. 1140 nehalem.l Linux 2.6.30- 3193 0.04 0.10 0.95 1.42 2.19 68.6 284. 1141 - after: nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.44 2.12 68.3 282. 1094 nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.20 67.0 308. 1123 nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.36 67.4 293. 1148 where you can see what appears to be a roughly 3% improvement in stat and open/close latencies from just the removal of the locking overhead. Of course, this only matters for files you don't own (the owner never needs to do the ACL checks), but that's the common case for libraries, header files, and executables. As well as for the base components of any absolute pathname, even if you are the owner of the final file. [ At some point we probably want to move this ACL caching logic entirely into the VFS layer (and only call down to the filesystem when uncached), but in the meantime this improves ext3 a bit. A similar fix to btrfs makes a much bigger difference (15x improvement in lmbench) due to broken caching. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17	ext4: convert instrumentation from markers to tracepoints	Theodore Ts'o
	Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17	jbd2: convert instrumentation from markers to tracepoints	Theodore Ts'o
	Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-27	eCryptfs: Fix min function comparison warning	Tyler Hicks
	This warning shows up on 64 bit builds: fs/ecryptfs/inode.c:693: warning: comparison of distinct pointer types lacks a cast Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-27	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable	Linus Torvalds
	* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: look for acls during btrfs_read_locked_inode Btrfs: fix acl caching Btrfs: Fix a bunch of printk() warnings. Btrfs: Fix a trivial warning using max() of u64 vs ULL. Btrfs: remove unused btrfs_bit_radix slab Btrfs: ratelimit IO error printks Btrfs: remove #if 0 code Btrfs: When shrinking, only update disk size on success Btrfs: fix deadlocks and stalls on dead root removal Btrfs: fix fallocate deadlock on inode extent lock Btrfs: kill btrfs_cache_create Btrfs: don't export symbols Btrfs: simplify makefile Btrfs: try to keep a healthy ratio of metadata vs data block groups
2009-04-27	ecryptfs: fix printk format warning	Randy Dunlap
	fs/ecryptfs/inode.c:670: warning: format '%d' expects type 'int', but argument 3 has type 'size_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Cc: Dustin Kirkland <kirkland@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2009-04-27	Btrfs: look for acls during btrfs_read_locked_inode	Chris Mason
	This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items. If it is certain a given inode has no acls, it will set the in memory acl fields to null to avoid acl lookups completely. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27	Btrfs: fix acl caching	Chris Mason
	Linus noticed the btrfs code to cache acls wasn't properly caching a NULL acl when the inode didn't have any acls. This meant the common case of no acls resulted in expensive btree searches every time the kernel checked permissions (which is quite often). This is a modified version of Linus' original patch: Properly set initial acl fields to BTRFS_ACL_NOT_CACHED in the inode. This forces an acl lookup when permission checks are done. Fix btrfs_get_acl to avoid lookups and locking when the inode acls fields are set to null. Fix btrfs_get_acl to use the right return value from __btrfs_getxattr when deciding to cache a NULL acl. It was storing a NULL acl when __btrfs_getxattr return -ENOENT, but __btrfs_getxattr was actually returning -ENODATA for this case. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27	Merge branch 'for_linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: ext2: missing unlock in ext2_quota_write() quota: remove obsolete comments in fs/quota/Makefile
2009-04-27	ext2: missing unlock in ext2_quota_write()	Dan Carpenter
	The inode->i_mutex should be unlocked. Found by smatch (http://repo.or.cz/w/smatch.git). Compile tested. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-04-27	quota: remove obsolete comments in fs/quota/Makefile	Christoph Hellwig
	Get rid of useless comments and the equally useless obj-y initialization. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2009-04-27	Btrfs: Fix a bunch of printk() warnings.	Joel Becker
	Just happened to notice a bunch of %llu vs u64 warnings. Here's a patch to cast them all. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27	Btrfs: Fix a trivial warning using max() of u64 vs ULL.	Joel Becker
	A small warning popped up on ia64 because inode-map.c was comparing a u64 object id with the ULL FIRST_FREE_OBJECTID. My first thought was that all the OBJECTID constants should contain the u64 cast because btrfs code deals entirely in u64s. But then I saw how large that was, and figured I'd just fix the max() call. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27	Btrfs: remove unused btrfs_bit_radix slab	Chris Mason
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27	Btrfs: ratelimit IO error printks	Chris Mason
	Btrfs has printks for various IO errors, including bad checksums and mismatches between what we expect the block headers to contain and what we actually find on the disk. Longer term we need a real reporting mechanism for this, but for now printk is going to have to do. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27	Btrfs: remove #if 0 code	Chris Mason
	Btrfs had some old code sitting around under #if 0, this drops it. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27	Btrfs: When shrinking, only update disk size on success	Chris Ball
	Previously, we updated a device's size prior to attempting a shrink operation. This patch moves the device resizing logic to only happen if the shrink completes successfully. In the process, it introduces a new field to btrfs_device -- disk_total_bytes -- to track the on-disk size. Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-25	ext4: Replace lock/unlock_super() with an explicit lock for resizing	Theodore Ts'o
	Use a separate lock to protect s_groups_count and the other block group descriptors which get changed via an on-line resize operation, so we can stop overloading the use of lock_super(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25	ext4: Replace lock/unlock_super() with an explicit lock for the orphan list	Theodore Ts'o
	Use a separate lock to protect the orphan list, so we can stop overloading the use of lock_super(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01	ext4: ext4_mark_recovery_complete() doesn't need to use lock_super	Theodore Ts'o
	The function ext4_mark_recovery_complete() is called from two call paths: either (a) while mounting the filesystem, in which case there's no danger of any other CPU calling write_super() until the mount is completed, and (b) while remounting the filesystem read-write, in which case the fs core has already locked the superblock. This also allows us to take out a very vile unlock_super()/lock_super() pair in ext4_remount(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25	ext4: Remove outdated comment about lock_super()	Theodore Ts'o
	ext4_fill_super() is no longer called by read_super(), and it is no longer called with the superblock locked. The unlock_super()/lock_super() is no longer present, so this comment is entirely superfluous. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01	ext4: Avoid races caused by on-line resizing and SMP memory reordering	Theodore Ts'o
	Ext4's on-line resizing adds a new block group and then, only at the last step adjusts s_groups_count. However, it's possible on SMP systems that another CPU could see the updated the s_group_count and not see the newly initialized data structures for the just-added block group. For this reason, it's important to insert a SMP read barrier after reading s_groups_count and before reading any (for example) the new block group descriptors allowed by the increased value of s_groups_count. Unfortunately, we rather blatently violate this locking protocol documented in fs/ext4/resize.c. Fortunately, (1) on-line resizes happen relatively rarely, and (2) it seems rare that the filesystem code will immediately try to use just-added block group before any memory ordering issues resolve themselves. So apparently problems here are relatively hard to hit, since ext3 has been vulnerable to the same issue for years with no one apparently complaining. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>