summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2011-01-15Add a dentry op to allow processes to be held during pathwalk transitDavid Howells
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it sleep when it tries to transit away from one of that filesystem's directories during a pathwalk. The operation is keyed off a new dentry flag (DCACHE_MANAGE_TRANSIT). The filesystem is allowed to be selective about which processes it holds and which it permits to continue on or prohibits from transiting from each flagged directory. This will allow autofs to hold up client processes whilst letting its userspace daemon through to maintain the directory or the stuff behind it or mounted upon it. The ->d_manage() dentry operation: int (*d_manage)(struct path *path, bool mounting_here); takes a pointer to the directory about to be transited away from and a flag indicating whether the transit is undertaken by do_add_mount() or do_move_mount() skipping through a pile of filesystems mounted on a mountpoint. It should return 0 if successful and to let the process continue on its way; -EISDIR to prohibit the caller from skipping to overmounted filesystems or automounting, and to use this directory; or some other error code to return to the user. ->d_manage() is called with namespace_sem writelocked if mounting_here is true and no other locks held, so it may sleep. However, if mounting_here is true, it may not initiate or wait for a mount or unmount upon the parameter directory, even if the act is actually performed by userspace. Within fs/namei.c, follow_managed() is extended to check with d_manage() first on each managed directory, before transiting away from it or attempting to automount upon it. follow_down() is renamed follow_down_one() and should only be used where the filesystem deliberately intends to avoid management steps (e.g. autofs). A new follow_down() is added that incorporates the loop done by all other callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS and CIFS do use it, their use is removed by converting them to use d_automount()). The new follow_down() calls d_manage() as appropriate. It also takes an extra parameter to indicate if it is being called from mount code (with namespace_sem writelocked) which it passes to d_manage(). follow_down() ignores automount points so that it can be used to mount on them. __follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have that determine whether to abort or not itself. That would allow the autofs daemon to continue on in rcu-walk mode. Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't required as every tranist from that directory will cause d_manage() to be invoked. It can always be set again when necessary. ========================== WHAT THIS MEANS FOR AUTOFS ========================== Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to trigger the automounting of indirect mounts, and both of these can be called with i_mutex held. autofs knows that the i_mutex will be held by the caller in lookup(), and so can drop it before invoking the daemon - but this isn't so for d_revalidate(), since the lock is only held on _some_ of the code paths that call it. This means that autofs can't risk dropping i_mutex from its d_revalidate() function before it calls the daemon. The bug could manifest itself as, for example, a process that's trying to validate an automount dentry that gets made to wait because that dentry is expired and needs cleaning up: mkdir S ffffffff8014e05a 0 32580 24956 Call Trace: [<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897 [<ffffffff80127f7d>] avc_has_perm+0x46/0x58 [<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e [<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b [<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149 [<ffffffff80036d96>] __lookup_hash+0xa0/0x12f [<ffffffff80057a2f>] lookup_create+0x46/0x80 [<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4 versus the automount daemon which wants to remove that dentry, but can't because the normal process is holding the i_mutex lock: automount D ffffffff8014e05a 0 32581 1 32561 Call Trace: [<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b [<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1 [<ffffffff80063c89>] .text.lock.mutex+0xf/0x14 [<ffffffff800e6d55>] do_rmdir+0x77/0xde [<ffffffff8005d229>] tracesys+0x71/0xe0 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 which means that the system is deadlocked. This patch allows autofs to hold up normal processes whilst the daemon goes ahead and does things to the dentry tree behind the automouter point without risking a deadlock as almost no locks are held in d_manage() and none in d_automount(). Signed-off-by: David Howells <dhowells@redhat.com> Was-Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15Add a dentry op to handle automounting rather than abusing follow_link()David Howells
Add a dentry op (d_automount) to handle automounting directories rather than abusing the follow_link() inode operation. The operation is keyed off a new dentry flag (DCACHE_NEED_AUTOMOUNT). This also makes it easier to add an AT_ flag to suppress terminal segment automount during pathwalk and removes the need for the kludge code in the pathwalk algorithm to handle directories with follow_link() semantics. The ->d_automount() dentry operation: struct vfsmount *(*d_automount)(struct path *mountpoint); takes a pointer to the directory to be mounted upon, which is expected to provide sufficient data to determine what should be mounted. If successful, it should return the vfsmount struct it creates (which it should also have added to the namespace using do_add_mount() or similar). If there's a collision with another automount attempt, NULL should be returned. If the directory specified by the parameter should be used directly rather than being mounted upon, -EISDIR should be returned. In any other case, an error code should be returned. The ->d_automount() operation is called with no locks held and may sleep. At this point the pathwalk algorithm will be in ref-walk mode. Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is added to handle mountpoints. It will return -EREMOTE if the automount flag was set, but no d_automount() op was supplied, -ELOOP if we've encountered too many symlinks or mountpoints, -EISDIR if the walk point should be used without mounting and 0 if successful. The path will be updated to point to the mounted filesystem if a successful automount took place. __follow_mount() is replaced by follow_managed() which is more generic (especially with the patch that adds ->d_manage()). This handles transits from directories during pathwalk, including automounting and skipping over mountpoints (and holding processes with the next patch). __follow_mount_rcu() will jump out of RCU-walk mode if it encounters an automount point with nothing mounted on it. follow_dotdot*() does not handle automounts as you don't want to trigger them whilst following "..". I've also extracted the mount/don't-mount logic from autofs4 and included it here. It makes the mount go ahead anyway if someone calls open() or creat(), tries to traverse the directory, tries to chdir/chroot/etc. into the directory, or sticks a '/' on the end of the pathname. If they do a stat(), however, they'll only trigger the automount if they didn't also say O_NOFOLLOW. I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their inodes as automount points. This flag is automatically propagated to the dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could save AFS a private flag bit apiece, but is not strictly necessary. It would be preferable to do the propagation in d_set_d_op(), but that doesn't normally have access to the inode. [AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu() succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after that, rather than just returning with ungrabbed *path] Signed-off-by: David Howells <dhowells@redhat.com> Was-Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15do_lookup() fixAl Viro
do_lookup() has a path leading from LOOKUP_RCU case to non-RCU crossing of mountpoints, which breaks things badly. If we hit need_revalidate: and do nothing in there, we need to come back into LOOKUP_RCU half of things, not to done: in non-RCU one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-14Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: prevent NMI timeouts in cmn_err xfs: Add log level to assertion printk xfs: fix an assignment within an ASSERT() xfs: fix error handling for synchronous writes xfs: add FITRIM support xfs: ensure log covering transactions are synchronous xfs: serialise unaligned direct IOs xfs: factor common write setup code xfs: split buffered IO write path from xfs_file_aio_write xfs: split direct IO write path from xfs_file_aio_write xfs: introduce xfs_rw_lock() helpers for locking the inode xfs: factor post-write newsize updates xfs: factor common post-write isize handling code xfs: ensure sync write errors are returned
2011-01-14Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: restore multiple bd_link_disk_holder() support block cfq: compensate preempted queue even if it has no slice assigned block cfq: make queue preempt work for queues from different workload
2011-01-14Turn d_set_d_op() BUG_ON() into WARN_ON_ONCE()Linus Torvalds
It's indicative of a real problem, and it actually triggers with autofs4, but the BUG_ON() is excessive. The autofs4 case is being fixed (to only set d_op in the ->lookup method) but not merged yet. In the meantime this gets the code limping along. Reported-by: Alex Elder <aelder@sgi.com> Cc: Ian Kent <raven@themaw.net> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14Merge branch 'for-2.6.38' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux: (62 commits) nfsd4: fix callback restarting nfsd: break lease on unlink, link, and rename nfsd4: break lease on nfsd setattr nfsd: don't support msnfs export option nfsd4: initialize cb_per_client nfsd4: allow restarting callbacks nfsd4: simplify nfsd4_cb_prepare nfsd4: give out delegations more quickly in 4.1 case nfsd4: add helper function to run callbacks nfsd4: make sure sequence flags are set after destroy_session nfsd4: re-probe callback on connection loss nfsd4: set sequence flag when backchannel is down nfsd4: keep finer-grained callback status rpc: allow xprt_class->setup to return a preexisting xprt rpc: keep backchannel xprt as long as server connection rpc: move sk_bc_xprt to svc_xprt nfsd4: allow backchannel recovery nfsd4: support BIND_CONN_TO_SESSION nfsd4: modify session list under cl_lock Documentation: fl_mylease no longer exists ... Fix up conflicts in fs/nfsd/vfs.c with the vfs-scale work. The vfs-scale work touched some msnfs cases, and this merge removes support for that entirely, so the conflict was trivial to resolve.
2011-01-14nfsd4: fix callback restartingJ. Bruce Fields
Ensure a new callback is added to the client's list of callbacks at most once. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-14cifs: add cruid= mount optionJeff Layton
In commit 3e4b3e1f we separated the "uid" mount option such that it no longer determined the owner of the credential cache by default. When we did this, we added a new option to cifs.upcall (--legacy-uid) to try to make it so that it would behave the same was as it did before. This ignored a rather important point -- the kernel has no way to know what options are being passed to cifs.upcall, so it doesn't know what uid it should use to determine whether to match an existing krb5 session. The simplest solution is to simply add a new "cruid=" mount option that only governs the uid owner of the credential cache for the mount. Unfortunately, this means that the --legacy-uid option in cifs.upcall was ill-considered and is now useless, but I don't see a better way to deal with this. A patch for the mount.cifs manpage will follow once this patch has been accepted. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-14cifs: cFYI the entire error code in map_smb_to_linux_errorJeff Layton
We currently only print the DOS error part. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-14block: restore multiple bd_link_disk_holder() supportTejun Heo
Commit e09b457b (block: simplify holder symlink handling) incorrectly assumed that there is only one link at maximum. dm may use multiple links and expects block layer to track reference count for each link, which is different from and unrelated to the exclusive device holder identified by @holder when the device is opened. Remove the single holder assumption and automatic removal of the link and revive the per-link reference count tracking. The code essentially behaves the same as before commit e09b457b sans the unnecessary kobject reference count dancing. While at it, note that this facility should not be used by anyone else than the current ones. Sysfs symlinks shouldn't be abused like this and the whole thing doesn't belong in the block layer at all. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Milan Broz <mbroz@redhat.com> Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-14afs: add afs_wq and use it instead of the system workqueueTejun Heo
flush_scheduled_work() is going away. afs needs to make sure all the works it has queued have finished before being unloaded and there can be arbitrary number of pending works. Add afs_wq and use it as the flush domain instead of the system workqueue. Also, convert cancel_delayed_work() + flush_scheduled_work() to cancel_delayed_work_sync() in afs_mntpt_kill_timer(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: linux-afs@lists.infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14FS-Cache: Fix operation handlingAkshat Aranya
fscache_submit_exclusive_op() adds an operation to the pending list if other operations are pending. Fix the check for pending ops as n_ops must be greater than 0 at the point it is checked as it is incremented immediately before under lock. Signed-off-by: Akshat Aranya <aranya@nec-labs.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14Merge branch 'vfs-scale-working' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: kernel: fix hlist_bl again cgroups: Fix a lockdep warning at cgroup removal fs: namei fix ->put_link on wrong inode in do_filp_open
2011-01-14fs: namei fix ->put_link on wrong inode in do_filp_openNick Piggin
J. R. Okajima noticed that ->put_link is being attempted on the wrong inode, and suggested the way to fix it. I changed it a bit according to Al's suggestion to keep an explicit link path around. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-13Merge branch 'vfs-scale-working' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: fs: fix do_last error case when need_reval_dot nfs: add missing rcu-walk check fs: hlist UP debug fixup fs: fix dropping of rcu-walk from force_reval_path fs: force_reval_path drop rcu-walk before d_invalidate fs: small rcu-walk documentation fixes Fixed up trivial conflicts in Documentation/filesystems/porting
2011-01-14fs: fix do_last error case when need_reval_dotJ. R. Okajima
When open(2) without O_DIRECTORY opens an existing dir, it should return EISDIR. In do_last(), the variable 'error' is initialized EISDIR, but it is changed by d_revalidate() which returns any positive to represent 'the target dir is valid.' Should we keep and return the initialized 'error' in this case. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14nfs: add missing rcu-walk checkNick Piggin
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14fs: fix dropping of rcu-walk from force_reval_pathNick Piggin
As J. R. Okajima noted, force_reval_path passes in the same dentry to d_revalidate as the one in the nameidata structure (other callers pass in a child), so the locking breaks. This can oops with a chrooted nfs mount, for example. Similarly there can be other problems with revalidating a dentry which is already in nameidata of the path walk. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14fs: force_reval_path drop rcu-walk before d_invalidateNick Piggin
d_revalidate can return in rcu-walk mode even when it returns 0. We can't just call any old dcache function on rcu-walk dentry (the dentry is unstable, so even through d_lock can safely be taken, the result may no longer be what we expect -- careful re-checks would be required). So just drop rcu in this case. (I missed this conversion when switching to the rcu-walk convention that Linus suggested) Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-13nfsd: break lease on unlink, link, and renameJ. Bruce Fields
Any change to any of the links pointing to an entry should also break delegations. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13nfsd4: break lease on nfsd setattrJ. Bruce Fields
Leases (delegations) should really be broken on any metadata change, not just on size change. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13nfsd: don't support msnfs export optionJ. Bruce Fields
We've long had these pointless #ifdef MSNFS's sprinkled throughout the code--pointless because MSNFS is always defined (and we give no config option to make that easy to change). So we could just remove the ifdef's and compile the resulting code unconditionally. But as long as we're there: why not just rip out this code entirely? The only purpose is to implement the "msnfs" export option which turns on Windows-like behavior in some cases, and: - the export option isn't documented anywhere; - the userland utilities (which would need to be able to parse "msnfs" in an export file) don't support it; - I don't know how to maintain this, as I don't know what the proper behavior is; and - google shows no evidence that anyone has ever used this. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13nfsd4: initialize cb_per_clientJ. Bruce Fields
Otherwise a callback that is aborted before it runs will result in a list_del on an uninitialized list head. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13fs/fs-writeback.c: fix sync_inodes_sb() return value kernel-docStefan Hajnoczi
The sync_inodes_sb() function does not have a return value. Remove the outdated documentation comment. Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: remove PG_buddyAndrea Arcangeli
PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can be added to page->flags without overflowing (because of the sparse section bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also has to move the memory hotplug code from _mapcount to lru.next to avoid any risk of clashes. We can't use lru.next for PG_buddy removal, but memory hotplug can use lru.next even more easily than the mapcount instead. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: transparent hugepage vmstatAndrea Arcangeli
Add hugepage stat information to /proc/vmstat and /proc/meminfo. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj downMandeep Singh Baines
We'd like to be able to oom_score_adj a process up/down as it enters/leaves the foreground. Currently, it is not possible to oom_adj down without CAP_SYS_RESOURCE. This patch allows a task to decrease its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to or its inherited value at fork. Assuming the thread that has forked it has oom_score_adj of 0, each process could decrease it back from 0 upon activation unless a CAP_SYS_RESOURCE thread elevated it to something higher. Alternative considered: * a setuid binary * a daemon with CAP_SYS_RESOURCE Since you don't wan't all processes to be able to reduce their oom_adj, a setuid or daemon implementation would be complex. The alternatives also have much higher overhead. This patch updated from original patch based on feedback from David Rientjes. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13mm: smaps: export mlock informationNikanth Karthikesan
Currently there is no way to find whether a process has locked its pages in memory or not. And which of the memory regions are locked in memory. Add a new field "Locked" to export this information via the smaps file. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13fs/mpage.c: consolidate codeHai Shan
Merge mpage_end_io_read() and mpage_end_io_write() into mpage_end_io() to eliminate code duplication. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hai Shan <shan.hai@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13sync_inode_metadata: fix commentAndrew Morton
Use correct function name, remove incorrect apostrophe Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13writeback: avoid livelocking WB_SYNC_ALL writebackJan Kara
When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is usually set to LONG_MAX. The logic in wb_writeback() then calls __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we easily end up with non-positive nr_to_write after the function returns, if the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment. When nr_to_write is <= 0 wb_writeback() decides we need another round of writeback but this is wrong in some cases! For example when a single large file is continuously dirtied, we would never finish syncing it because each pass would be able to write MAX_WRITEBACK_PAGES and inode dirty timestamp never gets updated (as inode is never completely clean). Thus __writeback_inodes_sb() would write the redirtied inode again and again. Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We do not need nr_to_write in WB_SYNC_ALL mode anyway since write_cache_pages() does livelock avoidance using page tagging in WB_SYNC_ALL mode. This makes wb_writeback() call __writeback_inodes_sb() only once on WB_SYNC_ALL. The latter function won't livelock because it works on - a finite set of files by doing queue_io() once at the beginning - a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no longer able to stall sync forever. [fengguang.wu@intel.com: fix locking comment] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13writeback: stop background/kupdate works from livelocking other worksJan Kara
Background writeback is easily livelockable in a loop in wb_writeback() by a process continuously re-dirtying pages (or continuously appending to a file). This is in fact intended as the target of background writeback is to write dirty pages it can find as long as we are over dirty_background_threshold. But the above behavior gets inconvenient at times because no other work queued in the flusher thread's queue gets processed. In particular, since e.g. sync(1) relies on flusher thread to do all the IO for it, sync(1) can hang forever waiting for flusher thread to do the work. Generally, when a flusher thread has some work queued, someone submitted the work to achieve a goal more specific than what background writeback does. Moreover by working on the specific work, we also reduce amount of dirty pages which is exactly the target of background writeout. So it makes sense to give specific work a priority over a generic page cleaning. Thus we interrupt background writeback if there is some other work to do. We return to the background writeback after completing all the queued work. This may delay the writeback of expired inodes for a while, however the expired inodes will eventually be flushed to disk as long as the other works won't livelock. [fengguang.wu@intel.com: update comment] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13writeback: trace wakeup event for background writebackWu Fengguang
This tracks when balance_dirty_pages() tries to wakeup the flusher thread for background writeback (if it was not started already). Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13writeback: integrated background writeback workJan Kara
Check whether background writeback is needed after finishing each work. When bdi flusher thread finishes doing some work check whether any kind of background writeback needs to be done (either because dirty_background_ratio is exceeded or because we need to start flushing old inodes). If so, just do background write back. This way, bdi_start_background_writeback() just needs to wake up the flusher thread. It will do background writeback as soon as there is no other work. This is a preparatory patch for the next patch which stops background writeback as soon as there is other work to do. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13ecryptfs: fix broken buildLinus Torvalds
Stephen Rothwell reports that the vfs merge broke the build of ecryptfs. The breakage comes from commit 66cb76666d69 ("sanitize ecryptfs ->mount()") which was obviously not even build tested. Tssk, tssk, Al. This is the minimal build fixup for the situation, although I don't have a filesystem to actually test it with. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13ceph: fix xattr rbtree searchSage Weil
Fix xattr name comparison in rbtree search for strings that share a prefix. The *name argument is null terminated, but the xattr name is not, so we need to use strncmp, but that means adjusting for the case where name is a prefix of xattr->name. The corresponding case in __set_xattr() already handles this properly (although in that case *name is also not null terminated). Reported-by: Sergiy Kibrik <sakib@meta.ua> Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13ceph: fix getattr on directory when using norbytesYehuda Sadeh
The norbytes mount option was broken, and when doing getattr on a directory it return the rbytes instead of the number of entities. This commit fixes it. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13Squashfs: simplify CONFIG_SQUASHFS_LZO handlingPhillip Lougher
Get rid of messy repeated #if(n)def CONFIG_SQUASHFS_LZO code in decompressor.c Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13Squashfs: move squashfs_i() definition from squashfs.hPhillip Lougher
Move squashfs_i() definition out of squashfs.h, this eliminates the need to #include squashfs_fs_i.h from numerous files. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13Squashfs: get rid of default n in KconfigPhillip Lougher
As pointed out by Geert Uytterhoeven, "default n" is the default, no reason to specify it. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13Squashfs: add missing check in zlib_wrapperPhillip Lougher
On file system corruption zlib can return Z_STREAM_OK with input buffers remaining, which will not be released. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13Squashfs: remove unnecessary variable in zlib_wrapperPhillip Lougher
Get rid of unnecessary bytes variable, and remove redundant initialisation of zlib_err. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13Squashfs: Add XZ compression configuration optionPhillip Lougher
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13Squashfs: add XZ compression supportPhillip Lougher
Add support for reading file systems compressed with the XZ compression algorithm. This patch adds the XZ decompressor wrapper code. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13NFS: Fix NFSv3 exclusive open semanticsTrond Myklebust
Commit c0204fd2b8fe047b18b67e07e1bf2a03691240cd (NFS: Clean up nfs4_proc_create()) broke NFSv3 exclusive open by removing the code that passes the O_EXCL flag down to nfs3_proc_create(). This patch reverts that offending hunk from the original commit. Reported-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org [2.6.37] Tested-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits) block: ensure that completion error gets properly traced blktrace: add missing probe argument to block_bio_complete block cfq: don't use atomic_t for cfq_group block cfq: don't use atomic_t for cfq_queue block: trace event block fix unassigned field block: add internal hd part table references block: fix accounting bug on cross partition merges kref: add kref_test_and_get bio-integrity: mark kintegrityd_wq highpri and CPU intensive block: make kblockd_workqueue smarter Revert "sd: implement sd_check_events()" block: Clean up exit_io_context() source code. Fix compile warnings due to missing removal of a 'ret' variable fs/block: type signature of major_to_index(int) to major_to_index(unsigned) block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) cfq-iosched: don't check cfqg in choose_service_tree() fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors cdrom: export cdrom_check_events() sd: implement sd_check_events() sr: implement sr_check_events() ...
2011-01-13Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits) fs: add documentation on fallocate hole punching Gfs2: fail if we try to use hole punch Btrfs: fail if we try to use hole punch Ext4: fail if we try to use hole punch Ocfs2: handle hole punching via fallocate properly XFS: handle hole punching via fallocate properly fs: add hole punching to fallocate vfs: pass struct file to do_truncate on O_TRUNC opens (try #2) fix signedness mess in rw_verify_area() on 64bit architectures fs: fix kernel-doc for dcache::prepend_path fs: fix kernel-doc for dcache::d_validate sanitize ecryptfs ->mount() switch afs move internal-only parts of ncpfs headers to fs/ncpfs switch ncpfs switch 9p pass default dentry_operations to mount_pseudo() switch hostfs switch affs switch configfs ...
2011-01-13Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: fix cleanup when trying to mount inexistent image net/ceph: make ceph_msgr_wq non-reentrant ceph: fsc->*_wq's aren't used in memory reclaim path ceph: Always free allocated memory in osdmap_decode() ceph: Makefile: Remove unnessary code ceph: associate requests with opening sessions ceph: drop redundant r_mds field ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS ceph: add dir_layout to inode
2011-01-13Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) Documentation/trace/events.txt: Remove obsolete sched_signal_send. writeback: fix global_dirty_limits comment runtime -> real-time ppc: fix comment typo singal -> signal drivers: fix comment typo diable -> disable. m68k: fix comment typo diable -> disable. wireless: comment typo fix diable -> disable. media: comment typo fix diable -> disable. remove doc for obsolete dynamic-printk kernel-parameter remove extraneous 'is' from Documentation/iostats.txt Fix spelling milisec -> ms in snd_ps3 module parameter description Fix spelling mistakes in comments Revert conflicting V4L changes i7core_edac: fix typos in comments mm/rmap.c: fix comment sound, ca0106: Fix assignment to 'channel'. hrtimer: fix a typo in comment init/Kconfig: fix typo anon_inodes: fix wrong function name in comment fix comment typos concerning "consistent" poll: fix a typo in comment ... Fix up trivial conflicts in: - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c) - fs/ext4/ext4.h Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.