path: root/fs
2011-05-25  ext4: Convert ext4 to new truncate calling convention  (Jan Kara)
Trivial conversion. Fix up one error handling case calling vmtruncate() and remove the ->truncate callback. We also fix a bug where IS_IMMUTABLE and IS_APPEND files could not be truncated during failed writes. In fact, the test can be completely removed, as upper layers do the necessary permission checks for truncate in do_sys_[f]truncate() and may_open() anyway. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
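For context, the "new truncate calling convention" is the one where ->truncate disappears and the filesystem truncates from its ->setattr method; a minimal sketch of that shape, with error handling elided (this is illustrative, not the verbatim ext4 change):

    #include <linux/fs.h>
    #include <linux/mm.h>

    static int ext4_setattr(struct dentry *dentry, struct iattr *attr)
    {
            struct inode *inode = dentry->d_inode;

            if ((attr->ia_valid & ATTR_SIZE) &&
                attr->ia_size != i_size_read(inode)) {
                    /* update i_size and truncate the page cache ... */
                    truncate_setsize(inode, attr->ia_size);
                    /* ... then free the blocks; no ->truncate callback */
                    ext4_truncate(inode);
            }

            setattr_copy(inode, attr);
            mark_inode_dirty(inode);
            return 0;
    }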
2011-05-25  cifs: add cifs_async_writev  (Jeff Layton)
Add the ability for CIFS to do an asynchronous write. The kernel will set the frame up as it would for a "normal" SMBWrite2 request, and use cifs_call_async to send it. The mid callback will then be configured to handle the result. Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-25  cifs: clean up wsize negotiation and allow for larger wsize  (Jeff Layton)
Now that we can handle larger wsizes in writepages, fix up the negotiation of the wsize to allow for that. find_get_pages only seems to give out a max of 256 pages at a time, so that gives us a reasonable default of 1M for the wsize. If the server however does not support large writes via POSIX extensions, then we cap the wsize to (128k - PAGE_CACHE_SIZE). That gives us a size that goes up to the max frame size specified in RFC1001. Finally, if CAP_LARGE_WRITE_AND_X isn't set, then further cap it to the largest size allowed by the protocol (USHRT_MAX). Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>
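The negotiation described above reduces to a chain of caps; a condensed sketch, assuming hypothetical variable names for the server capabilities (PAGE_CACHE_SIZE was the page-cache alias for PAGE_SIZE in this era):

    /* 1M default: find_get_pages() hands out at most 256 pages at a time */
    unsigned int wsize = 1024 * 1024;

    /* no POSIX large-write support: stay within the largest RFC1001 frame */
    if (!posix_large_writes)
            wsize = min_t(unsigned int, wsize, 128 * 1024 - PAGE_CACHE_SIZE);

    /* no CAP_LARGE_WRITE_AND_X: cap at the largest size the protocol allows */
    if (!(server_capabilities & CAP_LARGE_WRITE_AND_X))
            wsize = min_t(unsigned int, wsize, USHRT_MAX);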
2011-05-25  cifs: convert cifs_writepages to use async writes  (Jeff Layton)
Have cifs_writepages issue asynchronous writes instead of waiting on each write call to complete before issuing another. This also allows us to return more quickly from writepages. It can just send out all of the I/Os and not wait around for the replies. In the WB_SYNC_ALL case, if the write completes with a retryable error, then the completion workqueue job will resend the write. This also changes the page locking semantics a little bit. Instead of holding the page lock until the response is received, release it after doing the send. This will reduce contention for the page lock and should prevent processes that have the file mmap'ed from being blocked unnecessarily. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-25  CIFS: Fix undefined behavior when mount fails  (Pavel Shilovsky)
Fix double kfree() calls on the same pointers and cleanup mount code. Reviewed-and-Tested-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-25  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client  (Linus Torvalds)
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
  ceph: fix cap flush race reentrancy
  libceph: subscribe to osdmap when cluster is full
  libceph: handle new osdmap down/state change encoding
  rbd: handle online resize of underlying rbd image
  ceph: avoid inode lookup on nfs fh reconnect
  ceph: use LOOKUPINO to make unconnected nfs fh more reliable
  rbd: use snprintf for disk->disk_name
  rbd: cleanup: make kfree match kmalloc
  rbd: warn on update_snaps failure on notify
  ceph: check return value for start_request in writepages
  ceph: remove useless check
  libceph: add missing breaks in addr_set_port
  libceph: fix TAG_WAIT case
  ceph: fix broken comparison in readdir loop
  libceph: fix osdmap timestamp assignment
  ceph: fix rare potential cap leak
  libceph: use snprintf for unknown addrs
  libceph: use snprintf for formatting object name
  ceph: use snprintf for dirstat content
  libceph: fix uninitialized value when no get_authorizer method is set
  ...
2011-05-25  Merge branch 'for-davem' of ssh://master.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6  (David S. Miller)
2011-05-25  Squashfs: add extra sanity checks at mount time  (Phillip Lougher)
Add some extra sanity checks of the inode and directory structures. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25  Squashfs: add sanity checks to fragment reading at mount time  (Phillip Lougher)
Fsfuzzer generates corrupted filesystems which throw a warn_on in kmalloc. One of these is due to a corrupted superblock fragments field. Fix this by checking that the number of bytes to be read (and allocated) does not extend into the next filesystem structure. Also add a couple of other sanity checks of the mount-time fragment table structures. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
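The sanity check described is essentially a bounds check against the on-disk position of the next structure; a minimal sketch, with the start/offset variable names as assumptions:

    /* bytes we would read (and kmalloc) for the fragment index table */
    unsigned int length = SQUASHFS_FRAGMENT_INDEX_BYTES(fragments);

    /*
     * A corrupted fragments count in the superblock would otherwise
     * trigger a huge allocation; the table can never legitimately
     * extend into the filesystem structure that follows it on disk.
     */
    if (fragment_table_start + length > next_table_start)
            return ERR_PTR(-EINVAL);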
2011-05-25  Squashfs: add sanity checks to lookup table reading at mount time  (Phillip Lougher)
Fsfuzzer generates corrupted filesystems which throw a warn_on in kmalloc. One of these is due to a corrupted superblock inodes field. Fix this by checking that the number of bytes to be read (and allocated) does not extend into the next filesystem structure. Also add a couple of other sanity checks of the mount-time lookup table structures. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25  Squashfs: add sanity checks to id reading at mount time  (Phillip Lougher)
Fsfuzzer generates corrupted filesystems which throw a warn_on in kmalloc. One of these is due to a corrupted superblock no_ids field. Fix this by checking that the number of bytes to be read (and allocated) does not extend into the next filesystem structure. Also add a couple of other sanity checks of the mount-time id table structures. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25  Squashfs: add sanity checks to xattr reading at mount time  (Phillip Lougher)
These checks add sanity checking of the mount-time xattr structures. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25  Squashfs: reverse order of filesystem table reading  (Phillip Lougher)
Reverse the order of table reading, from (mostly) first-to-last in placement order to last-to-first. This enables the extra superblock sanity checks added in later patches. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25  Squashfs: move table allocation into squashfs_read_table()  (Phillip Lougher)
This eliminates a lot of duplicate code. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs  (Linus Torvalds)
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  9p: update Documentation pointers
  net/9p: enable 9p to work in non-default network namespace
  net/9p: p9_idpool_get return -1 on error
  fs/9p: Don't clunk dentry fid when we fail to get a writeback inode
  9p: Small cleanup in <net/9p/9p.h>
  9p: remove experimental tag from tested configurations
  9p: typo fixes and minor cleanups
  net/9p: Change linuxdoc names to match functions.
2011-05-25  Merge branch 'for-2.6.40/splice' of git://git.kernel.dk/linux-2.6-block  (Linus Torvalds)
* 'for-2.6.40/splice' of git://git.kernel.dk/linux-2.6-block:
  splice: add wakeup_pipe_readers()
2011-05-25  Merge branch 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block  (Linus Torvalds)
* 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block: (40 commits)
  cfq-iosched: free cic_index if cfqd allocation fails
  cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()
  cfq-iosched: reduce bit operations in cfq_choose_req()
  cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()
  blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time
  block: move bd_set_size() above rescan_partitions() in __blkdev_get()
  block: call elv_bio_merged() when merged
  cfq-iosched: Make IO merge related stats per cpu
  cfq-iosched: Fix a memory leak of per cpu stats for root group
  backing-dev: Kill set but not used var in bdi_debug_stats_show()
  block: get rid of on-stack plugging debug checks
  blk-throttle: Make no throttling rule group processing lockless
  blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
  blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
  blk-throttle: Make dispatch stats per cpu
  blk-throttle: Free up a group only after one rcu grace period
  blk-throttle: Use helper function to add root throtl group to lists
  blk-throttle: Introduce a helper function to fill in device details
  blk-throttle: Dynamically allocate root group
  blk-cgroup: Allow sleeping while dynamically allocating a group
  ...
2011-05-25  xfs: correctly decrement the extent buffer index in xfs_bmap_del_extent  (Christoph Hellwig)
The code in xfs_bmap_del_extent does not correctly decrement the extent buffer index when deleting a whole extent. Most of the time this gets caught by checks in xfs_bmapi that work around it by decrementing it manually, which is why the bug wasn't noticed so far. Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: check for valid indices in xfs_iext_get_ext and xfs_iext_idx_to_irec  (Christoph Hellwig)
Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: fix up asserts in xfs_iflush_fork  (Christoph Hellwig)
Remove asserts in xfs_iflush_fork that would call xfs_iext_get_ext with a potentially invalid extent buffer index. Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: do not do pointer arithmetic on extent records  (Christoph Hellwig)
We need to call xfs_iext_get_ext for the previous extent to get a valid pointer; we can't just do pointer arithmetic, as adjacent records might live in different pages. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
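Concretely, the fix is to look the neighbouring record up by index instead of stepping the pointer; a sketch of the pattern (variable names illustrative):

    xfs_bmbt_rec_host_t *ep, *prev;

    /* Wrong: extent records are stored in a paged indirection list,
     * so the record at idx - 1 may live in a different page and
     * "ep - 1" can point at unrelated memory. */
    /* prev = ep - 1; */

    /* Right: resolve the index through the extent-list helper. */
    prev = xfs_iext_get_ext(ifp, idx - 1);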
2011-05-25  xfs: do not use unchecked extent indices in xfs_bunmapi  (Christoph Hellwig)
Make sure to only call xfs_iext_get_ext after we've validated the extent index when moving on to the next index in xfs_bunmapi. Also remove the old workaround for too-large indices, which has been superseded by the proper fix in xfs_bmap_del_extent. Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: do not use unchecked extent indices in xfs_bmapi  (Christoph Hellwig)
Make sure to only call xfs_iext_get_ext after we've validated the extent index when moving on to the next index in xfs_bmapi. Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: do not use unchecked extent indices in xfs_bmap_add_extent_*  (Christoph Hellwig)
Make sure to only call xfs_iext_get_ext after we've validated the extent index in the various xfs_bmap_add_extent_* helpers. Based on an earlier patch from Lachlan McIlroy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: remove if_lastex  (Christoph Hellwig)
The if_lastex field in struct xfs_ifork is only used as a temporary index during xfs_bmapi and xfs_bunmapi. Instead of using the inode fork to store it, keep it local in the call chain. Fortunately this is very easy, as we already pass a stack copy of it down the whole chain, which can simply be changed to be passed by reference. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: remove the unused XFS_BMAPI_RSVBLOCKS flag  (Christoph Hellwig)
The XFS_BMAPI_RSVBLOCKS flag is unused, and as far as I can see always has been. Remove it to simplify the bmapi implementation and conserve stack space. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  fs/ncpfs/inode.c: suppress used-uninitialised warning  (Andrew Morton)
We get this spurious warning:

    fs/ncpfs/inode.c: In function 'ncp_fill_super':
    fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[1u]' may be used uninitialized in this function
    fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[2u]' may be used uninitialized in this function
    fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[3u]' may be used uninitialized in this function
    ...

It's not a bug, but we can easily fix it with a memset(). Reported-by: Harry Wei <jiaweiwei.xiyou@gmail.com> Cc: Petr Vandrovec <petr@vandrovec.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  fscache: remove dead code under CONFIG_WORKQUEUE_DEBUGFS  (Amerigo Wang)
There is no CONFIG_WORKQUEUE_DEBUGFS any more, so this code is dead. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  proc: allocate storage for numa_maps statistics once  (Stephen Wilson)
In show_numa_map() we collect statistics into a numa_maps structure. Since the number of NUMA nodes can be very large, this structure is not a candidate for stack allocation. Instead of going through a kmalloc()+kfree() cycle each time show_numa_map() is invoked, perform the allocation just once when /proc/pid/numa_maps is opened. Performing the allocation when numa_maps is opened, and thus before a reference to the target task's mm is taken, eliminates a potential stalemate condition in the oom-killer as originally described by Hugh Dickins: "... imagine what happens if the system is out of memory, and the mm we're looking at is selected for killing by the OOM killer: while we wait in __get_free_page for more memory, no memory is freed from the selected mm because it cannot reach exit_mmap while we hold that reference." Signed-off-by: Stephen Wilson <wilsons@start.ca> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
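The allocate-once approach is the usual seq_file pattern: allocate private state at open() and free it at release(), so show() never allocates. A minimal sketch under that assumption (the private struct and seq_operations name are illustrative):

    #include <linux/seq_file.h>

    static int numa_maps_open(struct inode *inode, struct file *file)
    {
            struct numa_maps_private *priv;

            /* one allocation per open(), reused by every show() call */
            priv = __seq_open_private(file, &proc_pid_numa_maps_op,
                                      sizeof(*priv));
            return priv ? 0 : -ENOMEM;
    }

    static const struct file_operations proc_numa_maps_operations = {
            .open    = numa_maps_open,
            .read    = seq_read,
            .llseek  = seq_lseek,
            .release = seq_release_private, /* frees the open()-time buffer */
    };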
2011-05-25  proc: make struct proc_maps_private truly private  (Stephen Wilson)
Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps there is no need to export struct proc_maps_private to the world. Move it to fs/proc/internal.h instead. Signed-off-by: Stephen Wilson <wilsons@start.ca> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  mm: proc: move show_numa_map() to fs/proc/task_mmu.c  (Stephen Wilson)
Moving show_numa_map() from mempolicy.c to task_mmu.c solves several issues:

- Having the show() operation "miles away" from the corresponding seq_file iteration operations is a maintenance burden.
- The need to export ad hoc info like struct proc_maps_private is eliminated.
- The implementation of show_numa_map() can be improved in a simple manner by cooperating with the other seq_file operations (start, stop, etc) -- something that would be messy to do without this change.

Signed-off-by: Stephen Wilson <wilsons@start.ca> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  tmpfs: implement generic xattr support  (Eric Paris)
Implement generic xattrs for tmpfs filesystems. The Fedora project, while trying to replace suid apps with file capabilities, realized that tmpfs, which is used on the build systems, does not support file capabilities and thus cannot be used to build packages which use file capabilities. Xattrs are also needed for overlayfs.

The xattr interface is a bit odd. If a filesystem does not implement any {get,set,list}xattr functions the VFS will call into some random LSM hooks and the running LSM can then implement some method for handling xattrs. SELinux for example provides a method to support security.selinux but no other security.* xattrs. As it stands today, when one enables CONFIG_TMPFS_POSIX_ACL, tmpfs will have xattr handler routines specifically to handle acls. Because of this tmpfs would lose the VFS/LSM helpers to support the running LSM. To make up for that, tmpfs had stub functions that did nothing but call into the LSM hooks which implement the helpers.

This new patch does not use the LSM fallback functions and instead just implements a native get/set/list xattr feature for the full security.* and trusted.* namespace like a normal filesystem. This means that tmpfs can now support both security.selinux and security.capability, which was not previously possible.

The basic implementation is that I attach a:

    struct shmem_xattr {
            struct list_head list; /* anchored by shmem_inode_info->xattr_list */
            char *name;
            size_t size;
            char value[0];
    };

into the struct shmem_inode_info for each xattr that is set.

This implementation could easily support the user.* namespace as well, except some care needs to be taken to prevent large amounts of unswappable memory being allocated for unprivileged users.

[mszeredi@suse.cz: new config option, support trusted.*, support symlinks] Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com> Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Hugh Dickins <hughd@google.com> Tested-by: Jordi Pujol <jordipujolp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
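With xattrs on a simple per-inode list as above, lookup is a linear scan; a minimal sketch of what get-by-name plausibly looks like (the helper name and locking convention are assumptions):

    #include <linux/list.h>
    #include <linux/string.h>

    /* caller is assumed to hold the inode's xattr lock */
    static struct shmem_xattr *shmem_xattr_find(struct shmem_inode_info *info,
                                                const char *name)
    {
            struct shmem_xattr *xattr;

            list_for_each_entry(xattr, &info->xattr_list, list) {
                    if (!strcmp(name, xattr->name))
                            return xattr;
            }
            return NULL;
    }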
2011-05-25  vmscan: change shrinker API by passing shrink_control struct  (Ying Han)
Change each shrinker's API by consolidating the existing parameters into a shrink_control struct. This will make it simpler to add further parameters without touching every shrinker implementation. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: fix warning] [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API] [akpm@linux-foundation.org: fix xfs warning] [akpm@linux-foundation.org: update gfs2] Signed-off-by: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
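After the change a shrinker gets its parameters through one struct, so new fields can be added without touching every callback; a sketch of the consolidated shape as I understand the 2011-era API (the cache helpers are hypothetical):

    struct shrink_control {
            gfp_t gfp_mask;            /* allocation context of the caller */
            unsigned long nr_to_scan;  /* objects the VM asks us to free */
    };

    static int my_cache_shrink(struct shrinker *shrink,
                               struct shrink_control *sc)
    {
            if (sc->nr_to_scan)
                    prune_my_cache(sc->nr_to_scan, sc->gfp_mask);

            /* report how many freeable objects remain */
            return my_cache_count();
    }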
2011-05-25  vmscan: change shrink_slab() interfaces by passing shrink_control  (Ying Han)
Consolidate the existing parameters to shrink_slab() into a new shrink_control struct. This is needed later to pass the same struct to shrinkers. Signed-off-by: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  mm: Convert i_mmap_lock to a mutex  (Peter Zijlstra)
Straightforward conversion of i_mmap_lock to a mutex. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  mm: Remove i_mmap_lock lockbreak  (Peter Zijlstra)
Hugh says:

"The only significant loser, I think, would be page reclaim (when concurrent with truncation): could spin for a long time waiting for the i_mmap_mutex it expects would soon be dropped?"

Counter points:
- cpu contention makes the spin stop (need_resched())
- zap pages should be freeing pages at a higher rate than reclaim ever can

I think the simplification of the truncate code is definitely worth it. Effectively reverts: 2aa15890f3c ("mm: prevent concurrent unmap_mapping_range() on the same inode") and takes out the code that caused its problem. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  mm: mmu_gather rework  (Peter Zijlstra)
Rework the existing mmu_gather infrastructure. The direct purpose of these patches was to allow preemptible mmu_gather, but even without that I think these patches provide an improvement to the status quo.

The first 9 patches rework the mmu_gather infrastructure. For review purposes I've split them into generic and per-arch patches, with the last of those a generic cleanup. The next patch provides generic RCU page-table freeing, and the follow-up is a patch converting s390 to use this. I've also got 4 patches from DaveM lined up (not included in this series) that use this to implement gup_fast() for sparc64. Then there is one patch that extends the generic mmu_gather batching.

After that follow the mm preemptibility patches; these make part of the mm a lot more preemptible. They convert i_mmap_lock and anon_vma->lock to mutexes, which together with the mmu_gather rework makes mmu_gather preemptible as well. Making i_mmap_lock a mutex also enables a clean-up of the truncate code. This also allows for preemptible mmu_notifiers, something that XPMEM I think wants. Furthermore, it removes the new and universally detested unmap_mutex.

This patch: Remove the first obstacle towards a fully preemptible mmu_gather. The current scheme assumes mmu_gather is always done with preemption disabled and uses per-cpu storage for the page batches. Change this to try and allocate a page for batching and, in case of failure, use a small on-stack array to make some progress. Preemptible mmu_gather is desired in general and usable once i_mmap_lock becomes a mutex. Doing it before the mutex conversion saves us from having to rework the code by moving the mmu_gather bits inside the pte_lock. Also avoid flushing the tlb batches from under the pte lock; this is useful even without the i_mmap_lock conversion, as it significantly reduces pte lock hold times.

[akpm@linux-foundation.org: fix comment typo] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  mm: make expand_downwards() symmetrical with expand_upwards()  (Michal Hocko)
Currently we have expand_upwards exported while expand_downwards is accessible only via expand_stack or expand_stack_downwards. check_stack_guard_page is a nice example of the asymmetry. It uses expand_stack for VM_GROWSDOWN while expand_upwards is called for VM_GROWSUP case. Let's clean this up by exporting both functions and make those names consistent. Let's use expand_{upwards,downwards} because expanding doesn't always involve stack manipulation (an example is ia64_do_page_fault which uses expand_upwards for registers backing store expansion). expand_downwards has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards version in the early process initialization phase for growsup configuration. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hugh Dickins <hughd@google.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  fs/9p: Don't clunk dentry fid when we fail to get a writeback inode  (Aneesh Kumar K.V)
The dentry fid gets clunked via the dput(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-05-25  9p: remove experimental tag from tested configurations  (Eric Van Hensbergen)
The 9p client is currently undergoing regular regression and stress testing as a by-product of the virtfs work. I think it's finally time to take the experimental tags off the well-tested code paths. Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-05-25  ext4: do not normalize block requests from fallocate()  (Vivek Haldar)
Currently, an fallocate request of size slightly larger than a power of 2 is turned into two block requests, each a power of 2, with the extra blocks pre-allocated for future use. When an application calls fallocate, it already has an idea about how large the file may grow, so there is usually little benefit to reserving extra blocks on the preallocation list. This reduces disk fragmentation. Tested: fsstress. Also verified manually that fallocated files are contiguously laid out with this change (whereas without it they begin at power-of-2 boundaries, leaving blocks in between). CPU usage of fallocate is not appreciably higher. In a tight fallocate loop, CPU usage hovers between 5%-8% with this change, and 5%-7% without it. Using a simulated file system aging program which fills the file system to 70%, the percentage of free extents larger than 8MB (as measured by e2freefrag) increased from 38.8% without this change to 69.4% with it. Signed-off-by: Vivek Haldar <haldar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25  ext4: enable "punch hole" functionality  (Allison Henderson)
This patch adds the new routines ext4_punch_hole, ext4_ext_punch_hole and ext4_ext_check_cache. fallocate has been modified to call ext4_punch_hole when the punch hole flag is passed. At the moment, we only support punching holes in extents, so this routine is pretty much a wrapper for the ext4_ext_punch_hole routine. The ext4_ext_punch_hole routine first completes all outstanding writes with the associated pages, and then releases them. The non-block-aligned data is zeroed, and all blocks in between are punched out. The ext4_ext_check_cache routine is very similar to ext4_ext_in_cache except that it accepts an ext4_ext_cache parameter instead of an ext4_extent parameter. This routine is used by ext4_ext_punch_hole to check whether a block in the hole has been cached. The ext4_ext_cache parameter is necessary because the members of the ext4_extent structure are not large enough to hold a 32-bit value. The existing ext4_ext_in_cache routine has become a wrapper to this new function. [ext4 punch hole patch series 5/5 v7] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25ext4: add "punch hole" flag to ext4_map_blocks()Allison Henderson
This patch adds a new flag to ext4_map_blocks() that specifies that the given range of blocks should be punched out. Extents are first converted to uninitialized extents before they are punched out. Because punching a hole may require that the extent be split, it is possible that the splitting may need more blocks than are available. To deal with this, use of reserved blocks is enabled to allow the split to proceed. The routine then returns the number of blocks successfully punched out. [ext4 punch hole patch series 4/5 v7] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25  ext4: punch out extents  (Allison Henderson)
This patch modifies the truncate routines to support hole punching. Below is a brief summary of the patch's changes:

- Added an "end" parameter to ext4_ext_rm_leaf. This function has been modified to accept an end parameter, which enables it to punch holes in leafs instead of just truncating them.
- Implemented the "remove head" case in the ext4_remove_blocks routine. This routine is used by ext4_ext_rm_leaf to remove the tail of an extent during a truncate. The new ext4_ext_rm_leaf routine will now also use it to remove the head of an extent in the case that the hole covers a region of blocks at the beginning of an extent.
- Added an "end" parameter to the ext4_ext_remove_space routine. This function has been modified to accept an end parameter, which is passed through to ext4_ext_rm_leaf.

[ext4 punch hole patch series 3/5 v6] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25  ext4: add new function ext4_block_zero_page_range()  (Allison Henderson)
This patch modifies the existing ext4_block_truncate_page() function, which was used by the truncate code path to zero out block-unaligned data, by adding a new length parameter, and renames it to ext4_block_zero_page_range(). This function can now be used to zero out the head of a block, the tail of a block, or the middle of a block. The ext4_block_truncate_page() function is now a wrapper to ext4_block_zero_page_range(). [ext4 punch hole patch series 2/5 v7] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>
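The wrapper relationship is simple enough to sketch; the length computation below is the natural one (zero from the offset to the end of its block), though not necessarily the verbatim patch:

    int ext4_block_truncate_page(handle_t *handle,
                                 struct address_space *mapping, loff_t from)
    {
            unsigned int blocksize = mapping->host->i_sb->s_blocksize;
            /* zero from "from" to the end of its containing block */
            unsigned int length = blocksize - (from & (blocksize - 1));

            return ext4_block_zero_page_range(handle, mapping, from, length);
    }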
2011-05-25  ext4: add flag to ext4_has_free_blocks  (Allison Henderson)
This patch adds an allocation request flag to the ext4_has_free_blocks function which enables the use of reserved blocks. This will allow a punch hole to proceed even if the disk is full. Punching a hole may require additional blocks to first split the extents. Because ext4_has_free_blocks is a low level function, the flag needs to be passed down through several functions, listed below:

    ext4_ext_insert_extent
    ext4_ext_create_new_leaf
    ext4_ext_grow_indepth
    ext4_ext_split
    ext4_ext_new_meta_block
    ext4_mb_new_blocks
    ext4_claim_free_blocks
    ext4_has_free_blocks

[ext4 punch hole patch series 1/5 v7] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>
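A sketch of what the bottom of that call chain plausibly looks like; the flag name is an assumption (the commit text does not spell it out) and the accounting is simplified:

    /* hypothetical flag: allocation may dip into the reserved root blocks */
    #define EXT4_ALLOC_USE_ROOT_BLOCKS 0x0001

    static int ext4_has_free_blocks(struct ext4_sb_info *sbi,
                                    s64 nblocks, unsigned int flags)
    {
            s64 free = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
            s64 root = ext4_r_blocks_count(sbi->s_es);

            /* a hole punch may need blocks for an extent split even
             * when only the root reserve is left */
            if (flags & EXT4_ALLOC_USE_ROOT_BLOCKS)
                    return free >= nblocks;

            return free - root >= nblocks;
    }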
2011-05-25  GFS2: Processes waiting on inode glock that no processes are holding  (Bob Peterson)
This patch fixes a race in the GFS2 glock state machine that may result in lockups. The symptom is that all nodes but one will hang, waiting for a particular glock. All the holder records will have the "W" (Waiting) bit set. The other node will typically have the glock stuck in Exclusive mode (EX) with no holder records, but the dinode will be cached. In other words, an entry with "I:" will appear in the glock dump for that glock, but nothing else.

The race has to do with the glock "Pending Demote" bit, which can be set, then immediately reset, thus losing the fact that another node needs the glock. The sequence of events is:

1. Something schedules the glock workqueue (e.g. glock request from fs)
2. The glock workqueue gets to the point between the test of the reply pending bit and the spin lock:

        if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags)) {
                finish_xmote(gl, gl->gl_reply);
                drop_ref = 1;
        }
        down_read(&gfs2_umount_flush_sem); <---- i.e. here
        spin_lock(&gl->gl_spin);

3. In comes (a) the reply to our EX lock request setting GLF_REPLY_PENDING and (b) the demote request which sets GLF_PENDING_DEMOTE
4. The following test is executed:

        if (test_and_clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
            gl->gl_state != LM_ST_UNLOCKED &&
            gl->gl_demote_state != LM_ST_EXCLUSIVE) {

   This resets the pending demote flag, and gl->gl_demote_state is not equal to exclusive; however, because the reply from the dlm arrived after we checked for the GLF_REPLY_PENDING flag, gl->gl_state is still equal to unlocked, so although we reset the GLF_PENDING_DEMOTE flag, we didn't then set the GLF_DEMOTE flag or reinstate the GLF_PENDING_DEMOTE flag.

The patch closes the timing window by only transitioning the "Pending demote" bit to the "demote" flag once we know the other conditions (not unlocked and not exclusive) are met.

Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-05-25  Ocfs2/move_extents: Set several trivial constraints for threshold.  (Tristan Ye)
The threshold should be greater than clustersize and less than i_size. Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
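A minimal sketch of those constraints as a validation step, with the field names assumed for illustration:

    /* threshold must lie strictly between the cluster size and i_size */
    static int ocfs2_validate_move_threshold(struct inode *inode,
                                             struct ocfs2_super *osb,
                                             u64 threshold)
    {
            if (threshold <= osb->s_clustersize)
                    return -EINVAL;
            if (threshold >= i_size_read(inode))
                    return -EINVAL;
            return 0;
    }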
2011-05-25  Ocfs2/move_extents: Let defrag handle partial extent moving.  (Tristan Ye)
We're going to support partial extent moving, which may split an entire extent movement into pieces to cope with insufficient allocations. This eases the 'ENOSPC' pain and makes the whole move much less likely to fail; the downside is that it may make the fs even more fragmented before moving, so we just let userspace make the trade-off here. Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: move/defrag extents within a certain range.  (Tristan Ye)
The basic logic of moving extents for a file is pretty much like the hole-punching sequence: walk the extents within the user-specified range, calculate an appropriate length to defrag/move, then let ocfs2_defrag/move_extent() do the actual moving. This function ends up reporting 'OCFS2_MOVE_EXT_FL_COMPLETE' to userspace if the operation completes successfully. Signed-off-by: Tristan Ye <tristan.ye@oracle.com>