summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2012-10-25Btrfs: fix deadlock caused by the nested chunk allocationMiao Xie
Steps to reproduce: # mkfs.btrfs -m raid1 <disk1> <disk2> # btrfstune -S 1 <disk1> # mount <disk1> <mnt> # btrfs device add <disk3> <disk4> <mnt> # mount -o remount,rw <mnt> # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1 Deadlock happened. It is because of the nested chunk allocation. When we wrote the data into the filesystem, we would allocate the data chunk because there was no data chunk in the filesystem. At the end of the data chunk allocation, we should insert the metadata of the data chunk into the extent tree, but there was no raid1 chunk, so we tried to lock the chunk allocation mutex to allocate the new chunk, but we had held the mutex, the deadlock happened. By rights, we would allocate the raid1 chunk when we added the second device because the profile of the seed filesystem is raid1 and we had two devices. But we didn't do that in fact. It is because the last step of the first device insertion didn't commit the transaction. So when we added the second device, we didn't cow the tree, and just inserted the relative metadata into the leaves which were generated by the first device insertion, and its profile was dup. So, I fix this problem by commiting the transaction at the end of the first device insertion. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-25btrfs: Return EINVAL when length to trim is less than FSBLukas Czerner
Currently if len argument in btrfs_ioctl_fitrim() is smaller than one FSB we will continue and finally return 0 bytes discarded. However if the length to discard is smaller then file system block we should really return EINVAL. Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2012-10-25Btrfs: fix memory leak in btrfs_quota_enable()Tsutomu Itoh
We should free quota_root before returning from the error handling code. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-25Btrfs: send correct rdev and mode in btrfs-sendArne Jansen
When sending a device file, the stream was missing the mode. Also the rdev was encoded wrongly. Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-10-25Btrfs: extended inode refs support for send mechanismJan Schmidt
This adds support for the new extended inode refs to btrfs send. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-25Btrfs: Fix wrong error handling codeStefan Behrens
gcc says "warning: comparison of unsigned expression >= 0 is always true" because i is an unsigned long. And gcc is right this time. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-10-25Fix a sign bug causing invalid memory access in the ino_paths ioctl.Gabriel de Perthuis
To see the problem, create many hardlinks to the same file (120 should do it), then look up paths by inode with: ls -i btrfs inspect inode-resolve -v $ino /mnt/btrfs I noticed the memory layout of the fspath->val data had some irregularities (some unnecessary gaps that stop appearing about halfway), so I'm not sure there aren't any bugs left in it.
2012-10-25tty, ioctls -- Add new ioctl definitions for tty flags fetchingCyrill Gorcunov
This patch defines new ioctl codes TIOCGPKT, TIOCGPTLCK, TIOCGEXCL for fetching pty's packet mode and locking state, and exclusive mode of tty. [ No real handlers for the codes though, this will be addressed in another patch for easier review and bisectability ] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> CC: Alan Cox <alan@lxorguk.ukuu.org.uk> CC: "H. Peter Anvin" <hpa@zytor.com> CC: Pavel Emelyanov <xemul@parallels.com> CC: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-24sysfs: sysfs_pathname/sysfs_add_one: Use strlcat() instead of strcat()Geert Uytterhoeven
The warning check for duplicate sysfs entries can cause a buffer overflow when printing the warning, as strcat() doesn't check buffer sizes. Use strlcat() instead. Since strlcat() doesn't return a pointer to the passed buffer, unlike strcat(), I had to convert the nested concatenation in sysfs_add_one() to an admittedly more obscure comma operator construct, to avoid emitting code for the concatenation if CONFIG_BUG is disabled. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-24LOCKD: Clear ln->nsm_clnt only when ln->nsm_users is zeroTrond Myklebust
The current code is clearing it in all cases _except_ when zero. Reported-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
2012-10-24LOCKD: fix races in nsm_client_getTrond Myklebust
Commit e9406db20fecbfcab646bad157b4cfdc7cadddfb (lockd: per-net NSM client creation and destruction helpers introduced) contains a nasty race on initialisation of the per-net NSM client because it doesn't check whether or not the client is set after grabbing the nsm_create_mutex. Reported-by: Nix <nix@esperi.org.uk> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
2012-10-24Btrfs: comment for loop in tree_mod_log_insert_moveJan Schmidt
Emphasis the way tree_mod_log_insert_move avoids adding MOD_LOG_KEY_REMOVE_WHILE_MOVING operations, depending on the direction of the move operation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24Btrfs: fix extent buffer reference for tree mod log rootsJan Schmidt
In get_old_root we grab a lock on the extent buffer before we obtain a reference on that buffer. That order is changed now. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24Btrfs: determine level of old rootsJan Schmidt
In btrfs_find_all_roots' termination condition, we compare the level of the old buffer we got from btrfs_search_old_slot to the level of the current root node. We'd better compare it to the level of the rewinded root node. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24Btrfs: tree mod log's old roots could still be part of the treeJan Schmidt
Tree mod log treated old root buffers as always empty buffers when starting the rewind operations. However, the old root may still be part of the current tree at a lower level, with still some valid entries. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core kernel fixes from Ingo Molnar: "Two small fixes" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation: Reflect the new location of the NMI watchdog info nohz: Fix idle ticks in cpu summary line of /proc/stat
2012-10-23Btrfs: fix a tree mod logging issue for root replacement operationsJan Schmidt
Avoid the implicit free by tree_mod_log_set_root_pointer, which is wrong in two places. Where needed, we call tree_mod_log_free_eb explicitly now. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-23Btrfs: don't put removals from push_node_left into tree mod log twiceJan Schmidt
Independant of the check (push_items < src_items) tree_mod_log_eb_copy did log the removal of the old data entries from the source buffer. Therefore, we must not call tree_mod_log_eb_move if the check evaluates to true, as that would log the removal twice, finally resulting in (rewinded) buffers with wrong values for header_nritems. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-23Merge tag 'jfs-3.7-2' of git://github.com/kleikamp/linux-shaggyLinus Torvalds
Pull jfs fix from Dave Kleikamp: "Bug fix: Fix FITRIM argument handling" * tag 'jfs-3.7-2' of git://github.com/kleikamp/linux-shaggy: jfs: Fix FITRIM argument handling
2012-10-23Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Various bug fixes for ext4. The most serious of them fixes a security bug (CVE-2012-4508) which leads to stale data exposure when we have fallocate racing against writes to files undergoing delayed allocation. We also have two fixes for the metadata checksum feature, the most serious of which can cause the superblock to have a invalid checksum after a power failure." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Avoid underflow in ext4_trim_fs() ext4: Checksum the block bitmap properly with bigalloc enabled ext4: fix undefined bit shift result in ext4_fill_flex_info ext4: fix metadata checksum calculation for the superblock ext4: race-condition protection for ext4_convert_unwritten_extents_endio ext4: serialize fallocate with ext4_convert_unwritten_extents
2012-10-23Merge tag 'nfs-for-3.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: - Do not call pnfs_return_layout() from an rpciod context - nfs4_ds_disconnect can cause Oopses. Kill it... - Fix the return value for nfs_callback_start_svc - Fix a number of compile warnings * tag 'nfs-for-3.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Fix the return value for nfs_callback_start_svc NFSv4.1: Declare osd_pri_2_pnfs_err(), objio_init_read/write to be static NFSv4: fs/nfs/nfs4getroot.c needs to include "internal.h" NFSv4.1: Use kcalloc() to allocate zeroed arrays instead of kzalloc() NFSv4.1: Do not call pnfs_return_layout() from an rpciod context NFSv4.1: Kill nfs4_ds_disconnect()
2012-10-22TTY: devpts, document devpts inode operationsJiri Slaby
Add kernel-doc texts for some devpts functions, i.e. document them. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-22TTY: devpts, do not set driver_dataJiri Slaby
The goal is to stop setting and using tty->driver_data in devpts code. It should be used solely by the driver's code, pty in this case. Now driver_data are managed only in the pty driver. devpts_pty_new is switched to accept what we used to dig out of tty_struct, i.e. device node number and index. This also removes a note about driver_data being set outside of the driver. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-22TTY: devpts, return created inode from devpts_pty_newJiri Slaby
The goal is to stop setting and using tty->driver_data in devpts code. It should be used solely by the driver's code, pty in this case. For the cleanup of layering, we will need the inode created in devpts_pty_new to be stored into slave's driver_data. So we convert devpts_pty_new to return the inode or an ERR_PTR-encoded error in case of failure. The move of 'inode = new_inode(sb);' from declarators to the code is only cosmetical, but it makes the code easier to read. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-22TTY: devpts, don't care about TTY in devpts_get_ttyJiri Slaby
The goal is to stop setting and using tty->driver_data in devpts code. It should be used solely by the driver's code, pty in this case. First, here we remove TTY from devpts_get_tty and rename it to devpts_get_priv. Note we do not remove type safety, we just shift the [implicit] (void *) cast one layer up. index was unused in devpts_get_tty, so remove that from the prototype too. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-22ext4: Avoid underflow in ext4_trim_fs()Lukas Czerner
Currently if len argument in ext4_trim_fs() is smaller than one block, the 'end' variable underflow. Avoid that by returning EINVAL if len is smaller than file system block. Also remove useless unlikely(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-10-22vfs: fix: don't increase bio_slab_max if krealloc() failsAnna Leuschner
Without the patch, bio_slab_max, representing bio_slabs capacity, is increased before krealloc() of bio_slabs. If krealloc() fails, bio_slab_max is too high. Fix that by only updating bio_slab_max if krealloc() is successful. Signed-off-by: Anna Leuschner <anna.m.leuschner@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-22char_dev: pin parent kobjectDmitry Torokhov
In certain cases (for example when a cdev structure is embedded into another object whose lifetime is controlled by a separate kobject) it is beneficial to tie lifetime of another object to the lifetime of character device so that related object is not freed until after char_dev object is freed. To achieve this let's pin kobject's parent when doing cdev_add() and unpin when last reference to cdev structure is being released. Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-22ext4: Checksum the block bitmap properly with bigalloc enabledTao Ma
In mke2fs, we only checksum the whole bitmap block and it is right. While in the kernel, we use EXT4_BLOCKS_PER_GROUP to indicate the size of the checksumed bitmap which is wrong when we enable bigalloc. The right size should be EXT4_CLUSTERS_PER_GROUP and this patch fixes it. Also as every caller of ext4_block_bitmap_csum_set and ext4_block_bitmap_csum_verify pass in EXT4_BLOCKS_PER_GROUP(sb)/8, we'd better removes this parameter and sets it in the function itself. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@vger.kernel.org
2012-10-19hold task->mempolicy while numa_maps scans.KAMEZAWA Hiroyuki
/proc/<pid>/numa_maps scans vma and show mempolicy under mmap_sem. It sometimes accesses task->mempolicy which can be freed without mmap_sem and numa_maps can show some garbage while scanning. This patch tries to take reference count of task->mempolicy at reading numa_maps before calling get_vma_policy(). By this, task->mempolicy will not be freed until numa_maps reaches its end. V2->v3 - updated comments to be more verbose. - removed task_lock() in numa_maps code. V1->V2 - access task->mempolicy only once and remember it. Becase kernel/exit.c can overwrite it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19Merge branch 'for-3.7' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd bugfixes from J Bruce Fields. * 'for-3.7' of git://linux-nfs.org/~bfields/linux: SUNRPC: Prevent kernel stack corruption on long values of flush NLM: nlm_lookup_file() may return NLMv4-specific error codes
2012-10-18xfs: move allocation stack switch up to xfs_bmapi_allocateDave Chinner
Switching stacks are xfs_alloc_vextent can cause deadlocks when we run out of worker threads on the allocation workqueue. This can occur because xfs_bmap_btalloc can make multiple calls to xfs_alloc_vextent() and even if xfs_alloc_vextent() fails it can return with the AGF locked in the current allocation transaction. If we then need to make another allocation, and all the allocation worker contexts are exhausted because the are blocked waiting for the AGF lock, holder of the AGF cannot get it's xfs-alloc_vextent work completed to release the AGF. Hence allocation effectively deadlocks. To avoid this, move the stack switch one layer up to xfs_bmapi_allocate() so that all of the allocation attempts in a single switched stack transaction occur in a single worker context. This avoids the problem of an allocation being blocked waiting for a worker thread whilst holding the AGF. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-18xfs: introduce XFS_BMAPI_STACK_SWITCHDave Chinner
Certain allocation paths through xfs_bmapi_write() are in situations where we have limited stack available. These are almost always in the buffered IO writeback path when convertion delayed allocation extents to real extents. The current stack switch occurs for userdata allocations, which means we also do stack switches for preallocation, direct IO and unwritten extent conversion, even those these call chains have never been implicated in a stack overrun. Hence, let's target just the single stack overun offended for stack switches. To do that, introduce a XFS_BMAPI_STACK_SWITCH flag that the caller can pass xfs_bmapi_write() to indicate it should switch stacks if it needs to do allocation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-18xfs: zero allocation_args on the kernel stackMark Tinguely
Zero the kernel stack space that makes up the xfs_alloc_arg structures. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-18fs, xattr: fix bug when removing a name not in xattr listDavid Rientjes
Commit 38f38657444d ("xattr: extract simple_xattr code from tmpfs") moved some code from tmpfs but introduced a subtle bug along the way. If the name passed to simple_xattr_remove() does not exist in the list of xattrs, then it is possible to call kfree(new_xattr) when new_xattr is actually initialized to itself on the stack via uninitialized_var(). This causes a BUG() since the memory was not allocated via the slab allocator and was not bypassed through to the page allocator because it was too large. Initialize the local variable to NULL so the kfree() never takes place. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-17xfs: only update the last_sync_lsn when a transaction completesDave Chinner
The log write code stamps each iclog with the current tail LSN in the iclog header so that recovery knows where to find the tail of thelog once it has found the head. Normally this is taken from the first item on the AIL - the log item that corresponds to the oldest active item in the log. The problem is that when the AIL is empty, the tail lsn is dervied from the the l_last_sync_lsn, which is the LSN of the last iclog to be written to the log. In most cases this doesn't happen, because the AIL is rarely empty on an active filesystem. However, when it does, it opens up an interesting case when the transaction being committed to the iclog spans multiple iclogs. That is, the first iclog is stamped with the l_last_sync_lsn, and IO is issued. Then the next iclog is setup, the changes copied into the iclog (takes some time), and then the l_last_sync_lsn is stamped into the header and IO is issued. This is still the same transaction, so the tail lsn of both iclogs must be the same for log recovery to find the entire transaction to be able to replay it. The problem arises in that the iclog buffer IO completion updates the l_last_sync_lsn with it's own LSN. Therefore, If the first iclog completes it's IO before the second iclog is filled and has the tail lsn stamped in it, it will stamp the LSN of the first iclog into it's tail lsn field. If the system fails at this point, log recovery will not see a complete transaction, so the transaction will no be replayed. The fix is simple - the l_last_sync_lsn is updated when a iclog buffer IO completes, and this is incorrect. The l_last_sync_lsn shoul dbe updated when a transaction is completed by a iclog buffer IO. That is, only iclog buffers that have transaction commit callbacks attached to them should update the l_last_sync_lsn. This means that the last_sync_lsn will only move forward when a commit record it written, not in the middle of a large transaction that is rolling through multiple iclog buffers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: remove xfs_iget.cDave Chinner
The inode cache functions remaining in xfs_iget.c can be moved to xfs_icache.c along with the other inode cache functions. This removes all functionality from xfs_iget.c, so the file can simply be removed. This move results in various functions now only having the scope of a single file (e.g. xfs_inode_free()), so clean up all the definitions and exported prototypes in xfs_icache.[ch] and xfs_inode.h appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: move inode locking functions to xfs_inode.cDave Chinner
xfs_ilock() and friends really aren't related to the inode cache in any way, so move them to xfs_inode.c with all the other inode related functionality. While doing this move, move the xfs_ilock() tracepoints to *before* the lock is taken so that when a hang on a lock occurs we have events to indicate which process and what inode we were trying to lock when the hang occurred. This is much better than the current silence we get on a hang... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: rename xfs_sync.[ch] to xfs_icache.[ch]Dave Chinner
xfs_sync.c now only contains inode reclaim functions and inode cache iteration functions. It is not related to sync operations anymore. Rename to xfs_icache.c to reflect it's contents and prepare for consolidation with the other inode cache file that exists (xfs_iget.c). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: xfs_quiesce_attr() should quiesce the log like unmountDave Chinner
xfs_quiesce_attr() is supposed to leave the log empty with an unmount record written. Right now it does not wait for the AIL to be emptied before writing the unmount record, not does it wait for metadata IO completion, either. Fix it to use the same method and code as xfs_log_unmount(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: move xfs_quiesce_attr() into xfs_super.cDave Chinner
Both callers of xfs_quiesce_attr() are in xfs_super.c, and there's nothing really sync-specific about this functionality so it doesn't really matter where it lives. Move it to benext to it's callers, so all the remount/sync_fs code is in the one place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: xfs_sync_fsdata is redundantDave Chinner
Why do we need to write the superblock to disk once we've written all the data? We don't actually - the reasons for doing this are lost in the mists of time, and go back to the way Irix used to drive VFS flushing. On linux, this code is only called from two contexts: remount and .sync_fs. In the remount case, the call is followed by a metadata sync, which unpins and writes the superblock. In the sync_fs case, we only need to force the log to disk to ensure that the superblock is correctly on disk, so we don't actually need to write it. Hence the functionality is either redundant or superfluous and thus can be removed. Seeing as xfs_quiesce_data is essentially now just a log force, remove it as well and fold the code back into the two callers. Neither of them need the log covering check, either, as that is redundant for the remount case, and unnecessary for the .sync_fs case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: syncd workqueue is no moreDave Chinner
With the syncd functions moved to the log and/or removed, the syncd workqueue is the only remaining bit left. It is used by the log covering/ail pushing work, as well as by the inode reclaim work. Given how cheap workqueues are these days, give the log and inode reclaim work their own work queues and kill the syncd work queue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: xfs_sync_data is redundant.Dave Chinner
We don't do any data writeback from XFS any more - the VFS is completely responsible for that, including for freeze. We can replace the remaining caller with a VFS level function that achieves the same thing, but without conflicting with current writeback work. This means we can remove the flush_work and xfs_flush_inodes() - the VFS functionality completely replaces the internal flush queue for doing this writeback work in a separate context to avoid stack overruns. This does have one complication - it cannot be called with page locks held. Hence move the flushing of delalloc space when ENOSPC occurs back up into xfs_file_aio_buffered_write when we don't hold any locks that will stall writeback. Unfortunately, writeback_inodes_sb_if_idle() is not sufficient to trigger delalloc conversion fast enough to prevent spurious ENOSPC whent here are hundreds of writers, thousands of small files and GBs of free RAM. Hence we need to use sync_sb_inodes() to block callers while we wait for writeback like the previous xfs_flush_inodes implementation did. That means we have to hold the s_umount lock here, but because this call can nest inside i_mutex (the parent directory in the create case, held by the VFS), we have to use down_read_trylock() to avoid potential deadlocks. In practice, this trylock will succeed on almost every attempt as unmount/remount type operations are exceedingly rare. Note: we always need to pass a count of zero to generic_file_buffered_write() as the previously written byte count. We only do this by accident before this patch by the virtue of ret always being zero when there are no errors. Make this explicit rather than needing to specifically zero ret in the ENOSPC retry case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: Bring some sanity to log unmountingDave Chinner
When unmounting the filesystem, there are lots of operations that need to be done in a specific order, and they are spread across across a couple of functions. We have to drain the AIL before we write the unmount record, and we have to shut down the background log work before we do either of them. But this is all split haphazardly across xfs_unmountfs() and xfs_log_unmount(). Move all the AIL flushing and log manipulations to xfs_log_unmount() so that the responisbilities of each function is clear and the operations they perform obvious. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: sync work is now only periodic log workDave Chinner
The only thing the periodic sync work does now is flush the AIL and idle the log. These are really functions of the log code, so move the work to xfs_log.c and rename it appropriately. The only wart that this leaves behind is the xfssyncd_centisecs sysctl, otherwise the xfssyncd is dead. Clean up any comments that related to xfssyncd to reflect it's passing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: don't run the sync work if the filesystem is read-onlyDave Chinner
If the filesystem is mounted or remounted read-only, stop the sync worker that tries to flush or cover the log if the filesystem is dirty. It's read-only, so it isn't dirty. Restart it on a remount,rw as necessary. This avoids the need for RO checks in the work. Similarly, stop the sync work when the filesystem is frozen, and start it again when the filesysetm is thawed. This avoids the need for special freeze checks in the work. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: rationalise xfs_mount_wq usersDave Chinner
Instead of starting and stopping background work on the xfs_mount_wq all at the same time, separate them to where they really are needed to start and stop. The xfs_sync_worker, only needs to be started after all the mount processing has completed successfully, while it needs to be stopped before the log is unmounted. The xfs_reclaim_worker is started on demand, and can be stopped before the unmount process does it's own inode reclaim pass. The xfs_flush_inodes work is run on demand, and so we really only need to ensure that it has stopped running before we start processing an unmount, freeze or remount,ro. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17xfs: xfs_syncd_stop must dieDave Chinner
xfs_syncd_start and xfs_syncd_stop tie a bunch of unrelated functionailty together that actually have different start and stop requirements. Kill these functions and open code the start/stop methods for each of the background functions. Subsequent patches will move the start/stop functions around to the correct places to avoid races and shutdown issues. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-17jfs: Fix FITRIM argument handlingLukas Czerner
Currently when 'range->start' is beyond the end of file system nothing is done and that fact is ignored, where in fact we should return EINVAL. The same problem is when 'range.len' is smaller than file system block. Fix this by adding check for such conditions and return EINVAL appropriately. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Acked-by: Tino Reichardt <milky-kernel@mcmilk.de> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>