summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-09-23xfs: xlog_cil_force_lsn doesn't always wait correctlyDave Chinner
When running a tight mount/unmount loop on an older kernel, RedHat QE found that unmount would occasionally hang in xfs_buf_unpin_wait() on the superblock buffer. Tracing and other debug work by Eric Sandeen indicated that it was hanging on the writing of the superblock during unmount immediately after logging the superblock counters in a synchronous transaction. Further debug indicated that the synchronous transaction was not waiting for completion correctly, and we narrowed it down to xlog_cil_force_lsn() returning NULLCOMMITLSN and hence not pushing the transaction in the iclog buffer to disk correctly. While this unmount superblock write code is now very different in mainline kernels, the xlog_cil_force_lsn() code is identical, and it was bisected to the backport of commit f876e44 ("xfs: always do log forces via the workqueue"). This commit made the CIL push asynchronous for log forces and hence exposed a race condition that couldn't occur on a synchronous push. Essentially, the xlog_cil_force_lsn() relied implicitly on the fact that the sequence push would be complete by the time xlog_cil_push_now() returned, resulting in the context being pushed being in the committing list. When it was made asynchronous, it was recognised that there was a race condition in detecting whether an asynchronous push has started or not and code was added to handle it. Unfortunately, the fix was not quite right and left a race condition where it it would detect an empty CIL while a push was in progress before the context had been added to the committing list. This was incorrectly seen as a "nothing to do" condition and so would tell xfs_log_force_lsn() that there is nothing to wait for, and hence it would push the iclogbufs in memory. The fix is simple, but explaining the logic and the race condition is a lot more complex. The fix is to add the context to the committing list before we start emptying the CIL. This allows us to detect the difference between an empty "do nothing" push and a push that has not started by adding a discrete "emptying the CIL" state to avoid the transient, incorrect "empty" condition that the (unchanged) waiting code was seeing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23Merge branch 'xfs-shift-extents-rework' into for-nextDave Chinner
2014-09-23xfs: only writeback and truncate pages for the freed rangeBrian Foster
xfs_free_file_space() only affects the range of the file for which space is being freed. It currently writes and truncates the page cache from the start offset of the free to EOF. Modify xfs_free_file_space() to write back and truncate page cache of just the range being freed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: writeback and inval. file range to be shifted by collapseBrian Foster
The collapse range operation currently writes the entire file before starting the collapse to avoid changes in the in-core extent list due to writeback causing the extent count to change. Now that collapse range is fsb based rather than extent index based it can sustain changes in the extent list during the shift sequence without disruption. Modify xfs_collapse_file_space() to writeback and invalidate pages associated with the range of the file to be shifted. xfs_free_file_space() currently has similar behavior, but the space free need only affect the region of the file that is freed and this could change in the future. Also update the comments to reflect the current implementation. We retain the eofblocks trim permanently as a best option for dealing with delalloc extents. We don't shift delalloc extents because this scenario only occurs with post-eof preallocation (since data must be flushed such that the cache can be invalidated and data can be shifted). That means said space must also be initialized before being shifted into the accessible region of the file only to be immediately truncated off as the last part of the collapse. In other words, the eofblocks trim will happen anyways, we just run it first to ensure the file remains in a consistent state throughout the collapse. Finally, detect and fail explicitly in the event of a delalloc extent during the extent shift. The implementation does not support delalloc extents and the caller is expected to prevent this scenario in advance as is done by collapse. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: refactor single extent shift into xfs_bmse_shift_one() helperBrian Foster
xfs_bmap_shift_extents() has a variety of conditions and error checks that make the logic difficult to follow and indent heavy. Refactor the loop body of this function into a new xfs_bmse_shift_one() helper. This simplifies the error checks, eliminates index decrement on merge hack by pushing the index increment down into the helper, and makes the code more readable by reducing multiple levels of indentation. This is a code refactor only. The behavior of extent shift and collapse range is not modified. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: refactor shift-by-merge into xfs_bmse_merge() helperBrian Foster
The extent shift mechanism in xfs_bmap_shift_extents() is complicated and handles several different, non-deterministic scenarios. These include extent shifts, extent merges and potential btree updates in either of the former scenarios. Refactor the code to be more linear and readable. The loop logic in xfs_bmap_shift_extents() and some initial error checking is adjusted slightly. The associated btree lookup and update/delete operations are condensed into single blocks of code. This reduces the number of btree-specific blocks and facilitates the separation of the merge operation into a new xfs_bmse_merge() and xfs_bmse_can_merge() helpers. This is a code refactor only. The behavior of extent shift and collapse range is not modified. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: track collapse via file offset rather than extent indexBrian Foster
The collapse range implementation uses a transaction per extent shift. The progress of the overall operation is tracked via the current extent index of the in-core extent list. This is racy because the ilock must be dropped and reacquired for each transaction according to locking and log reservation rules. Therefore, writeback to prior regions of the file is possible and can change the extent count. This changes the extent to which the current index refers and causes the collapse to fail mid operation. To avoid this problem, the entire file is currently written back before the collapse operation starts. To eliminate the need to flush the entire file, use the file offset (fsb) to track the progress of the overall extent shift operation rather than the extent index. Modify xfs_bmap_shift_extents() to unconditionally convert the start_fsb parameter to an extent index and return the file offset of the extent where the shift left off, if further extents exist. The bulk of ths function can remain based on extent index as ilock is held by the caller. xfs_collapse_file_space() now uses the fsb output as the starting point for the subsequent shift. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: ensure WB_SYNC_ALL writeback handles partial pages correctlyDave Chinner
XFS has been having trouble with stray delayed allocation extents beyond EOF for a long time. Recent changes to the collapse range code has triggered erroneous EBUSY errors on page invalidtion for block size smaller than page size filesystems. These have been caused by dirty buffers beyond EOF on a partial page which do not get written to disk during a sync. The issue is that write-ahead in xfs_cluster_write() finds such a partial page and handles it by leaving the page dirty but pushing it into a writeback state. This used to work just fine, as the write_cache_pages() code would then find the dirty partial page in the next mapping tree lookup as the dirty tag is still set. Unfortunately, when we moved to a mark and sweep approach to writeback to fix other writeback sync issues, we broken this. THe act of marking the page as under writeback now clears the TOWRITE tag in the radix tree, even though the page is still dirty. This causes the TOWRITE tag to be cleared, and hence the next lookup on the mapping tree does not find the dirty partial page and so doesn't try to write it again. This same writeback bug was found recently in ext4 and fixed in commit 1c8349a ("ext4: fix data integrity sync in ordered mode") without communication to the wider filesystem community. We can use exactly the same fix here so the TOWRITE flag is not cleared on partial page writes. cc: stable@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e Root-cause-found-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23Merge branch 'rcu/next' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull the v3.18 RCU changes from Paul E. McKenney: " * Update RCU documentation. These were posted to LKML at https://lkml.org/lkml/2014/8/28/378. * Miscellaneous fixes. These were posted to LKML at https://lkml.org/lkml/2014/8/28/386. An additional fix that eliminates a documented (but now inconvenient) deadlock between RCU hotplug and expedited grace periods was posted at https://lkml.org/lkml/2014/8/28/573. * Changes related to No-CBs CPUs and NO_HZ_FULL. These were posted to LKML at https://lkml.org/lkml/2014/8/28/412. * Torture-test updates. These were posted to LKML at https://lkml.org/lkml/2014/8/28/546 and at https://lkml.org/lkml/2014/9/11/1114. * RCU-tasks implementation. These were posted to LKML at https://lkml.org/lkml/2014/8/28/540. " Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-22Merge tag 'fscache-fixes-20140917' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull fs-cache fixes from David Howells: - Put a timeout in releasepage() to deal with a recursive hang between the memory allocator, writeback, ext4 and fscache under memory pressure. - Fix a pair of refcount bugs in the fscache error handling. - Remove a couple of unused pagevecs. - The cachefiles requirement that the base directory support rename should permit rename2 as an alternative - otherwise certain filesystems cannot now be used as backing stores (such as ext4). * tag 'fscache-fixes-20140917' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: CacheFiles: Handle rename2 cachefiles: remove two unused pagevecs. FS-Cache: refcount becomes corrupt under vma pressure. FS-Cache: Reduce cookie ref count if submit fails. FS-Cache: Timeout for releasepage()
2014-09-22Btrfs: try not to ENOSPC on log replayJosef Bacik
When doing log replay we may have to update inodes, which traditionally goes through our delayed inode stuff. This will try to move space over from the trans handle, but we don't reserve space in our trans handle on replay since we don't know how much we will need, so instead we try to flush. But because we have a trans handle open we won't flush anything, so if we are out of reserve space we will simply return ENOSPC. Since we know that if an operation made it into the log then we definitely had space before the box bought the farm then we don't need to worry about doing this space reservation. Use the fs_info->log_root_recovering flag to skip the delayed inode stuff and update the item directly. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22Btrfs: don't do async reclaim during log replayJosef Bacik
Trying to reproduce a log enospc bug I hit a panic in the async reclaim code during log replay. This is because we use fs_info->fs_root as our root for shrinking and such. Technically we can use whatever root we want, but let's just not allow async reclaim while we're doing log replay. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22Btrfs: remove empty block groups automaticallyJosef Bacik
One problem that has plagued us is that a user will use up all of his space with data, remove a bunch of that data, and then try to create a bunch of small files and run out of space. This happens because all the chunks were allocated for data since the metadata requirements were so low. But now there's a bunch of empty data block groups and not enough metadata space to do anything. This patch solves this problem by automatically deleting empty block groups. If we notice the used count go down to 0 when deleting or on mount notice that a block group has a used count of 0 then we will queue it to be deleted. When the cleaner thread runs we will double check to make sure the block group is still empty and then we will delete it. This patch has the side effect of no longer having a bunch of BUG_ON()'s in the chunk delete code, which will be helpful for both this and relocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22Merge branch 'for-linus' into for-3.18/coreJens Axboe
Moving patches from for-linus to 3.18 instead, pull in this changes that will go to Linus today.
2014-09-22Fix nasty 32-bit overflow bug in buffer i/o code.Anton Altaparmakov
On 32-bit architectures, the legacy buffer_head functions are not always handling the sector number with the proper 64-bit types, and will thus fail on 4TB+ disks. Any code that uses __getblk() (and thus bread(), breadahead(), sb_bread(), sb_breadahead(), sb_getblk()), and calls it using a 64-bit block on a 32-bit arch (where "long" is 32-bit) causes an inifinite loop in __getblk_slow() with an infinite stream of errors logged to dmesg like this: __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648 b_state=0x00000020, b_size=512 device sda1 blocksize: 512 Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988 i.e. the top 32-bits are missing (in this case the 0x1 at the top). This is because grow_dev_page() is broken and has a 32-bit overflow due to shifting the page index value (a pgoff_t - which is just 32 bits on 32-bit architectures) left-shifted as the block number. But the top bits to get lost as the pgoff_t is not type cast to sector_t / 64-bit before the shift. This patch fixes this issue by type casting "index" to sector_t before doing the left shift. Note this is not a theoretical bug but has been seen in the field on a 4TiB hard drive with logical sector size 512 bytes. This patch has been verified to fix the infinite loop problem on 3.17-rc5 kernel using a 4TB disk image mounted using "-o loop". Without this patch doing a "find /nt" where /nt is an NTFS volume causes the inifinite loop 100% reproducibly whilst with the patch it works fine as expected. Signed-off-by: Anton Altaparmakov <aia21@cantab.net> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-21pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripeTrond Myklebust
kbuild test robot reports: fs/built-in.o: In function `bl_map_stripe': >> :(.text+0x965b4): undefined reference to `__aeabi_uldivmod' >> :(.text+0x965cc): undefined reference to `__aeabi_uldivmod' >> :(.text+0x96604): undefined reference to `__aeabi_uldivmod' Fixes: 5c83746a0cf2 (pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing) Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I've got a revert to fix a regression with btrfs device registration, and Filipe has part two of his fsync fix from last week" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Revert "Btrfs: device_list_add() should not update list when mounted" Btrfs: set inode's logged_trans/last_log_commit after ranged fsync
2014-09-19Merge tag 'nfs-for-3.17-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: "Highligts: - fix an Oops in nfs4_open_and_get_state - fix an Oops in the nfs4_state_manager - fix another bug in the close/open_downgrade code" * tag 'nfs-for-3.17-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Fix another bug in the close/open_downgrade code NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists() NFS: remove BUG possibility in nfs4_open_and_get_state
2014-09-19UBIFS: Remove bogus assertRichard Weinberger
This assertion was only correct before UBIFS had xattr support. Now with xattr support also a directory node can carry data and can act as host node. Suggested-by: Artem Bityutskiy <dedekind1@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-09-19Btrfs: fix data corruption after fast fsync and writeback errorFilipe Manana
When we do a fast fsync, we start all ordered operations and then while they're running in parallel we visit the list of modified extent maps and construct their matching file extent items and write them to the log btree. After that, in btrfs_sync_log() we wait for all the ordered operations to finish (via btrfs_wait_logged_extents). The problem with this is that we were completely ignoring errors that can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM or -EIO errors for example. When such error happens, it means we have parts of the on disk extent that weren't written to, and so we end up logging file extent items that point to these extents that contain garbage/random data - so after a crash/reboot plus log replay, we get our inode's metadata pointing to those extents. This worked in contrast with the full (non-fast) fsync path, where we start all ordered operations, wait for them to finish and then write to the log btree. In this path, after each ordered operation completes we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return -EIO if so (via btrfs_wait_ordered_range). So if an error happens with any ordered operation, just return a -EIO error to userspace, so that it knows that not all of its previous writes were durably persisted and the application can take proper action (like redo the writes for e.g.) - and definitely not leave any file extent items in the log refer to non fully written extents. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-19Btrfs: fix fsync race leading to invalid data after log replayFilipe Manana
When the fsync callback (btrfs_sync_file) starts, it first waits for the writeback of any dirty pages to start and finish without holding the inode's mutex (to reduce contention). After this it acquires the inode's mutex and repeats that process via btrfs_wait_ordered_range only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag is set on the inode). This is not safe for a non full sync - we need to start and wait for writeback to finish for any pages that might have been made dirty before acquiring the inode's mutex and after that first step mentioned before. Why this is needed is explained by the following comment added to btrfs_sync_file: "Right before acquiring the inode's mutex, we might have new writes dirtying pages, which won't immediately start the respective ordered operations - that is done through the fill_delalloc callbacks invoked from the writepage and writepages address space operations. So make sure we start all ordered operations before starting to log our inode. Not doing this means that while logging the inode, writeback could start and invoke writepage/writepages, which would call the fill_delalloc callbacks (cow_file_range, submit_compressed_extents). These callbacks add first an extent map to the modified list of extents and then create the respective ordered operation, which means in tree-log.c:btrfs_log_inode() we might capture all existing ordered operations (with btrfs_get_logged_extents()) before the fill_delalloc callback adds its ordered operation, and by the time we visit the modified list of extent maps (with btrfs_log_changed_extents()), we see and process the extent map they created. We then use the extent map to construct a file extent item for logging without waiting for the respective ordered operation to finish - this file extent item points to a disk location that might not have yet been written to, containing random data - so after a crash a log replay will make our inode have file extent items that point to disk locations containing invalid data, as we returned success to userspace without waiting for the respective ordered operation to finish, because it wasn't captured by btrfs_get_logged_extents()." Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-19sched, cleanup, treewide: Remove set_current_state(TASK_RUNNING) after ↵Kirill Tkhai
schedule() schedule(), io_schedule() and schedule_timeout() always return with TASK_RUNNING state set, so one more setting is unnecessary. (All places in patch are visible good, only exception is kiblnd_scheduler() from: drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c Its schedule() is one line above standard 3 lines of unified diff) No places where set_current_state() is used for mb(). Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai Cc: Alasdair Kergon <agk@redhat.com> Cc: Anil Belur <askb23@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: David Airlie <airlied@linux.ie> Cc: David Howells <dhowells@redhat.com> Cc: Dmitry Eremin <dmitry.eremin@intel.com> Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Isaac Huang <he.huang@intel.com> Cc: James E.J. Bottomley <JBottomley@parallels.com> Cc: James E.J. Bottomley <jejb@parisc-linux.org> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Liang Zhen <liang.zhen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Masaru Nomura <massa.nomura@gmail.com> Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleg Drokin <green@linuxhacker.ru> Cc: Peng Tao <bergwolf@gmail.com> Cc: Richard Weinberger <richard@nod.at> Cc: Robert Love <robert.w.love@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Ursula Braun <ursula.braun@de.ibm.com> Cc: Zi Shen Lim <zlim.lnx@gmail.com> Cc: devel@driverdev.osuosl.org Cc: dm-devel@redhat.com Cc: dri-devel@lists.freedesktop.org Cc: fcoe-devel@open-fcoe.org Cc: jfs-discussion@lists.sourceforge.net Cc: linux390@de.ibm.com Cc: linux-afs@lists.infradead.org Cc: linux-cris-kernel@axis.com Cc: linux-kernel@vger.kernel.org Cc: linux-nfs@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linux-raid@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: qla2xxx-upstream@qlogic.com Cc: user-mode-linux-devel@lists.sourceforge.net Cc: user-mode-linux-user@lists.sourceforge.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19GFS2: fix bad inode i_goal values during block allocationAbhi Das
This patch checks if i_goal is either zero or if doesn't exist within any rgrp (i.e gfs2_blk2rgrpd() returns NULL). If so, it assigns the ip->i_no_addr block as the i_goal. There are two scenarios where a bad i_goal can result in a -EBADSLT error. 1. Attempting to allocate to an existing inode: Control reaches gfs2_inplace_reserve() and ip->i_goal is bad. We need to fix i_goal here. 2. A new inode is created in a directory whose i_goal is hosed: In this case, the parent dir's i_goal is copied onto the new inode. Since the new inode is not yet created, the ip->i_no_addr field is invalid and so, the fix in gfs2_inplace_reserve() as per 1) won't work in this scenario. We need to catch and fix it sooner in the parent dir itself (gfs2_create_inode()), before it is copied to the new inode. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-09-18ext4: fold ext4_nojournal_sops into ext4_sopsTheodore Ts'o
There's no longer any need to have a separate set of super_operations for nojournal mode. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18ext4: support freezing ext2 (nojournal) file systemsTheodore Ts'o
Through an oversight, when we added nojournal support to ext4, we didn't add support to allow file system freezing. This is relatively easy to add, so let's do it. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Dexuan Cui <decui@microsoft.com>
2014-09-18ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()Theodore Ts'o
This allows us to eliminate duplicate code, and eventually allow us to also fold ext4_sops and ext4_nojournal_sops together. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs/smb3 fixes from Steve French: "Fixes for problems found during testing and debugging at the SMB3 storage test event (plugfest) this week" * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: Fix mfsymlinks file size check Update version number displayed by modinfo for cifs.ko cifs: remove dead code Revert "cifs: No need to send SIGKILL to demux_thread during umount" [SMB3] Fix oops when creating symlinks on smb3 [CIFS] Fix setting time before epoch (negative time values)
2014-09-18cpuset: simplify proc_cpuset_show()Zefan Li
Use the ONE macro instead of REG, and we can simplify proc_cpuset_show(). Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18cgroup: simplify proc_cgroup_show()Zefan Li
Use the ONE macro instead of REG, and we can simplify proc_cgroup_show(). Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18NFSv4: Fix another bug in the close/open_downgrade codeTrond Myklebust
James Drew reports another bug whereby the NFS client is now sending an OPEN_DOWNGRADE in a situation where it should really have sent a CLOSE: the client is opening the file for O_RDWR, but then trying to do a downgrade to O_RDONLY, which is not allowed by the NFSv4 spec. Reported-by: James Drews <drews@engr.wisc.edu> Link: http://lkml.kernel.org/r/541AD7E5.8020409@engr.wisc.edu Fixes: aee7af356e15 (NFSv4: Fix problems with close in the presence...) Cc: stable@vger.kernel.org # 2.6.33+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-18NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists()Steve Dickson
There is a race between nfs4_state_manager() and nfs_server_remove_lists() that happens during a nfsv3 mount. The v3 mount notices there is already a supper block so nfs_server_remove_lists() called which uses the nfs_client_lock spin lock to synchronize access to the client list. At the same time nfs4_state_manager() is running through the client list looking for work to do, using the same lock. When nfs4_state_manager() wins the race to the list, a v3 client pointer is found and not ignored properly which causes the panic. Moving some protocol checks before the state checking avoids the panic. CC: Stable Tree <stable@vger.kernel.org> Signed-off-by: Steve Dickson <steved@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-18Revert "Btrfs: device_list_add() should not update list when mounted"Chris Mason
This reverts commit b96de000bc8bc9688b3a2abea4332bd57648a49f. This commit is triggering failures to mount by subvolume id in some configurations. The main problem is how many different ways this scanning function is used, both for scanning while mounted and unmounted. A proper cleanup is too big for late rcs. For now, just revert the commit and we'll put a better fix into a later merge window. Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent mapQu Wenruo
The following commit enhanced the merge_extent_mapping() to reduce fragment in extent map tree, but it can't handle case which existing lies before map_start: 51f39 btrfs: Use right extent length when inserting overlap extent map. [BUG] When existing extent map's start is before map_start, the em->len will be minus, which will corrupt the extent map and fail to insert the new extent map. This will happen when someone get a large extent map, but when it is going to insert it into extent map tree, some one has already commit some write and split the huge extent into small parts. [REPRODUCER] It is very easy to tiger using filebench with randomrw personality. It is about 100% to reproduce when using 8G preallocated file in 60s randonrw test. [FIX] This patch can now handle any existing extent position. Since it does not directly use existing->start, now it will find the previous and next extent around map_start. So the old existing->start < map_start bug will never happen again. [ENHANCE] This patch will insert the best fitted extent map into extent map tree, other than the oldest [map_start, map_start + sectorsize) or the relatively newer but not perfect [map_start, existing->start). The patch will first search existing extent that does not intersects with the desired map range [map_start, map_start + len). The existing extent will be either before or behind map_start, and based on the existing extent, we can find out the previous and next extent around map_start. So the best fitted extent would be [prev->end, next->start). For prev or next is not found, em->start would be prev->end and em->end wold be next->start. With this patch, the fragment in extent map tree should be reduced much more than the 51f39 commit and reduce an unneeded extent map tree search. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18ext4: don't check quota format when there are no quota filesJan Kara
The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_listJan Kara
__jbd2_journal_clean_checkpoint_list() returns number of buffers it freed but noone was using the value so just stop doing that. This also allows for simplifying the calling convention for journal_clean_once_cp_list(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18jbd2: avoid pointless scanning of checkpoint listsJan Kara
Yuanhan has reported that when he is running fsync(2) heavy workload creating new files over ramdisk, significant amount of time is spent in __jbd2_journal_clean_checkpoint_list() trying to clean old transactions (but they cannot be cleaned up because flusher hasn't yet checkpointed those buffers). The workload can be generated by: fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096 Reduce the amount of scanning by stopping to scan the transaction list once we find a transaction that cannot be checkpointed. Note that this way of cleaning is still enough to keep freeing space in the journal after fully checkpointed transactions. Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-17CacheFiles: Handle rename2David Howells
Not all filesystems now provide the rename i_op - ext4 for one - but rather provide the rename2 i_op. CacheFiles checks that the filesystem has rename and so will reject ext4 now with EPERM: CacheFiles: Failed to register: -1 Fix this by checking for rename2 as an alternative. The call to vfs_rename() actually handles selection of the appropriate function, so we needn't worry about that. Turning on debugging shows: [cachef] ==> cachefiles_get_directory(,,cache) [cachef] subdir -> ffff88000b22b778 positive [cachef] <== cachefiles_get_directory() = -1 [check] where -1 is EPERM. Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17cachefiles: remove two unused pagevecs.NeilBrown
These two have been unused since commit c4d6d8dbf335c7fa47341654a37c53a512b519bb CacheFiles: Fix the marking of cached pages in 3.8. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17FS-Cache: refcount becomes corrupt under vma pressure.Milosz Tanski
In rare cases under heavy VMA pressure the ref count for a fscache cookie becomes corrupt. In this case we decrement ref count even if we fail before incrementing the refcount. FS-Cache: Assertion failed bnode-eca5f9c6/syslog 0 > 0 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/cookie.c:519! invalid opcode: 0000 [#1] SMP Call Trace: [<ffffffffa01ba060>] __fscache_relinquish_cookie+0x50/0x220 [fscache] [<ffffffffa02d64ce>] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph] [<ffffffffa02ae1d3>] ceph_destroy_inode+0x33/0x200 [ceph] [<ffffffff811cf67e>] ? __fsnotify_inode_delete+0xe/0x10 [<ffffffff811a9e0c>] destroy_inode+0x3c/0x70 [<ffffffff811a9f51>] evict+0x111/0x180 [<ffffffff811aa763>] iput+0x103/0x190 [<ffffffff811a5de8>] __dentry_kill+0x1c8/0x220 [<ffffffff811a5f31>] shrink_dentry_list+0xf1/0x250 [<ffffffff811a762c>] prune_dcache_sb+0x4c/0x60 [<ffffffff811930af>] super_cache_scan+0xff/0x170 [<ffffffff8113d7a0>] shrink_slab_node+0x140/0x2c0 [<ffffffff8113f2da>] shrink_slab+0x8a/0x130 [<ffffffff81142572>] balance_pgdat+0x3e2/0x5d0 [<ffffffff811428ca>] kswapd+0x16a/0x4a0 [<ffffffff810a43f0>] ? __wake_up_sync+0x20/0x20 [<ffffffff81142760>] ? balance_pgdat+0x5d0/0x5d0 [<ffffffff81083e09>] kthread+0xc9/0xe0 [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90 [<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0 [<ffffffff8159f63c>] ret_from_fork+0x7c/0xb0 [<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0 RIP [<ffffffffa01b984b>] __fscache_disable_cookie+0x1db/0x210 [fscache] RSP <ffff8803bc85f9b8> ---[ end trace 254d0d7c74a01f25 ]--- Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17Btrfs: fix up bounds checking in lseekLiu Bo
An user reported this, it is because that lseek's SEEK_SET/SEEK_CUR/SEEK_END allow a negative value for @offset, but btrfs's SEEK_DATA/SEEK_HOLE don't prepare for that and convert the negative @offset into unsigned type, so we get (end < start) warning. [ 1269.835374] ------------[ cut here ]------------ [ 1269.836809] WARNING: CPU: 0 PID: 1241 at fs/btrfs/extent_io.c:430 insert_state+0x11d/0x140() [ 1269.838816] BTRFS: end < start 4094 18446744073709551615 [ 1269.840334] CPU: 0 PID: 1241 Comm: a.out Tainted: G W 3.16.0+ #306 [ 1269.858229] Call Trace: [ 1269.858612] [<ffffffff81801a69>] dump_stack+0x4e/0x68 [ 1269.858952] [<ffffffff8107894c>] warn_slowpath_common+0x8c/0xc0 [ 1269.859416] [<ffffffff81078a36>] warn_slowpath_fmt+0x46/0x50 [ 1269.859929] [<ffffffff813b0fbd>] insert_state+0x11d/0x140 [ 1269.860409] [<ffffffff813b1396>] __set_extent_bit+0x3b6/0x4e0 [ 1269.860805] [<ffffffff813b21c7>] lock_extent_bits+0x87/0x200 [ 1269.861697] [<ffffffff813a5b28>] btrfs_file_llseek+0x148/0x2a0 [ 1269.862168] [<ffffffff811f201e>] SyS_lseek+0xae/0xc0 [ 1269.862620] [<ffffffff8180b212>] system_call_fastpath+0x16/0x1b [ 1269.862970] ---[ end trace 4d33ea885832054b ]--- This assumes that btrfs starts finding DATA/HOLE from the beginning of file if the assigned @offset is negative. Also we add alignment for lock_extent_bits 's range. Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17Btrfs: cleanup the read failure record after write or when the inode is freeingMiao Xie
After the data is written successfully, we should cleanup the read failure record in that range because - If we set data COW for the file, the range that the failure record pointed to is mapped to a new place, so it is invalid. - If we set no data COW for the file, and if there is no error during writting, the corrupted data is corrected, so the failure record can be removed. And if some errors happen on the mirrors, we also needn't worry about it because the failure record will be recreated if we read the same place again. Sometimes, we may fail to correct the data, so the failure records will be left in the tree, we need free them when we free the inode or the memory leak happens. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17Btrfs: implement repair function when direct read failsMiao Xie
This patch implement data repair function when direct read fails. The detail of the implementation is: - When we find the data is not right, we try to read the data from the other mirror. - When the io on the mirror ends, we will insert the endio work into the dedicated btrfs workqueue, not common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue, if we insert the endio work of the io on the mirror into that workqueue, deadlock would happen. - After we get right data, we write it back to the corrupted mirror. - And if the data on the new mirror is still corrupted, we will try next mirror until we read right data or all the mirrors are traversed. - After the above work, we set the uptodate flag according to the result. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17Btrfs: Set real mirror number for read operation on RAID0/5/6Miao Xie
We need real mirror number for RAID0/5/6 when reading data, or if read error happens, we would pass 0 as the number of the mirror on which the io error happens. It is wrong and would cause the filesystem read the data from the corrupted mirror again. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17Btrfs: modify clean_io_failure and make it suit direct ioMiao Xie
We could not use clean_io_failure in the direct IO path because it got the filesystem information from the page structure, but the page in the direct IO bio didn't have the filesystem information in its structure. So we need modify it and pass all the information it need by parameters. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17Btrfs: modify repair_io_failure and make it suit direct ioMiao Xie
The original code of repair_io_failure was just used for buffered read, because it got some filesystem data from page structure, it is safe for the page in the page cache. But when we do a direct read, the pages in bio are not in the page cache, that is there is no filesystem data in the page structure. In order to implement direct read data repair, we need modify repair_io_failure and pass all filesystem data it need by function parameters. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17Btrfs: split bio_readpage_error into several functionsMiao Xie
The data repair function of direct read will be implemented later, and some code in bio_readpage_error will be reused, so split bio_readpage_error into several functions which will be used in direct read repair later. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17Btrfs: Cleanup unused variant and argument of IO failure handlersMiao Xie
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17Btrfs: fix missing error handler if submiting re-read bio failsMiao Xie
We forgot to free failure record and bio after submitting re-read bio failed, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17Btrfs: do file data check by sub-bio's selfMiao Xie
Direct IO splits the original bio to several sub-bios because of the limit of raid stripe, and the filesystem will wait for all sub-bios and then run final end io process. But it was very hard to implement the data repair when dio read failure happens, because at the final end io function, we didn't know which mirror the data was read from. So in order to implement the data repair, we have to move the file data check in the final end io function to the sub-bio end io function, in which we can get the mirror number of the device we access. This patch did this work as the first step of the direct io data repair implementation. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17Btrfs: cleanup similar code of the buffered data data check and dio read ↵Miao Xie
data check Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>