summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2013-05-24xfs: Don't reference the EFI after it is freedDave Chinner
Checking the EFI for whether it is being released from recovery after we've already released the known active reference is a mistake worthy of a brown paper bag. Fix the (now) obvious use after free that it can cause. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 52c24ad39ff02d7bd73c92eb0c926fb44984a41d)
2013-05-24xfs: fix rounding in xfs_free_file_spaceDave Chinner
The offset passed into xfs_free_file_space() needs to be rounded down to a certain size, but the rounding mask is built by a 32 bit variable. Hence the mask will always mask off the upper 32 bits of the offset and lead to incorrect writeback and invalidation ranges. This is not actually exposed as a bug because we writeback and invalidate from the rounded offset to the end of the file, and hence the offset we are actually punching a hole out of will always be covered by the code. This needs fixing, however, if we ever want to use exact ranges for writeback/invalidation here... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 28ca489c63e9aceed8801d2f82d731b3c9aa50f5)
2013-05-24xfs: fix sub-page blocksize data integrity writesDave Chinner
FSX on 512 byte block size filesystems has been failing for some time with corrupted data. The fault dates back to the change in the writeback data integrity algorithm that uses a mark-and-sweep approach to avoid data writeback livelocks. Unfortunately, a side effect of this mark-and-sweep approach is that each page will only be written once for a data integrity sync, and there is a condition in writeback in XFS where a page may require two writeback attempts to be fully written. As a result of the high level change, we now only get a partial page writeback during the integrity sync because the first pass through writeback clears the mark left on the page index to tell writeback that the page needs writeback.... The cause is writing a partial page in the clustering code. This can happen when a mapping boundary falls in the middle of a page - we end up writing back the first part of the page that the mapping covers, but then never revisit the page to have the remainder mapped and written. The fix is simple - if the mapping boundary falls inside a page, then simple abort clustering without touching the page. This means that the next ->writepage entry that write_cache_pages() will make is the page we aborted on, and xfs_vm_writepage() will map all sections of the page correctly. This behaviour is also optimal for non-data integrity writes, as it results in contiguous sequential writeback of the file rather than missing small holes and having to write them a "random" writes in a future pass. With this fix, all the fsx tests in xfstests now pass on a 512 byte block size filesystem on a 4k page machine. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 49b137cbbcc836ef231866c137d24f42c42bb483)
2013-05-24fs/jfs: Add check if journaling to disk has been disabled in lbmRead()Gu Zheng
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-05-24jfs: Several bugs in jfs_freeze() and jfs_unfreeze()Vahram Martirosyan
The mentioned functions do not pay attention to the error codes returned by the functions updateSuper(), lmLogInit() and lmLogShutdown(). It brings to system crash later when writing to log. The patch adds corresponding code to check and return the error codes and to print correct error messages in case of errors. Found by Linux File System Verification project (linuxtesting.org). Signed-off-by: Vahram Martirosyan <vahram.martirosyan@linuxtesting.org> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-05-24cifs: fix composing of mount options for DFS referralsJeff Layton
With the change to ignore the unc= and prefixpath= mount options, there is no longer any need to add them to the options string when mounting. By the same token, we now need to build a device name that includes the prefixpath when mounting. To make things neater, the delimiters on the devicename are changed to '/' since that's preferred when mounting anyway. v2: fix some comments and don't bother looking at whether there is a prepath in the ref->node_name when deciding whether to pass a prepath to cifs_build_devname. v3: rebase on top of potential buffer overrun fix for stable Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24cifs: stop printing the unc= option in /proc/mountsJeff Layton
Since we no longer recognize that option, stop printing it out. The devicename is now the canonical source for this info. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24cifs: fix error handling when calling cifs_parse_devnameJeff Layton
When we allowed separate unc= and prefixpath= mount options, we could ignore EINVAL errors from cifs_parse_devname. Now that they are deprecated, we need to check for that as well and fail the mount if it's malformed. Also fix a later error message that refers to the unc= option. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24cifs: allow sec=none mounts to work against servers that don't support ↵Jeff Layton
extended security In the case of sec=none, we're not sending a username or password, so there's little benefit to mandating NTLMSSP auth. Allow it to use unencapsulated auth in that case. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24cifs: fix potential buffer overrun when composing a new options stringJeff Layton
Consider the case where we have a very short ip= string in the original mount options, and when we chase a referral we end up with a very long IPv6 address. Be sure to allow for that possibility when estimating the size of the string to allocate. Cc: <stable@vger.kernel.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24cifs: only set ops for inodes in I_NEW stateJeff Layton
It's generally not safe to reset the inode ops once they've been set. In the case where the inode was originally thought to be a directory and then later found to be a DFS referral, this can lead to an oops when we try to trigger an inode op on it after changing the ops to the blank referral operations. Cc: <stable@vger.kernel.org> Reported-and-Tested-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS fix from Steve French: "One cifs fix to merge now - fixes possible DFS oops (I expect to request a merge of 4 additional cifs fixes next week)" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: cifs: only set ops for inodes in I_NEW state
2013-05-24GFS2: Fix typo in gfs2_log_end_write loopSteven Whitehouse
There was a missing _all in this loop iterator Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-05-24GFS2: fix DLM depends to fix build errorsRandy Dunlap
Fix build errors by correcting DLM dependencies in GFS2. Build errors happen when CONFIG_GFS2_FS_LOCKING_DLM=y and CONFIG_DLM=m: fs/built-in.o: In function `gfs2_lock': file.c:(.text+0xc7abd): undefined reference to `dlm_posix_get' file.c:(.text+0xc7ad0): undefined reference to `dlm_posix_unlock' file.c:(.text+0xc7ad9): undefined reference to `dlm_posix_lock' fs/built-in.o: In function `gdlm_unmount': lock_dlm.c:(.text+0xd6e5b): undefined reference to `dlm_release_lockspace' fs/built-in.o: In function `sync_unlock': lock_dlm.c:(.text+0xd6e9e): undefined reference to `dlm_unlock' fs/built-in.o: In function `sync_lock': lock_dlm.c:(.text+0xd6fb6): undefined reference to `dlm_lock' fs/built-in.o: In function `gdlm_put_lock': lock_dlm.c:(.text+0xd7238): undefined reference to `dlm_unlock' fs/built-in.o: In function `gdlm_mount': lock_dlm.c:(.text+0xd753e): undefined reference to `dlm_new_lockspace' lock_dlm.c:(.text+0xd79d3): undefined reference to `dlm_release_lockspace' fs/built-in.o: In function `gdlm_lock': lock_dlm.c:(.text+0xd8179): undefined reference to `dlm_lock' fs/built-in.o: In function `gdlm_cancel': lock_dlm.c:(.text+0xd6b22): undefined reference to `dlm_unlock' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-05-24GFS2: Use single-block reservations for directoriesBob Peterson
This patch changes the multi-block allocation code, such that directory inodes only get a single block reserved in the bitmap. That way, the bitmaps are more tightly packed together, and there are fewer spans of free blocks for in-use block reservations. This means it takes less time to find a free span of blocks in the bitmap, which speeds things up. This increases the performance of some workloads by almost 2X. In Nate's mockup.py script (which does (1) create dir, (2) create dir in dir, (3) create file in that dir) the test executes in 23 steps rather than 43 steps, a 47% performance improvement. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-05-24GFS2: two minor quota fixupsBob Peterson
This patch fixes two regression problems that Abhi found in the GFS2 quota code. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-05-23xfs: rework remote attr CRCsDave Chinner
Note: this changes the on-disk remote attribute format. I assert that this is OK to do as CRCs are marked experimental and the first kernel it is included in has not yet reached release yet. Further, the userspace utilities are still evolving and so anyone using this stuff right now is a developer or tester using volatile filesystems for testing this feature. Hence changing the format right now to save longer term pain is the right thing to do. The fundamental change is to move from a header per extent in the attribute to a header per filesytem block in the attribute. This means there are more header blocks and the parsing of the attribute data is slightly more complex, but it has the advantage that we always know the size of the attribute on disk based on the length of the data it contains. This is where the header-per-extent method has problems. We don't know the size of the attribute on disk without first knowing how many extents are used to hold it. And we can't tell from a mapping lookup, either, because remote attributes can be allocated contiguously with other attribute blocks and so there is no obvious way of determining the actual size of the atribute on disk short of walking and mapping buffers. The problem with this approach is that if we map a buffer incorrectly (e.g. we make the last buffer for the attribute data too long), we then get buffer cache lookup failure when we map it correctly. i.e. we get a size mismatch on lookup. This is not necessarily fatal, but it's a cache coherency problem that can lead to returning the wrong data to userspace or writing the wrong data to disk. And debug kernels will assert fail if this occurs. I found lots of niggly little problems trying to fix this issue on a 4k block size filesystem, finally getting it to pass with lots of fixes. The thing is, 1024 byte filesystems still failed, and it was getting really complex handling all the corner cases that were showing up. And there were clearly more that I hadn't found yet. It is complex, fragile code, and if we don't fix it now, it will be complex, fragile code forever more. Hence the simple fix is to add a header to each filesystem block. This gives us the same relationship between the attribute data length and the number of blocks on disk as we have without CRCs - it's a linear mapping and doesn't require us to guess anything. It is simple to implement, too - the remote block count calculated at lookup time can be used by the remote attribute set/get/remove code without modification for both CRC and non-CRC filesystems. The world becomes sane again. Because the copy-in and copy-out now need to iterate over each filesystem block, I moved them into helper functions so we separate the block mapping and buffer manupulations from the attribute data and CRC header manipulations. The code becomes much clearer as a result, and it is a lot easier to understand and debug. It also appears to be much more robust - once it worked on 4k block size filesystems, it has worked without failure on 1k block size filesystems, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23xfs: fully initialise temp leaf in xfs_attr3_leaf_compactDave Chinner
xfs_attr3_leaf_compact() uses a temporary buffer for compacting the the entries in a leaf. It copies the the original buffer into the temporary buffer, then zeros the original buffer completely. It then copies the entries back into the original buffer. However, the original buffer has not been correctly initialised, and so the movement of the entries goes horribly wrong. Make sure the zeroed destination buffer is fully initialised, and once we've set up the destination incore header appropriately, write is back to the buffer before starting to move entries around. While debugging this, the _d/_s prefixes weren't sufficient to remind me what buffer was what, so rename then all _src/_dst. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalanceDave Chinner
xfs_attr3_leaf_unbalance() uses a temporary buffer for recombining the entries in two leaves when the destination leaf requires compaction. The temporary buffer ends up being copied back over the original destination buffer, so the header in the temporary buffer needs to contain all the information that is in the destination buffer. To make sure the temporary buffer is fully initialised, once we've set up the temporary incore header appropriately, write is back to the temporary buffer before starting to move entries around. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23NFS: Fix SETCLIENTID fallback if GSS is not availableChuck Lever
Commit 79d852bf "NFS: Retry SETCLIENTID with AUTH_SYS instead of AUTH_NONE" did not take into account commit 23631227 "NFSv4: Fix the fallback to AUTH_NULL if krb5i is not available". Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-05-23xfs: correctly map remote attr buffers during removalDave Chinner
If we don't map the buffers correctly (same as for get/set operations) then the incore buffer lookup will fail. If a block number matches but a length is wrong, then debug kernels will ASSERT fail in _xfs_buf_find() due to the length mismatch. Ensure that we map the buffers correctly by basing the length of the buffer on the attribute data length rather than the remote block count. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23xfs: remote attribute tail zeroing does too muchDave Chinner
When an attribute data does not fill then entire remote block, we zero the remaining part of the buffer. This, however, needs to take into account that the buffer has a header, and so the offset where zeroing starts and the length of zeroing need to take this into account. Otherwise we end up with zeros over the end of the attribute value when CRCs are enabled. While there, make sure we only ask to map an extent that covers the remaining range of the attribute, rather than asking every time for the full length of remote data. If the remote attribute blocks are contiguous with other parts of the attribute tree, it will map those blocks as well and we can potentially zero them incorrectly. We can also get buffer size mistmatches when trying to read or remove the remote attribute, and this can lead to not finding the correct buffer when looking it up in cache. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23xfs: remote attribute read too shortDave Chinner
Reading a maximally size remote attribute fails when CRCs are enabled with this verification error: XFS (vdb): remote attribute header does not match required off/len/owner) There are two reasons for this, the first being that the length of the buffer being read is determined from the args->rmtblkcnt which doesn't take into account CRC headers. Hence the mapped length ends up being too short and so we need to calculate it directly from the value length. The second is that the byte count of valid data within a buffer is capped by the length of the data and so doesn't take into account that the buffer might be longer due to headers. Hence we need to calculate the data space in the buffer first before calculating the actual byte count of data. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-22Merge tag 'efi-urgent' into x86/urgentH. Peter Anvin
* Avoid confusing the user by returning -EIO instead of -ENOENT in efivarfs if an EFI variable gets deleted from under us and return EOF when reading from a zero-length file - Lingzhu Xiang * Fix an oops in efivar_update_sysfs_entries() caused by reusing (and therefore corrupting) a kzalloc() allocation - Seiji Aguchi * Initialise the DataSize argument to GetVariable() otherwise it will not be updated with the actual size of the variable on return. Discovered on a Acer Aspire V3 BIOS - Lee, Chun-Yi Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-05-21reiserfs: use ->invalidatepage() length argumentLukas Czerner
->invalidatepage() aop now accepts range to invalidate so we can make use of it in reiserfs_invalidatepage() Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: reiserfs-devel@vger.kernel.org
2013-05-21gfs2: use ->invalidatepage() length argumentLukas Czerner
->invalidatepage() aop now accepts range to invalidate so we can make use of it in gfs2_invalidatepage(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Cc: cluster-devel@redhat.com
2013-05-21ceph: use ->invalidatepage() length argumentLukas Czerner
->invalidatepage() aop now accepts range to invalidate so we can make use of it in ceph_invalidatepage(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Acked-by: Sage Weil <sage@inktank.com> Cc: ceph-devel@vger.kernel.org
2013-05-21ocfs2: use ->invalidatepage() length argumentLukas Czerner
->invalidatepage() aop now accepts range to invalidate so we can make use of it in ocfs2_invalidatepage(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Joel Becker <jlbec@evilplan.org>
2013-05-21xfs: use ->invalidatepage() length argumentLukas Czerner
->invalidatepage() aop now accepts range to invalidate so we can make use of it in xfs_vm_invalidatepage() Signed-off-by: Lukas Czerner <lczerner@redhat.com> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com
2013-05-21jbd: change journal_invalidatepage() to accept lengthLukas Czerner
->invalidatepage() aop now accepts range to invalidate so we can make use of it in journal_invalidatepage() and all the users in ext3 file system. Also update ext3 trace point to print out length argument. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-21ext4: use ->invalidatepage() length argumentLukas Czerner
->invalidatepage() aop now accepts range to invalidate so we can make use of it in all ext4 invalidatepage routines. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-21jbd2: change jbd2_journal_invalidatepage to accept lengthLukas Czerner
invalidatepage now accepts range to invalidate and there are two file system using jbd2 also implementing punch hole feature which can benefit from this. We need to implement the same thing for jbd2 layer in order to allow those file system take benefit of this functionality. This commit adds length argument to the jbd2_journal_invalidatepage() and updates all instances in ext4 and ocfs2. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-21mm: change invalidatepage prototype to accept lengthLukas Czerner
Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com>
2013-05-21xfs: remote attribute allocation may be contiguousDave Chinner
When CRCs are enabled, there may be multiple allocations made if the headers cause a length overflow. This, however, does not mean that the number of headers required increases, as the second and subsequent extents may be contiguous with the previous extent. Hence when we map the extents to write the attribute data, we may end up with less extents than allocations made. Hence the assertion that we consume the number of headers we calculated in the allocation loop is incorrect and needs to be removed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-21xfs: avoid nesting transactions in xfs_qm_scall_setqlim()Dave Chinner
Lockdep reports: ============================================= [ INFO: possible recursive locking detected ] 3.9.0+ #3 Not tainted --------------------------------------------- setquota/28368 is trying to acquire lock: (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50 but task is already holding lock: (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50 from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be allocated. xfs_qm_scall_setqlim() is starting a transaction and then not passing it into xfs_qm_dqet() and so it starts it's own transaction when allocating the dquot. Splat! Fix this by not allocating the dquot in xfs_qm_scall_setqlim() inside the setqlim transaction. This requires getting the dquot first (and allocating it if necessary) then dropping and relocking the dquot before joining it to the setqlim transaction. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-21nfsd: avoid undefined signed overflowJim Rees
In C, signed integer overflow results in undefined behavior, but unsigned overflow wraps around. So do the subtraction first, then cast to signed. Reported-by: Joakim Tjernlund <joakim.tjernlund@transmode.se> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-05-20xfs: remote attribute lookups require the value lengthDave Chinner
When reading a remote attribute, to correctly calculate the length of the data buffer for CRC enable filesystems, we need to know the length of the attribute data. We get this information when we look up the attribute, but we don't store it in the args structure along with the other remote attr information we get from the lookup. Add this information to the args structure so we can use it appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20xfs: xfs_attr_shortform_allfit() does not handle attr3 format.Dave Chinner
xfstests generic/117 fails with: XFS: Assertion failed: leaf->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC) indicating a function that does not handle the attr3 format correctly. Fix it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGICDave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20xfs: fix missing KM_NOFS tags to keep lockdep happyDave Chinner
There are several places where we use KM_SLEEP allocation contexts and use the fact that they are called from transaction context to add KM_NOFS where appropriate. Unfortunately, there are several places where the code makes this assumption but can be called from outside transaction context but with filesystem locks held. These places need explicit KM_NOFS annotations to avoid lockdep complaining about reclaim contexts. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20xfs: Don't reference the EFI after it is freedDave Chinner
Checking the EFI for whether it is being released from recovery after we've already released the known active reference is a mistake worthy of a brown paper bag. Fix the (now) obvious use after free that it can cause. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20xfs: fix rounding in xfs_free_file_spaceDave Chinner
The offset passed into xfs_free_file_space() needs to be rounded down to a certain size, but the rounding mask is built by a 32 bit variable. Hence the mask will always mask off the upper 32 bits of the offset and lead to incorrect writeback and invalidation ranges. This is not actually exposed as a bug because we writeback and invalidate from the rounded offset to the end of the file, and hence the offset we are actually punching a hole out of will always be covered by the code. This needs fixing, however, if we ever want to use exact ranges for writeback/invalidation here... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20xfs: fix sub-page blocksize data integrity writesDave Chinner
FSX on 512 byte block size filesystems has been failing for some time with corrupted data. The fault dates back to the change in the writeback data integrity algorithm that uses a mark-and-sweep approach to avoid data writeback livelocks. Unfortunately, a side effect of this mark-and-sweep approach is that each page will only be written once for a data integrity sync, and there is a condition in writeback in XFS where a page may require two writeback attempts to be fully written. As a result of the high level change, we now only get a partial page writeback during the integrity sync because the first pass through writeback clears the mark left on the page index to tell writeback that the page needs writeback.... The cause is writing a partial page in the clustering code. This can happen when a mapping boundary falls in the middle of a page - we end up writing back the first part of the page that the mapping covers, but then never revisit the page to have the remainder mapped and written. The fix is simple - if the mapping boundary falls inside a page, then simple abort clustering without touching the page. This means that the next ->writepage entry that write_cache_pages() will make is the page we aborted on, and xfs_vm_writepage() will map all sections of the page correctly. This behaviour is also optimal for non-data integrity writes, as it results in contiguous sequential writeback of the file rather than missing small holes and having to write them a "random" writes in a future pass. With this fix, all the fsx tests in xfstests now pass on a 512 byte block size filesystem on a 4k page machine. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20NFSv4.1 Fix a pNFS session draining deadlockAndy Adamson
On a CB_RECALL the callback service thread flushes the inode using filemap_flush prior to scheduling the state manager thread to return the delegation. When pNFS is used and I/O has not yet gone to the data server servicing the inode, a LAYOUTGET can preceed the I/O. Unlike the async filemap_flush call, the LAYOUTGET must proceed to completion. If the state manager starts to recover data while the inode flush is sending the LAYOUTGET, a deadlock occurs as the callback service thread holds the single callback session slot until the flushing is done which blocks the state manager thread, and the state manager thread has set the session draining bit which puts the inode flush LAYOUTGET RPC to sleep on the forechannel slot table waitq. Separate the draining of the back channel from the draining of the fore channel by moving the NFS4_SESSION_DRAINING bit from session scope into the fore and back slot tables. Drain the back channel first allowing the LAYOUTGET call to proceed (and fail) so the callback service thread frees the callback slot. Then proceed with draining the forechannel. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-05-20xfs: Avoid pathological backwards allocationJan Kara
Writing a large file using direct IO in 16 MB chunks sometimes results in a pathological allocation pattern where 16 MB chunks of large free extent are allocated to a file in a reversed order. So extents of a file look for example as: ext logical physical expected length flags 0 0 13 4550656 1 4550656 188136807 4550668 12562432 2 17113088 200699240 200699238 622592 3 17735680 182046055 201321831 4096 4 17739776 182041959 182050150 4096 5 17743872 182037863 182046054 4096 6 17747968 182033767 182041958 4096 7 17752064 182029671 182037862 4096 ... 6757 45400064 154381644 154389835 4096 6758 45404160 154377548 154385739 4096 6759 45408256 252951571 154381643 73728 eof This happens because XFS_ALLOCTYPE_THIS_BNO allocation fails (the last extent in the file cannot be further extended) so we fall back to XFS_ALLOCTYPE_NEAR_BNO allocation which picks end of a large free extent as the best place to continue the file. Since the chunk at the end of the free extent again cannot be further extended, this behavior repeats until the whole free extent is consumed in a reversed order. For data allocations this backward allocation isn't beneficial so make xfs_alloc_compute_diff() pick start of a free extent instead of its end for them. That avoids the backward allocation pattern. See thread at http://oss.sgi.com/archives/xfs/2013-03/msg00144.html for more details about the reproduction case and why this solution was chosen. Based on idea by Dave Chinner <dchinner@redhat.com>. CC: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20fuse: update inode size and invalidate attributes on fallocateBrian Foster
An fallocate request without FALLOC_FL_KEEP_SIZE set can extend the size of a file. Update the inode size after a successful fallocate. Also invalidate the inode attributes after a successful fallocate to ensure we pick up the latest attribute values (i.e., i_blocks). Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-05-20fuse: truncate pagecache range on hole punchBrian Foster
fuse supports hole punch via the fallocate() FALLOC_FL_PUNCH_HOLE interface. When a hole punch is passed through, the page cache is not cleared and thus allows reading stale data from the cache. This is easily demonstrable (using FOPEN_KEEP_CACHE) by reading a smallish random data file into cache, punching a hole and creating a copy of the file. Drop caches or remount and observe that the original file no longer matches the file copied after the hole punch. The original file contains a zeroed range and the latter file contains stale data. Protect against writepage requests in progress and punch out the associated page cache range after a successful client fs hole punch. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-05-18Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Miao Xie has been very busy, fixing races and enospc problems and many other small but important pieces. Alexandre Oliva discovered some problems with how our error handling was interacting with the block layer and for now has disabled our partial handling of sub-page writes. The real sub-page work is in a series of patches from IBM that we still need to integrate and test. The code Alexandre has turned off was really incomplete. Josef has more error handling fixes and an important fix for the new skinny extent format. This also has my fix for the tracepoint crash from late in 3.9. It's the first stage in a larger clean up to get rid of btrfs_bio and make a proper bioset for all the items we need to tack into the bio. For now the bioset only holds our mirror_num and stripe_index, but for the next merge window I'll shuffle more in." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits) Btrfs: use a btrfs bioset instead of abusing bio internals Btrfs: make sure roots are assigned before freeing their nodes Btrfs: explicitly use global_block_rsv for quota_tree btrfs: do away with non-whole_page extent I/O Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix() Btrfs: pause the space balance when remounting to R/O Btrfs: fix unprotected root node of the subvolume's inode rb-tree Btrfs: fix accessing a freed tree root Btrfs: return errno if possible when we fail to allocate memory Btrfs: update the global reserve if it is empty Btrfs: don't steal the reserved space from the global reserve if their space type is different Btrfs: optimize the error handle of use_block_rsv() Btrfs: don't use global block reservation for inode cache truncation Btrfs: don't abort the current transaction if there is no enough space for inode cache Correct allowed raid levels on balance. Btrfs: fix possible memory leak in replace_path() Btrfs: fix possible memory leak in the find_parent_nodes() Btrfs: don't allow device replace on RAID5/RAID6 Btrfs: handle running extent ops with skinny metadata ...
2013-05-17Merge branch 'for-chris' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next
2013-05-17Btrfs: use a btrfs bioset instead of abusing bio internalsChris Mason
Btrfs has been pointer tagging bi_private and using bi_bdev to store the stripe index and mirror number of failed IOs. As bios bubble back up through the call chain, we use these to decide if and how to retry our IOs. They are also used to count IO failures on a per device basis. Recently a bio tracepoint was added lead to crashes because we were abusing bi_bdev. This commit adds a btrfs bioset, and creates explicit fields for the mirror number and stripe index. The plan is to extend this structure for all of the fields currently in struct btrfs_bio, which will mean one less kmalloc in our IO path. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Tejun Heo <tj@kernel.org>