path: root/fs/xfs
Age    Commit message    Author
2014-05-08Merge tag 'xfs-for-linus-3.15-rc5' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs fixes from Dave Chinner: "The main fix is adding support for default ACLs on O_TMPFILE opened inodes to bring XFS into line with other filesystems. Metadata CRCs are now also considered well enough tested to be fully supported, so we're removing the shouty warnings issued at mount time for filesystems with that format. And there's a transaction block reservation overrun fix. Summary: - fix a remote attribute size calculation bug that leads to a transaction overrun - add default ACLs to O_TMPFILE files - remove the EXPERIMENTAL tag from filesystems with metadata CRC support" * tag 'xfs-for-linus-3.15-rc5' of git://oss.sgi.com/xfs/xfs: xfs: remote attribute overwrite causes transaction overrun xfs: initialize default acls for ->tmpfile() xfs: fully support v5 format filesystems
2014-05-07xfs: fix directory readahead offset off-by-oneDave Chinner
Directory readahead can throw loud scary but harmless warnings when multiblock directories are in use and a specific pattern of discontiguous blocks is found in the directory. That is, if a hole follows a discontiguous block, it will throw a warning like: XFS (dm-1): xfs_da_do_buf: bno 637 dir: inode 34363923462 XFS (dm-1): [00] br_startoff 637 br_startblock 1917954575 br_blockcount 1 br_state 0 XFS (dm-1): [01] br_startoff 638 br_startblock -2 br_blockcount 1 br_state 0 And dump a stack trace. This is because the readahead offset increment loop does a double increment of the block index - it does an increment for the loop iteration as well as increasing the loop counter by the number of blocks in the extent. As a result, the readahead offset does not get incremented correctly for discontiguous blocks and hence can ask for readahead of a directory block from an offset part way through a directory block. If that directory block is followed by a hole, it will trigger a mapping warning like the above. The bad readahead will be ignored, though, because the main directory block read loop uses the correct mapping offsets rather than the readahead offset and so will ignore the bad readahead altogether. Fix the warning by ensuring that the readahead offset is correctly incremented. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
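To illustrate the class of off-by-one described above -- not the actual XFS readahead code -- here is a minimal, self-contained C sketch in which the block index is advanced both by the loop step and by the extent size, and a corrected variant that advances it exactly once per extent:

    #include <stdio.h>

    /* Toy model of the bug class only -- not the XFS readahead code. */
    struct extent {
        long startoff;      /* first directory block mapped by this extent */
        long blockcount;    /* number of blocks in the extent */
    };

    static void issue_readahead(const struct extent *map, int nmaps, int buggy)
    {
        long blk = map[0].startoff;

        for (int i = 0; i < nmaps; i++) {
            printf("readahead at block %ld (extent starts at %ld)\n",
                   blk, map[i].startoff);
            blk += map[i].blockcount;   /* advance past this extent */
            if (buggy)
                blk++;                  /* extra bump: the double increment */
        }
    }

    int main(void)
    {
        /* two discontiguous single-block extents, as in the warning above */
        struct extent map[] = { { 637, 1 }, { 638, 1 } };

        puts("buggy:"); issue_readahead(map, 2, 1);
        puts("fixed:"); issue_readahead(map, 2, 0);
        return 0;
    }

In the buggy run the second readahead is asked for at block 639 even though the extent starts at 638, which is the drift that produces the bogus mapping warning.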
2014-05-07xfs: don't sleep in xlog_cil_force_lsn on shutdownDave Chinner
Reports of a shutdown hang when fsyncing a directory have surfaced, such as this: [ 3663.394472] Call Trace: [ 3663.397199] [<ffffffff815f1889>] schedule+0x29/0x70 [ 3663.402743] [<ffffffffa01feda5>] xlog_cil_force_lsn+0x185/0x1a0 [xfs] [ 3663.416249] [<ffffffffa01fd3af>] _xfs_log_force_lsn+0x6f/0x2f0 [xfs] [ 3663.429271] [<ffffffffa01a339d>] xfs_dir_fsync+0x7d/0xe0 [xfs] [ 3663.435873] [<ffffffff811df8c5>] do_fsync+0x65/0xa0 [ 3663.441408] [<ffffffff811dfbc0>] SyS_fsync+0x10/0x20 [ 3663.447043] [<ffffffff815fc7d9>] system_call_fastpath+0x16/0x1b If we trigger a shutdown in xlog_cil_push() from xlog_write(), we will never wake waiters on the current push sequence number, so anything waiting in xlog_cil_force_lsn() for that push sequence number to come up will not get woken and hence stall the shutdown. Fix this by ensuring we call wake_up_all(&cil->xc_commit_wait) in the push abort handling, in the log shutdown code when waking all waiters, and adding a shutdown check in the sequence completion wait loops to ensure they abort when a wakeup due to a shutdown occurs. Reported-by: Boris Ranto <branto@redhat.com> Reported-by: Eric Sandeen <esandeen@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
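The shape of the fix -- re-check a shutdown flag inside the wait loop and broadcast to all waiters on abort -- can be sketched with POSIX primitives. This is an illustrative analogue only; the struct and function names are invented and do not correspond to the XFS implementation:

    #include <pthread.h>
    #include <stdbool.h>
    #include <errno.h>

    /* Illustrative analogue of a push-sequence wait, not the XFS code. */
    struct cil_like {
        pthread_mutex_t lock;
        pthread_cond_t  commit_wait;    /* analogue of cil->xc_commit_wait */
        unsigned long   done_seq;       /* highest completed push sequence */
        bool            shutdown;       /* analogue of a forced-shutdown check */
    };

    /* Waiter: must abort on shutdown, not only when the sequence completes. */
    int wait_for_seq(struct cil_like *cil, unsigned long seq)
    {
        int ret = 0;

        pthread_mutex_lock(&cil->lock);
        while (cil->done_seq < seq && !cil->shutdown)
            pthread_cond_wait(&cil->commit_wait, &cil->lock);
        if (cil->shutdown)
            ret = -EIO;     /* wakeup was due to shutdown, not completion */
        pthread_mutex_unlock(&cil->lock);
        return ret;
    }

    /* Abort path: mark shutdown and wake *all* waiters so nobody sleeps forever. */
    void abort_push(struct cil_like *cil)
    {
        pthread_mutex_lock(&cil->lock);
        cil->shutdown = true;
        pthread_cond_broadcast(&cil->commit_wait);
        pthread_mutex_unlock(&cil->lock);
    }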
2014-05-07xfs: truncate_setsize should be outside transactionsDave Chinner
truncate_setsize() removes pages from the page cache, and hence requires page locks to be held. It is not valid to lock a page cache page inside a transaction context as we can hold page locks when we reserve space for a transaction. If we do, then we expose an ABBA deadlock between log space reservation and page locks. That is, both the write path and writeback lock a page, then start a transaction for block allocation, which means they can block waiting for a log reservation with the page lock held. If we hold a log reservation and then do something that locks a page (e.g. truncate_setsize in xfs_setattr_size) then we can block on a page that is locked and waiting for a log reservation. If the transaction that is waiting for the page lock is the only active transaction in the system that can free log space via a commit, then writeback will never make progress and so log space will never free up. This issue with xfs_setattr_size() was introduced back in 2010 by commit fa9b227 ("xfs: new truncate sequence") which moved the page cache truncate from outside the transaction context (what was xfs_itruncate_data()) to inside the transaction context as a call to truncate_setsize(). The reason truncate_setsize() was located in this place was that we shouldn't change the file size until after we are in the transaction context and the operation will either succeed or shut down the filesystem on failure. However, block_truncate_page() already modifies the file contents before we enter the transaction context, so we can't really fulfill this guarantee in any way. Hence we may as well ensure that on success or failure, the in-memory inode and data is truncated away and that the application cleans up the mess appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
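The deadlock described here is the classic ABBA inversion. A schematic sketch with two ordinary mutexes standing in for the page lock and the log-space reservation (which in reality is not a mutex) -- illustrative only, not XFS code:

    #include <pthread.h>

    /* ABBA schematic only: mutexes stand in for the page lock and for the
     * transaction's log-space reservation. */
    static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t log_space = PTHREAD_MUTEX_INITIALIZER;

    void write_or_writeback_path(void)   /* locks the page, then reserves log space */
    {
        pthread_mutex_lock(&page_lock);
        pthread_mutex_lock(&log_space);   /* may block while the page stays locked */
        /* ... allocate blocks, write back ... */
        pthread_mutex_unlock(&log_space);
        pthread_mutex_unlock(&page_lock);
    }

    void old_setattr_size_path(void)      /* reserved log space, then truncated pages */
    {
        pthread_mutex_lock(&log_space);
        pthread_mutex_lock(&page_lock);   /* ABBA: deadlocks against the path above */
        /* ... truncate_setsize() equivalent ... */
        pthread_mutex_unlock(&page_lock);
        pthread_mutex_unlock(&log_space);
    }

Moving the page-cache truncate outside the transaction restores a single, consistent ordering: page locks are never acquired while a log reservation is held.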
2014-05-06xfs: switch to ->write_iter()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06xfs: switch to ->read_iter()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06xfs: trim the argument lists of xfs_file_{dio,buffered}_aio_write()Al Viro
pos is redundant (it's iocb->ki_pos), and iov/nr_segs/count are taken care of by lifting iov_iter into the caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06iov_iter_truncate()Al Viro
Now It Can Be Done(tm) - we don't need to do iov_shorten() in generic_file_direct_write() anymore, now that all ->direct_IO() instances are converted to proper iov_iter methods and honour iter->count and iter->iov_offset properly. Get rid of count/ocount arguments of generic_file_direct_write(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06start adding the tag to iov_iterAl Viro
For now, just use the same thing we pass to ->direct_IO() - it's all iovec-based at the moment. Pass it explicitly to iov_iter_init() and account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO() uses. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06switch {__,}blockdev_direct_IO() to iov_iterAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06get rid of pointless iov_length() in ->direct_IO()Al Viro
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06pass iov_iter to ->direct_IO()Al Viro
unmodified, for now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06kill generic_segment_checks()Al Viro
all callers of ->aio_read() and ->aio_write() have iov/nr_segs already checked - generic_segment_checks() done after that is just an odd way to spell iov_length(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06generic_file_direct_write(): switch to iov_iterAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06xfs: remote attribute overwrite causes transaction overrunDave Chinner
Commit e461fcb ("xfs: remote attribute lookups require the value length") passes the remote attribute length in the xfs_da_args structure on lookup so that CRC calculations and validity checking can be performed correctly by related code. This, unfortunately, has the side effect of changing the args->valuelen parameter in cases where it shouldn't. That is, when we replace a remote attribute, the incoming replacement stores the value and length in args->value and args->valuelen, but then the lookup which finds the existing remote attribute overwrites args->valuelen with the length of the remote attribute being replaced. Hence when we go to create the new attribute, we create it of the size of the existing remote attribute, not the size it is supposed to be. When the new attribute is much smaller than the old attribute, this results in a transaction overrun and an ASSERT() failure on a debug kernel: XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 331 Fix this by keeping the remote attribute value length separate to the attribute value length in the xfs_da_args structure. This enables us to pass the length of the remote attribute to be removed without overwriting the new attribute's length. Also, ensure that when we save remote block contexts for a later rename we zero the original state variables so that we don't confuse the state of the attribute to be removed with the state of the new attribute that we just added. [Spotted by Brian Foster.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-06xfs: initialize default acls for ->tmpfile()Brian Foster
The current tmpfile handler does not initialize default ACLs. Doing so within xfs_vn_tmpfile() makes it roughly equivalent to xfs_vn_mknod(), which is already used as a common create handler. xfs_vn_mknod() does not currently have a mechanism to determine whether to link the file into the namespace. Therefore, further abstract xfs_vn_mknod() into a new xfs_generic_create() handler with a tmpfile parameter. This new handler calls xfs_create_tmpfile() and d_tmpfile() on the dentry when called via ->tmpfile(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05xfs: Fix wrong error codes being returnedTuomas Tynkkynen
xfs_{compat_,}attrmulti_by_handle could return an errno with incorrect sign in some cases. While at it, make sure ENOMEM is returned instead of E2BIG if kmalloc fails. Signed-off-by: Tuomas Tynkkynen <tuomas.tynkkynen@iki.fi> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
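As a general illustration of the convention being enforced here (not the XFS handler itself), a kernel-style helper returns 0 on success and a negative errno on failure, and an allocation failure surfaces as -ENOMEM rather than -E2BIG; the function name and the size cap below are made up:

    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative helper only: negative errnos, and -ENOMEM for failed allocation. */
    static int copy_attr_value(const char *src, size_t len, char **out)
    {
        char *buf;

        if (len > 64 * 1024)        /* arbitrary illustrative cap */
            return -E2BIG;          /* the request really was too big */

        buf = malloc(len);
        if (!buf)
            return -ENOMEM;         /* allocation failure, not a size problem */

        memcpy(buf, src, len);
        *out = buf;
        return 0;                   /* success: zero, errors: negative */
    }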
2014-05-05xfs: remove dquot hintsDave Chinner
Group and project quota hints are currently stored on the user dquot. If we are attaching quotas to the inode, then the group and project dquots are stored as hints on the user dquot to save having to look them up again later. The thing is, the hints are not used for that inode for the rest of the life of the inode - the dquots are attached directly to the inode itself - so the only time the hints are used is when an inode first has dquots attached. When the hints on the user dquot don't match the dquots being attached to the inode, they are then removed and replaced with the new hints. If a user is concurrently modifying files in different group and/or project contexts, then this leads to thrashing of the hints attached to the user dquot. If user quotas are not enabled, then hints are never even used. So, if the hints are used to avoid the cost of the lookup, is the cost of the lookup significant enough to justify the hint infrastructure? Maybe it was once, when there was a global quota manager shared between all XFS filesystems and it was hash table based. However, lookups are now much simpler, requiring only a single lock and radix tree lookup local to the filesystem and no hash or LRU manipulations to be made. Hence the cost of lookup is much lower than when hints were implemented. Turns out that benchmarks show that, too, with there being no difference in performance when doing file creation workloads as a single user with user, group and project quotas enabled - the hints do not make the code go any faster. In fact, removing the hints shows a 2-3% reduction in the time it takes to create 50 million inodes.... So, let's just get rid of the hints and the complexity around them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05xfs: bulletproof xfs_qm_scall_trunc_qfiles()Eric Sandeen
Coverity noticed that if we sent junk into xfs_qm_scall_trunc_qfiles(), we could get back an uninitialized error value. So sanitize the flags we will accept, and initialize error anyway for good measure. (This bug may have been introduced via c61a9e39). Should resolve Coverity CID 1163872. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
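A minimal sketch of the defensive pattern -- made-up flag values and helper, not the XFS code: reject unknown flag bits up front and initialize the error variable so junk input can never return uninitialized stack garbage:

    #include <errno.h>

    /* Illustrative flag values only, not the actual XFS definitions. */
    #define QF_USER   (1u << 0)
    #define QF_GROUP  (1u << 1)
    #define QF_PROJ   (1u << 2)
    #define QF_ALL    (QF_USER | QF_GROUP | QF_PROJ)

    static int truncate_one(unsigned int type)
    {
        /* stand-in for truncating a single quota inode */
        (void)type;
        return 0;
    }

    int trunc_qfiles(unsigned int flags)
    {
        int error = -EINVAL;        /* initialized, so junk input can't leak garbage */

        if (flags == 0 || (flags & ~QF_ALL))
            return -EINVAL;         /* sanitize: only known quota types accepted */

        if (flags & QF_USER)
            error = truncate_one(QF_USER);
        if (!error && (flags & QF_GROUP))
            error = truncate_one(QF_GROUP);
        if (!error && (flags & QF_PROJ))
            error = truncate_one(QF_PROJ);
        return error;
    }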
2014-05-05xfs: fix Q_XQUOTARM ioctlEric Sandeen
The Q_XQUOTARM quotactl was not working properly, because we weren't passing around proper flags. The xfs_fs_set_xstate() ioctl handler used the same flags for Q_XQUOTAON/OFF as well as for Q_XQUOTARM, but Q_XQUOTAON/OFF look for XFS_UQUOTA_ACCT, XFS_UQUOTA_ENFD, XFS_GQUOTA_ACCT etc, i.e. quota type + state, while Q_XQUOTARM looks only for the type of quota, i.e. XFS_DQ_USER, XFS_DQ_GROUP etc. Unfortunately these flag spaces overlap a bit, so we got semi-random results for Q_XQUOTARM; i.e. the value for XFS_DQ_USER == XFS_UQUOTA_ACCT, etc. yeargh. Add a new quotactl op vector specifically for the QUOTARM operation, since it operates with a different flag space. This has been broken more or less forever, AFAICT. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
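The root cause is easier to see with a toy example of two flag namespaces that happen to share numeric values; the names and values below are illustrative, not the real XFS definitions:

    #include <stdio.h>

    /* Two distinct flag namespaces that collide numerically. A handler must
     * know which namespace the caller used; it cannot tell from the bits. */
    #define DQTYPE_USER   0x0001   /* "which quota type" namespace */
    #define DQTYPE_GROUP  0x0004

    #define QSTATE_UACCT  0x0001   /* "quota state" namespace: accounting on */
    #define QSTATE_UENFD  0x0002   /* enforcement on */

    int main(void)
    {
        unsigned int flags = DQTYPE_USER;   /* caller meant "user quota type" */

        /* A handler written for the state namespace misreads the same bits: */
        if (flags & QSTATE_UACCT)
            printf("misinterpreted as 'user quota accounting enabled'\n");

        /* Hence the fix: a separate op vector for QUOTARM, so the two
         * namespaces are never decoded by the same handler. */
        return 0;
    }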
2014-05-05xfs: fully support v5 format filesystemsDave Chinner
We have had this code in the kernel for over a year now and have shaken all the known issues out of the code over the past few releases. It's now time to remove the experimental warnings during mount and fully support the new filesystem format in production systems. Remove the experimental warning, and add a version number to the initial "mounting filesystem" message to tell us what type of filesystem is being mounted. Also, remove the temporary inode cluster size output at mount time now that we know that this code works fine. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24xfs: enable the finobt feature on v5 superblocksBrian Foster
Add the finobt feature bit to the list of known features. As of this point, the kernel code knows how to mount and manage both finobt and non-finobt formatted filesystems. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24xfs: report finobt status in fs geometryBrian Foster
Define the XFS_FSOP_GEOM_FLAGS_FINOBT fs geometry flag and set the associated bit if the filesystem supports the free inode btree. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24xfs: add finobt support to growfsBrian Foster
Add finobt support to growfs. Initialize the agi root/level fields and the root finobt block. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24xfs: update the finobt on inode freeBrian Foster
An inode free operation can have several effects on the finobt. If all inodes have been freed and the chunk deallocated, we remove the finobt record. If the inode chunk was previously full, we must insert a new record based on the existing inobt record. Otherwise, we modify the record in place. Create the xfs_difree_finobt() function to identify the potential scenarios and update the finobt appropriately. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
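The three cases can be summarized with a small decision helper; the structures and names below are a simplified model, not the xfs_difree_finobt() implementation:

    /* Simplified decision logic for the free-inode btree on inode free. */
    struct chunk_state {
        int freecount_before;   /* free inodes in the chunk before this free */
        int chunk_deallocated;  /* all inodes now free and the chunk released */
    };

    enum finobt_action { FINOBT_REMOVE_REC, FINOBT_INSERT_REC, FINOBT_UPDATE_REC };

    enum finobt_action difree_finobt_action(const struct chunk_state *cs)
    {
        if (cs->chunk_deallocated)
            return FINOBT_REMOVE_REC;   /* whole chunk gone: drop the record */
        if (cs->freecount_before == 0)
            return FINOBT_INSERT_REC;   /* chunk was full: it gains its first free
                                           inode, so insert a record mirroring
                                           the existing inobt record */
        return FINOBT_UPDATE_REC;       /* otherwise update the record in place */
    }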
2014-04-24xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helperBrian Foster
Refactor xfs_difree() in preparation for the finobt. xfs_difree() performs the validity checks against the ag and reads the agi header. The work of physically updating the inode allocation btree is pushed down into the new xfs_difree_inobt() helper. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24xfs: use and update the finobt on inode allocationBrian Foster
Replace xfs_dialloc_ag() with an implementation that looks for a record in the finobt. The finobt only tracks records with at least one free inode. This eliminates the need for the intra-ag scan in the original algorithm. Once the inode is allocated, update the finobt appropriately (possibly removing the record) as well as the inobt. Move the original xfs_dialloc_ag() algorithm to xfs_dialloc_ag_inobt() and fall back as such if finobt support is not enabled. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24xfs: insert newly allocated inode chunks into the finobtBrian Foster
A newly allocated inode chunk, by definition, has at least one free inode, so a record is always inserted into the finobt. Create the xfs_inobt_insert() helper from existing code to insert a record in an inobt based on the provided BTNUM. Update xfs_ialloc_ag_alloc() to invoke the helper for the existing XFS_BTNUM_INO tree and XFS_BTNUM_FINO tree, if enabled. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24xfs: update inode allocation/free transaction reservations for finobtBrian Foster
Create the xfs_calc_finobt_res() helper to calculate the finobt log reservation for inode allocation and free. Update XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to reserve blocks for the potential finobt record insertion on inode free (i.e., if an inode chunk was previously fully allocated). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24xfs: support the XFS_BTNUM_FINOBT free inode btree typeBrian Foster
Define the AGI fields for the finobt root/level and add magic numbers. Update the btree code to add support for the new XFS_BTNUM_FINOBT inode btree. The finobt root block is reserved immediately following the inobt root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the starting AG data block based on whether finobt support is enabled. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24xfs: reserve v5 superblock read-only compat. feature bit for finobtBrian Foster
Reserve a v5 read-only compatibility feature bit for the finobt and create the xfs_sb_version_hasfinobt() helper to determine whether an fs has the feature enabled. The finobt does not change existing on-disk structures, but must remain consistent with the ialloc btree. Modifications from older kernels would violate that constraint. Therefore, we restrict older kernels to read-only mounts of finobt-enabled filesystems. Note that this does not yet enable the ability to rw mount a finobt fs (by setting the feature bit in the XFS_SB_FEAT_RO_COMPAT_ALL mask). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
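The read-only-compat policy itself is simple and can be sketched as follows (illustrative types and mask, not the XFS superblock code): a kernel that does not know how to keep a ro-compat feature consistent on write may still mount the filesystem, but only read-only:

    #include <stdbool.h>

    /* Illustrative policy check only. */
    #define ROCOMPAT_WRITABLE_MASK  0u      /* features this kernel may modify;
                                               finobt not yet included, per the
                                               note above */

    struct sb_like {
        unsigned int ro_compat_features;    /* bits set on disk by mkfs */
    };

    /* Read-only-compatible features are safe to read but not to modify: any
     * such bit this kernel cannot keep consistent on write forces a
     * read-only mount. */
    bool must_mount_readonly(const struct sb_like *sb)
    {
        return (sb->ro_compat_features & ~ROCOMPAT_WRITABLE_MASK) != 0;
    }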
2014-04-24xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbersBrian Foster
The introduction of the free inode btree (finobt) requires that xfs_ialloc_btree.c handle multiple trees. Refactor xfs_ialloc_btree.c so the caller specifies the btree type on cursor initialization to prepare for addition of the finobt. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23xfs: add filestream allocator tracepointsChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23xfs: remove xfs_filestream_associateChristoph Hellwig
There is no good reason to create a filestream when a directory entry is created. Delay it until the first allocation happens to simplify the code and reduce the number of mru cache lookups we do. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23xfs: don't create a slab cache for filestream itemsChristoph Hellwig
We only have very few of these around, and allocation isn't that much of a hot path. Remove the slab cache to simplify the code, and to not waste any resources for the usual case of not having any inodes that use the filestream allocator. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23xfs: rewrite the filestream allocator using the dentry cacheChristoph Hellwig
In Linux we will always be able to find a parent inode for files that are undergoing I/O. Use this to simplify the file stream allocator by only keeping track of parent inodes. Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-04-23xfs: remove XFS_IFILESTREAMChristoph Hellwig
We never test the flag except in xfs_inode_is_filestream, but that function already tests the on-disk flag or filesystem wide flags, and is used to decide if we want to set XFS_IFILESTREAM in the first place. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23xfs: embed mru_elem into parent structureChristoph Hellwig
There is no need to do a separate allocation for each mru element, just embed the structure into the parent one in the user. Besides saving a memory allocation and the infrastructure required for it, this also simplifies the API. While we do major surgery on xfs_mru_cache.c, also de-typedef it and make struct mru_cache private to the implementation file. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23xfs: handle duplicate entries in xfs_mru_cache_insertChristoph Hellwig
The radix tree code can detect and reject duplicate keys at insert time. Make xfs_mru_cache_insert handle this case so that future changes to the filestream allocator can take advantage of this. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23xfs: split xfs_bmap_btalloc_nullfbChristoph Hellwig
Split xfs_bmap_btalloc_nullfb into one function for filestream allocations and one for everything else that share a few helpers. This dramatically simplifies the control flow. Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-04-20Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "These are regression and bug fixes for ext4. We had a number of new features in ext4 during this merge window (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so there were many more regression and bug fixes this time around. It didn't help that xfstests hadn't been fully updated to fully stress test COLLAPSE_RANGE until after -rc1" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits) ext4: disable COLLAPSE_RANGE for bigalloc ext4: fix COLLAPSE_RANGE failure with 1KB block size ext4: use EINVAL if not a regular file in ext4_collapse_range() ext4: enforce we are operating on a regular file in ext4_zero_range() ext4: fix extent merging in ext4_ext_shift_path_extents() ext4: discard preallocations after removing space ext4: no need to truncate pagecache twice in collapse range ext4: fix removing status extents in ext4_collapse_range() ext4: use filemap_write_and_wait_range() correctly in collapse range ext4: use truncate_pagecache() in collapse range ext4: remove temporary shim used to merge COLLAPSE_RANGE and ZERO_RANGE ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled ext4: always check ext4_ext_find_extent result ext4: fix error handling in ext4_ext_shift_extents ext4: silence sparse check warning for function ext4_trim_extent ext4: COLLAPSE_RANGE only works on extent-based files ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches ext4: use i_size_read in ext4_unaligned_aio() fs: disallow all fallocate operation on active swapfile fs: move falloc collapse range check into the filesystem methods ...
2014-04-17xfs: fix tmpfile/selinux deadlock and initialize securityBrian Foster
xfstests generic/004 reproduces an ilock deadlock using the tmpfile interface when selinux is enabled. This occurs because xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The latter eventually calls into xfs_xattr_get() which attempts to get the lock again. E.g.: xfs_io D ffffffff81c134c0 4096 3561 3560 0x00000080 ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8 00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540 ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480 Call Trace: [<ffffffff8177f969>] schedule+0x29/0x70 [<ffffffff81783a65>] rwsem_down_read_failed+0xc5/0x120 [<ffffffffa05aa97f>] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs] [<ffffffff813b3434>] call_rwsem_down_read_failed+0x14/0x30 [<ffffffff810ed179>] ? down_read_nested+0x89/0xa0 [<ffffffffa05aa7f2>] ? xfs_ilock+0x122/0x250 [xfs] [<ffffffffa05aa7f2>] xfs_ilock+0x122/0x250 [xfs] [<ffffffffa05aa97f>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs] [<ffffffffa05701d0>] xfs_attr_get+0x90/0xe0 [xfs] [<ffffffffa0565e07>] xfs_xattr_get+0x37/0x50 [xfs] [<ffffffff8124842f>] generic_getxattr+0x4f/0x70 [<ffffffff8133fd9e>] inode_doinit_with_dentry+0x1ae/0x650 [<ffffffff81340e0c>] selinux_d_instantiate+0x1c/0x20 [<ffffffff813351bb>] security_d_instantiate+0x1b/0x30 [<ffffffff81237db0>] d_instantiate+0x50/0x70 [<ffffffff81237e85>] d_tmpfile+0xb5/0xc0 [<ffffffffa05add02>] xfs_create_tmpfile+0x362/0x410 [xfs] [<ffffffffa0559ac8>] xfs_vn_tmpfile+0x18/0x20 [xfs] [<ffffffff81230388>] path_openat+0x228/0x6a0 [<ffffffff810230f9>] ? sched_clock+0x9/0x10 [<ffffffff8105a427>] ? kvm_clock_read+0x27/0x40 [<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0 [<ffffffff8123101a>] do_filp_open+0x3a/0x90 [<ffffffff817845e7>] ? _raw_spin_unlock+0x27/0x40 [<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0 [<ffffffff8121e3ce>] do_sys_open+0x12e/0x210 [<ffffffff8121e4ce>] SyS_open+0x1e/0x20 [<ffffffff8178eda9>] system_call_fastpath+0x16/0x1b xfs_vn_tmpfile() also fails to initialize security on the newly created inode. Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction has been committed and the inode unlocked. Also, initialize security on the inode based on the parent directory provided via the tmpfile call. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
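A sketch of the corrected ordering with stub types and helpers -- none of these are the real XFS or VFS calls: the security hook and dentry instantiation re-enter the xattr code and take the inode lock themselves, so they must run only after the creating transaction has committed and the lock has been dropped:

    #include <stdlib.h>

    /* Ordering sketch only; stub types and helpers, not the XFS/VFS API. */
    struct tmp_inode { int locked; };

    static int  create_unlinked(struct tmp_inode *ip) { ip->locked = 1; return 0; } /* commits the transaction, returns inode locked */
    static void unlock_inode(struct tmp_inode *ip)    { ip->locked = 0; }
    static int  init_security(struct tmp_inode *ip)   { return ip->locked ? -1 : 0; } /* would self-deadlock if still locked */
    static void instantiate_dentry(struct tmp_inode *ip) { (void)ip; }               /* analogue of d_tmpfile() */

    int tmpfile_sketch(struct tmp_inode *ip)
    {
        int error = create_unlinked(ip);

        if (error)
            return error;
        unlock_inode(ip);              /* drop the inode lock first... */
        error = init_security(ip);     /* ...then run hooks that re-take it */
        if (!error)
            instantiate_dentry(ip);
        return error;
    }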
2014-04-17xfs: fix buffer use after free on IO errorEric Sandeen
When testing exhaustion of dm snapshots, the following appeared with CONFIG_DEBUG_OBJECTS_FREE enabled: ODEBUG: free active (active state 0) object type: work_struct hint: xfs_buf_iodone_work+0x0/0x1d0 [xfs] indicating that we'd freed a buffer which still had a pending reference, down this path: [ 190.867975] [<ffffffff8133e6fb>] debug_check_no_obj_freed+0x22b/0x270 [ 190.880820] [<ffffffff811da1d0>] kmem_cache_free+0xd0/0x370 [ 190.892615] [<ffffffffa02c5924>] xfs_buf_free+0xe4/0x210 [xfs] [ 190.905629] [<ffffffffa02c6167>] xfs_buf_rele+0xe7/0x270 [xfs] [ 190.911770] [<ffffffffa034c826>] xfs_trans_read_buf_map+0x7b6/0xac0 [xfs] At issue is the fact that if IO fails in xfs_buf_iorequest, we'll queue completion unconditionally, and then call xfs_buf_rele; but if IO failed, there are no IOs remaining, and xfs_buf_rele will free the bp while work is still queued. Fix this by not scheduling completion if the buffer has an error on it; run it immediately. The rest is only comment changes. Thanks to dchinner for spotting the root cause. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
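The shape of the fix, with stand-in types rather than xfs_buf: if the I/O has already failed there is nothing left in flight, so run completion synchronously while a reference is still held instead of queueing work against a buffer that may be freed underneath it:

    /* Stand-in types only; not the xfs_buf code. */
    struct buf_like {
        int  error;                         /* sticky I/O error */
        void (*done)(struct buf_like *);    /* completion handler */
    };

    static void queue_completion(struct buf_like *bp)
    {
        /* stand-in for scheduling bp->done() on a workqueue */
        bp->done(bp);
    }

    void io_finish(struct buf_like *bp)
    {
        if (bp->error) {
            bp->done(bp);           /* run completion now, while a reference
                                       is still held; nothing is in flight */
            return;
        }
        queue_completion(bp);       /* normal case: defer to the workqueue */
    }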
2014-04-17xfs: wrong error sign conversion during failed DIO writesDave Chinner
We negate the error value being returned from a generic function incorrectly. The code path that it is running in returned negative errors, so there is no need to negate it to get the correct error signs here. This was uncovered by generic/019. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17xfs: unmount does not wait for shutdown during unmountDave Chinner
An interesting situation can occur if a log IO error occurs during the unmount of a filesystem. The cases reported have the same signature - the update of the superblock counters fails due to a log write IO error: XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa08a44a1 XFS (dm-16): Log I/O Error Detected. Shutting down filesystem XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount. XFS (dm-16): xfs_log_force: error 5 returned. XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s) It can be seen that the last line of output contains a corrupt device name - this is because the log and xfs_mount structures have already been freed by the time this message is printed. A kernel oops closely follows. The issue is that the shutdown is occurring in a separate IO completion thread to the unmount. Once the shutdown processing has started and all the iclogs are marked with XLOG_STATE_IOERROR, the log shutdown code wakes anyone waiting on a log force so they can process the shutdown error. This wakes up the unmount code that is doing a synchronous transaction to update the superblock counters. The unmount path now sees all the iclogs are marked with XLOG_STATE_IOERROR and so never waits on them again, knowing that if it does, there will not be a wakeup trigger for it and we will hang the unmount if we do. Hence the unmount runs through all the remaining code and frees all the filesystem structures while xlog_iodone() is still processing the shutdown. When the log shutdown processing completes, xfs_do_force_shutdown() emits the "Please umount the filesystem and rectify the problem(s)" message, and xlog_iodone() then aborts all the objects attached to the iclog. An iclog that has already been freed.... The real issue here is that there is no serialisation point between the log IO and the unmount. We have serialisation points for log writes, log forces, reservations, etc, but we don't actually have any code that waits for log IO to fully complete. We do that for all other types of object, so why not iclogbufs? Well, it turns out that we can easily do this. We've got xfs_buf handles, and that's what everyone else uses for IO serialisation. i.e. bp->b_sema. So, let's hold iclogbufs locked over IO, and only release the lock in xlog_iodone() when we are finished with the buffer. That way before we tear down the iclog, we can lock and unlock the buffer to ensure IO completion has finished completely before we tear it down. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Bob Mastors <bob.mastors@solidfire.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
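The serialisation idea can be sketched with a POSIX semaphore standing in for the buffer lock (a semaphore, like b_sema, may legitimately be released from a different context than the one that acquired it). Illustrative only, not the XFS code:

    #include <semaphore.h>

    /* Serialisation sketch only: a semaphore initialized to 1 acts as the
     * buffer lock and is held for the entire lifetime of the I/O. */
    struct iclog_like {
        sem_t b_sema;
    };

    void submit_io(struct iclog_like *ic)
    {
        sem_wait(&ic->b_sema);      /* take the lock and hold it across the I/O */
        /* ... hand the buffer to the block layer ... */
    }

    void io_done(struct iclog_like *ic)     /* I/O completion side */
    {
        /* ... completion / shutdown processing ... */
        sem_post(&ic->b_sema);      /* only now is the buffer idle again */
    }

    void tear_down(struct iclog_like *ic)   /* unmount side */
    {
        sem_wait(&ic->b_sema);      /* blocks until io_done() has run */
        sem_post(&ic->b_sema);
        /* safe to free the iclog now */
    }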
2014-04-17xfs: collapse range is delalloc challengedDave Chinner
FSX has been detecting data corruption after collapse range calls. The key observation is that the offset of the last extent in the file was not being shifted, and hence when the file size was adjusted it was truncating away data because the extents hadn't been correctly shifted. Tracing indicated that before the collapse, the extent list looked like: .... ino 0x5788 state idx 6 offset 26 block 195904 count 10 flag 0 ino 0x5788 state idx 7 offset 39 block 195917 count 35 flag 0 ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0 and after the shift of 2 blocks: ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0 ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0 ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0 Note that the last extent did not change offset. After the changing of the file size: ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0 ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0 ino 0x5788 state idx 8 offset 86 block 195964 count 30 flag 0 You can see that the last extent had its length truncated, indicating that we've lost data. The reason for this is that the xfs_bmap_shift_extents() loop uses XFS_IFORK_NEXTENTS() to determine how many extents are in the inode. This, unfortunately, doesn't take into account delayed allocation extents - it's a count of physically allocated extents - and hence when the file being collapsed has a delalloc extent like this one does prior to the range being collapsed: .... ino 0x5788 state idx 4 offset 11 block 4503599627239429 count 1 flag 0 .... it gets the count wrong and terminates the shift loop early. Fix it by using the in-memory extent array size that includes delayed allocation extents to determine the number of extents on the inode. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
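A toy model of the counting bug, with made-up structures rather than the xfs_bmap_shift_extents() code: if the loop bound counts only physically allocated extents, a delalloc extent earlier in the file makes the loop stop before the last real extent is shifted:

    #include <stdio.h>

    /* Toy model only -- not the XFS extent list code. */
    struct mem_extent {
        long startoff;
        int  delalloc;      /* 1 = delayed allocation, no blocks on disk yet */
    };

    static void shift_extents(struct mem_extent *ext, int nr, long shift, int buggy)
    {
        int allocated = 0, i;

        for (i = 0; i < nr; i++)
            if (!ext[i].delalloc)
                allocated++;    /* what a physically-allocated count sees */

        /* buggy loop bound uses 'allocated' instead of the in-memory count */
        for (i = 0; i < (buggy ? allocated : nr); i++)
            ext[i].startoff -= shift;
    }

    int main(void)
    {
        struct mem_extent ext[] = { { 11, 1 }, { 26, 0 }, { 39, 0 }, { 86, 0 } };

        shift_extents(ext, 4, 2, 1);
        printf("buggy: last extent offset %ld (should have been 84)\n",
               ext[3].startoff);
        return 0;
    }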
2014-04-17xfs: don't map ranges that span EOF for direct IODave Chinner
Al Viro tracked down the problem that has caused generic/263 to fail on XFS since the test was introduced. It is caused by xfs_get_blocks() mapping a single extent that spans EOF without marking it as buffer_new() so that the direct IO code does not zero the tail of the block at the new EOF. This is a long standing bug that has been around for many, many years. Because xfs_get_blocks() starts the map before EOF, it can't set buffer_new(), because that causes the direct IO code to also zero unaligned sectors at the head of the IO. This would overwrite valid data with zeros, and hence we cannot validly return a single extent that spans EOF to direct IO. Fix this by detecting a mapping that spans EOF and truncating it down to EOF. This results in the direct IO code doing the right thing for unaligned data blocks before EOF, and then returning to get another mapping for the region beyond EOF which XFS treats correctly by setting buffer_new() on it. This makes direct IO behave correctly w.r.t. tail block zeroing beyond EOF, and fsx is happy about that. Again, thanks to Al Viro for finding what I couldn't. [ dchinner: Fix for __divdi3 build error: Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com> Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> ] Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
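The core of the fix is a simple clamp, sketched here with plain integers rather than the xfs_get_blocks()/buffer_head machinery: trim any mapping that would cross EOF so the caller comes back for the post-EOF region, which can then safely be marked as newly allocated:

    /* Clamp sketch only: trim a mapping so it never spans EOF. */
    long clamp_mapping_at_eof(long map_offset, long map_len, long eof)
    {
        if (map_offset < eof && map_offset + map_len > eof)
            map_len = eof - map_offset;     /* stop the mapping exactly at EOF */
        return map_len;
    }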
2014-04-14xfs: don't try to use the filestream allocator for metadata allocationsChristoph Hellwig
xfs_bmap_btalloc_nullfb has two entirely different control flows when using the filestream allocator vs the regular one, but it gets the conditionals wrong and ends up mixing the two for metadata allocations. Fix this by adding a missing userdata check and slight refactoring. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14xfs: remove unused calculation in xfs_dir2_sf_addname()Eric Sandeen
The "add_entsize" calculated here is never used. "incr_isize" accounts for the inode expansion of the old entries + parent + new entry all by itself. Once we've removed add_entsize there, it's just a pointless intermediate variable elsewhere, so remove it. For that matter, old_isize is gratuitous too, so nuke that. And add a few comments so the magic "+1's" and "+2's" make a bit more sense. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14xfs: remove pointless pointer increment in xfs_dir2_block_compact()Eric Sandeen
xfs_dir2_block_compact() is passed a pointer to *blp, and advances it locally - but nobody uses the pointer (locally) after that. This behavior came about as part of prior refactoring, 20f7e9f xfs: factor dir2 block read operations and looking at the code as it was before, it seems quite clear that this change introduced a bug; the pre-refactoring code expects blp to be modified after compaction. And indeed it did; see this commit which fixed it: 37f1356 xfs: recalculate leaf entry pointer after compacting a dir2 block So the bug was introduced & resolved in the 3.8 cycle. Whoops. Well, it's fixed now, and mystery solved; just remove the now-pointless local increment of the blp pointer. (I guess we should have run clang earlier!) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>