summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2018-07-20ovl: During copy up, first copy up metadata and then dataVivek Goyal
Just a little re-ordering of code. This helps with next patch where after copying up metadata, we skip data copying step, if needed. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Provide a mount option metacopy=on/off for metadata copyupVivek Goyal
By default metadata only copy up is disabled. Provide a mount option so that users can choose one way or other. Also provide a kernel config and module option to enable/disable metacopy feature. metacopy feature requires redirect_dir=on when upper is present. Otherwise, it requires redirect_dir=follow atleast. As of now, metacopy does not work with nfs_export=on. So if both metacopy=on and nfs_export=on then nfs_export is disabled. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Move the copy up helpers to copy_up.cVivek Goyal
Right now two copy up helpers are in inode.c. Amir suggested it might be better to move these to copy_up.c. There will one more related function which will come in later patch. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Initialize ovl_inode->redirect in ovl_get_inode()Vivek Goyal
ovl_inode->redirect is an inode property and should be initialized in ovl_get_inode() only when we are adding a new inode to cache. If inode is already in cache, it is already initialized and we should not be touching ovl_inode->redirect field. As of now this is not a problem as redirects are used only for directories which don't share inode. But soon I want to use redirects for regular files also and there it can become an issue. Hence, move ->redirect initialization in ovl_get_inode(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-19fold generic_readlink() into its only callerAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-19binfmt_elf: Respect error return from `regset->active'Maciej W. Rozycki
The regset API documented in <linux/regset.h> defines -ENODEV as the result of the `->active' handler to be used where the feature requested is not available on the hardware found. However code handling core file note generation in `fill_thread_core_info' interpretes any non-zero result from the `->active' handler as the regset requested being active. Consequently processing continues (and hopefully gracefully fails later on) rather than being abandoned right away for the regset requested. Fix the problem then by making the code proceed only if a positive result is returned from the `->active' handler. Signed-off-by: Maciej W. Rozycki <macro@mips.com> Signed-off-by: Paul Burton <paul.burton@mips.com> Fixes: 4206d3aa1978 ("elf core dump: notes user_regset") Patchwork: https://patchwork.linux-mips.org/patch/19332/ Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: James Hogan <jhogan@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
2018-07-19Btrfs: fix file data corruption after cloning a range and fsyncFilipe Manana
When we clone a range into a file we can end up dropping existing extent maps (or trimming them) and replacing them with new ones if the range to be cloned overlaps with a range in the destination inode. When that happens we add the new extent maps to the list of modified extents in the inode's extent map tree, so that a "fast" fsync (the flag BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps and log corresponding extent items. However, at the end of range cloning operation we do truncate all the pages in the affected range (in order to ensure future reads will not get stale data). Sometimes this truncation will release the corresponding extent maps besides the pages from the page cache. If this happens, then a "fast" fsync operation will miss logging some extent items, because it relies exclusively on the extent maps being present in the inode's extent tree, leading to data loss/corruption if the fsync ends up using the same transaction used by the clone operation (that transaction was not committed in the meanwhile). An extent map is released through the callback btrfs_invalidatepage(), which gets called by truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The later ends up calling try_release_extent_mapping() which will release the extent map if some conditions are met, like the file size being greater than 16Mb, gfp flags allow blocking and the range not being locked (which is the case during the clone operation) nor being the extent map flagged as pinned (also the case for cloning). The following example, turned into a test for fstests, reproduces the issue: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar # reflink destination offset corresponds to the size of file bar, # 2728Kb minus 4Kb. $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar $ md5sum /mnt/bar 95a95813a8c2abc9aa75a6c2914a077e /mnt/bar <power fail> $ mount /dev/sdb /mnt $ md5sum /mnt/bar 207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the # power failure In the above example, the destination offset of the clone operation corresponds to the size of the "bar" file minus 4Kb. So during the clone operation, the extent map covering the range from 2572Kb to 2728Kb gets trimmed so that it ends at offset 2724Kb, and a new extent map covering the range from 2724Kb to 11724Kb is created. So at the end of the clone operation when we ask to truncate the pages in the range from 2724Kb to 2724Kb + 15908Kb, the page invalidation callback ends up removing the new extent map (through try_release_extent_mapping()) when the page at offset 2724Kb is passed to that callback. Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent map is removed at try_release_extent_mapping(), forcing the next fsync to search for modified extents in the fs/subvolume tree instead of relying on the presence of extent maps in memory. This way we can continue doing a "fast" fsync if the destination range of a clone operation does not overlap with an existing range or if any of the criteria necessary to remove an extent map at try_release_extent_mapping() is not met (file size not bigger then 16Mb or gfp flags do not allow blocking). CC: stable@vger.kernel.org # 3.16+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-18Merge tag 'for-4.18-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Three regression fixes. They're few-liners and fixing some corner cases missed in the origial patches" * tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block() btrfs: fix use-after-free of cmp workspace pages btrfs: restore uuid_mutex in btrfs_open_devices
2018-07-18block: Define and use STAT_READ and STAT_WRITEMichael Callahan
Add defines for STAT_READ and STAT_WRITE for indexing the partition stat entries. This clarifies some fs/ code which has hardcoded 1 for STAT_WRITE and will make it easier to extend the stats with additional fields. tj: Refreshed on top of v4.17. Signed-off-by: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18block: make bdev_ops->rw_page() take a REQ_OP instead of boolTejun Heo
c11f0c0b5bb9 ("block/mm: make bdev_ops->rw_page() take a bool for read/write") replaced @op with boolean @is_write, which limited the amount of information going into ->rw_page() and more importantly page_endio(), which removed the need to expose block internals to mm. Unfortunately, we want to track discards separately and @is_write isn't enough information. This patch updates bdev_ops->rw_page() to take REQ_OP instead but leaves page_endio() to take bool @is_write. This allows the block part of operations to have enough information while not leaking it to mm. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Mike Christie <mchristi@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18jffs2: use unsigned 32-bit timstamps consistentlyArnd Bergmann
Most users of jffs2 are 32-bit systems that traditionally only support timestamps using a 32-bit signed time_t, in the range from years 1902 to 2038. On 64-bit systems, jffs2 however interpreted the same timestamps as unsigned values, reading back negative times (before 1970) as times between 2038 and 2106. Now that Linux supports 64-bit inode timestamps even on 32-bit systems, let's use the second interpretation everywhere to allow jffs2 to be used on 32-bit systems beyond 2038 without a fundamental change to the inode format. This has a slight risk of regressions, when existing files with timestamps before 1970 are present in file system images and are now interpreted as future time stamps. I considered moving the wraparound point a bit, e.g. to 1960, in order to deal with timestamps that ended up on Dec 31, 1969 due to incorrect timezone handling. However, this would complicate the implementation unnecessarily, so I went with the simplest possible method of extending the timestamps. Writing files with timestamps before 1970 or after 2106 now results in those times being clamped in the file system. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
2018-07-18jffs2: use 64-bit intermediate timestampsArnd Bergmann
The VFS now uses timespec64 timestamps consistently, but jffs2 still converts them to 32-bit numbers on the storage medium. As the helper functions for the conversion (get_seconds() and timespec_to_timespec64()) are now deprecated, let's change them over to the more modern replacements. This keeps the traditional interpretation of those values, where the on-disk 32-bit numbers are taken to be negative numbers, i.e. dates before 1970, on 32-bit machines, but future numbers past 2038 on 64-bit machines. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
2018-07-18ovl: obsolete "check_copy_up" module optionMiklos Szeredi
This was provided for debugging the ro/rw inconsistecy. The inconsitency is now gone so this option is obsolete. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18vfs: remove open_flags from d_real()Miklos Szeredi
Opening regular files on overlayfs is now handled via ovl_open(). Remove the now unused "open_flags" argument from d_op->d_real() and the d_real() helper. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18Partially revert "locks: fix file locking on overlayfs"Miklos Szeredi
This partially reverts commit c568d68341be7030f5647def68851e469b21ca11. Overlayfs files will now automatically get the correct locks, no need to hack overlay support in VFS. It is a partial revert, because it leaves the locks_inode() calls in place and defines locks_inode() to file_inode(). We could revert those as well, but it would be unnecessary code churn and it makes sense to document that we are getting the inode for locking purposes. Don't revert MS_NOREMOTELOCK yet since that has been part of the userspace API for some time (though not in a useful way). Will try to remove internal flags later when the dust around the new mount API settles. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org>
2018-07-18Revert "vfs: do get_write_access() on upper layer of overlayfs"Miklos Szeredi
This reverts commit 4d0c5ba2ff79ef9f5188998b29fd28fcb05f3667. We now get write access on both overlay and underlying layers so this patch is no longer needed for correct operation. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18Revert "vfs: add flags to d_real()"Miklos Szeredi
This reverts commit 495e642939114478a5237a7d91661ba93b76f15a. No user of "flags" argument of d_real() remain. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18Revert "vfs: update ovl inode before relatime check"Miklos Szeredi
This reverts commit 598e3c8f72f5b77c84d2cb26cfd936ffb3cfdbaa. Overlayfs no longer relies on the vfs correct atime handling. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18Revert "ovl: fix relatime for directories"Miklos Szeredi
This reverts commit cd91304e7190b4c4802f8e413ab2214b233e0260. Overlayfs no longer relies on the vfs correct atime handling. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18vfs: fix freeze protection in mnt_want_write_file() for overlayfsMiklos Szeredi
The underlying real file used by overlayfs still contains the overlay path. This results in mnt_want_write_file() calls by the filesystem getting freeze protection on the wrong inode (the overlayfs one instead of the real one). Fix by using file_inode(file)->i_sb instead of file->f_path.mnt->mnt_sb. Reported-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-18Revert "ovl: don't allow writing ioctl on lower layer"Miklos Szeredi
This reverts commit 7c6893e3c9abf6a9676e060a1e35e5caca673d57. Overlayfs no longer relies on the vfs for checking writability of files. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18Revert "ovl: fix may_write_real() for overlayfs directories"Miklos Szeredi
This reverts commit 954c736f865d6c0c68ae4263a2f3502ee7c447a3. Overlayfs no longer relies on the vfs for checking writability of files. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18vfs: don't open realMiklos Szeredi
Let overlayfs do its thing when opening a file. This enables stacking and fixes the corner case when a file is opened for read, modified through a writable open, and data is read from the read-only file. After this patch the read-only open will not return stale data even in this case. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-18ovl: add reflink/copyfile/dedup supportMiklos Szeredi
Since set of arguments are so similar, handle in a common helper. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: add O_DIRECT supportMiklos Szeredi
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: add ovl_fiemap()Miklos Szeredi
Implement stacked fiemap(). Need to split inode operations for regular file (which has fiemap) and special file (which doesn't have fiemap). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: add lsattr/chattr supportMiklos Szeredi
Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: add ovl_fallocate()Miklos Szeredi
Implement stacked fallocate. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: add ovl_mmap()Miklos Szeredi
Implement stacked mmap. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: add ovl_fsync()Miklos Szeredi
Implement stacked fsync(). Don't sync if lower (noticed by Amir Goldstein). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: add ovl_write_iter()Miklos Szeredi
Implement stacked writes. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: add ovl_read_iter()Miklos Szeredi
Implement stacked reading. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: add helper to return real fileMiklos Szeredi
In the common case we can just use the real file cached in file->private_data. There are two exceptions: 1) File has been copied up since open: in this unlikely corner case just use a throwaway real file for the operation. If ever this becomes a perfomance problem (very unlikely, since overlayfs has been doing most fine without correctly handling this case at all), then we can deal with that by updating the cached real file. 2) File's f_flags have changed since open: no need to reopen the cached real file, we can just change the flags there as well. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: stack file opsMiklos Szeredi
Implement file operations on a regular overlay file. The underlying file is opened separately and cached in ->private_data. It might be worth making an exception for such files when accounting in nr_file to confirm to userspace expectations. We are only adding a small overhead (248bytes for the struct file) since the real inode and dentry are pinned by overlayfs anyway. This patch doesn't have any effect, since the vfs will use d_real() to find the real underlying file to open. The patch at the end of the series will actually enable this functionality. AV: make it use open_with_fake_path(), don't mess with override_creds SzM: still need to mess with override_creds() until no fs uses current_cred() in their open method. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-18ovl: deal with overlay files in ovl_d_real()Miklos Szeredi
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: copy up file size as wellMiklos Szeredi
Copy i_size of the underlying inode to the overlay inode in ovl_copyattr(). This is in preparation for stacking I/O operations on overlay files. This patch shouldn't have any observable effect. Remove stale comment from ovl_setattr() [spotted by Vivek Goyal]. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18Revert "Revert "ovl: get_write_access() in truncate""Miklos Szeredi
This reverts commit 31c3a7069593b072bd57192b63b62f9a7e994e9a. Re-add functionality dealing with i_writecount on truncate to overlayfs. This patch shouldn't have any observable effects, since we just re-assert the writecout that vfs_truncate() already got for us. This is in preparation for moving overlay functionality out of the VFS. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: copy up inode flagsMiklos Szeredi
On inode creation copy certain inode flags from the underlying real inode to the overlay inode. This is in preparation for moving overlay functionality out of the VFS. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18ovl: copy up timesMiklos Szeredi
Copy up mtime and ctime to overlay inode after times in real object are modified. Be careful not to dirty cachelines when not necessary. This is in preparation for moving overlay functionality out of the VFS. This patch shouldn't have any observable effect. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18vfs: export vfs_dedupe_file_range_one() to modulesMiklos Szeredi
This is needed by the stacked dedupe implementation in overlayfs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18vfs: export vfs_ioctl() to modulesMiklos Szeredi
This is needed by the stacked ioctl implementation in overlayfs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18vfs: make open_with_fake_path() not contribute to nr_filesMiklos Szeredi
Stacking file operations in overlay will store an extra open file for each overlay file opened. The overhead is just that of "struct file" which is about 256bytes, because overlay already pins an extra dentry and inode when the file is open, which add up to a much larger overhead. For fear of breaking working setups, don't start accounting the extra file. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18Merge branch 'dedupe-cleanup' into overlayfs-nextMiklos Szeredi
Following series for stacking overlay files depends on this mini series.
2018-07-18Merge branch 'for-ovl' of ↵Miklos Szeredi
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into overlayfs-next This gives us the open_with_fake_path() helper that is needed for stacked open files in overlay and mmap in particular.
2018-07-17aio: don't expose __aio_sigset in uapiChristoph Hellwig
glibc uses a different defintion of sigset_t than the kernel does, and the current version would pull in both. To fix this just do not expose the type at all - this somewhat mirrors pselect() where we do not even have a type for the magic sigmask argument, but just use pointer arithmetics. Fixes: 7a074e96 ("aio: implement io_pgetevents") Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Adrian Reber <adrian@lisas.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-17libxfs: Fix a couple of sparse complaintisCarlos Maiolino
No significant changes, just silence a couple of sparse errors. Using cpu_to_be32(NULLAGINO), the NULLAGINO constant will be encoded in BE as a constant, avoiding a BE -> CPU conversion every iteraction of the loop, if be32_to_cpu(agi->agi_unlinked[i]) was used instead. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17xfs: use swap macro in xfs_dir2_leafn_rebalanceGustavo A. R. Silva
Make use of the swap macro and remove unnecessary variable *tmp*. This makes the code easier to read and maintain. Also, slightly refactor some code. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17xfs_bmap_util: use swap macroGustavo A. R. Silva
Make use of the swap macro and remove some unnecessary variables. This makes the code easier to read and maintain. Also, reduces the stack usage. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17xfs_attr_leaf: use swap macro in xfs_attr3_leaf_rebalanceGustavo A. R. Silva
Make use of the swap macro and remove some unnecessary variables. This makes the code easier to read and maintain. Also, reduces the stack usage. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17xfs: don't assume a left rmap when allocating a new rmapDarrick J. Wong
The original rmap code assumed that there would always be at least one rmap in the rmapbt (the AG sb/agf/agi) and so errored out if it didn't find one. This assumption isn't true for the rmapbt repair function (and it won't be true for realtime rmap either), so remove the check and just deal with the situation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>