summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-01-17Merge tag 'for-5.5-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes that have been in the works during last twp weeks. All have a user visible effect and are stable material: - scrub: properly update progress after calling cancel ioctl, calling 'resume' would start from the beginning otherwise - fix subvolume reference removal, after moving out of the original path the reference is not recognized and will lead to transaction abort - fix reloc root lifetime checks, could lead to crashes when there's subvolume cleaning running in parallel - fix memory leak when quotas get disabled in the middle of extent accounting - fix transaction abort in case of balance being started on degraded mount on eg. RAID1" * tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: check rw_devices, not num_devices for balance Btrfs: always copy scrub arguments back to user space btrfs: relocation: fix reloc_root lifespan and access btrfs: fix memory leak in qgroup accounting btrfs: do not delete mismatched root refs btrfs: fix invalid removal of root ref btrfs: rework arguments of btrfs_unlink_subvol
2020-01-17Merge tag 'fuse-fixes-5.5-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fix from Miklos Szeredi: "Fix a regression in the last release affecting the ftp module of the gvfs filesystem" * tag 'fuse-fixes-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix fuse_send_readpages() in the syncronous read case
2020-01-17btrfs: check rw_devices, not num_devices for balanceJosef Bacik
The fstest btrfs/154 reports [ 8675.381709] BTRFS: Transaction aborted (error -28) [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935 [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286 [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001 [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971 [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000 [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4 [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000 [ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000 [ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0 [ 8675.419801] Call Trace: [ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs] [ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs] [ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs] [ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs] [ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0 [ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs] [ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs] [ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0 [ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400 [ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0 [ 8675.436618] ? _raw_spin_unlock+0x1f/0x30 [ 8675.438093] ? __handle_mm_fault+0x499/0x740 [ 8675.439619] ? do_vfs_ioctl+0x56e/0x770 [ 8675.441034] do_vfs_ioctl+0x56e/0x770 [ 8675.442411] ksys_ioctl+0x3a/0x70 [ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 8675.445333] __x64_sys_ioctl+0x16/0x20 [ 8675.446705] do_syscall_64+0x50/0x210 [ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left We now use btrfs_can_overcommit() to see if we can flip a block group read only. Before this would fail because we weren't taking into account the usable un-allocated space for allocating chunks. With my patches we were allowed to do the balance, which is technically correct. The test is trying to start balance on degraded mount. So now we're trying to allocate a chunk and cannot because we want to allocate a RAID1 chunk, but there's only 1 device that's available for usage. This results in an ENOSPC. But we shouldn't even be making it this far, we don't have enough devices to restripe. The problem is we're using btrfs_num_devices(), that also includes missing devices. That's not actually what we want, we need to use rw_devices. The chunk_mutex is not needed here, rw_devices changes only in device add, remove or replace, all are excluded by EXCL_OP mechanism. Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add stacktrace, update changelog, drop chunk_mutex ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-17Btrfs: always copy scrub arguments back to user spaceFilipe Manana
If scrub returns an error we are not copying back the scrub arguments structure to user space. This prevents user space to know how much progress scrub has done if an error happened - this includes -ECANCELED which is returned when users ask for scrub to stop. A particular use case, which is used in btrfs-progs, is to resume scrub after it is canceled, in that case it relies on checking the progress from the scrub arguments structure and then use that progress in a call to resume scrub. So fix this by always copying the scrub arguments structure to user space, overwriting the value returned to user space with -EFAULT only if copying the structure failed to let user space know that either that copying did not happen, and therefore the structure is stale, or it happened partially and the structure is probably not valid and corrupt due to the partial copy. Reported-by: Graham Cobb <g.btrfs@cobb.uk.net> Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/ Fixes: 06fe39ab15a6a4 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl") CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Graham Cobb <g.btrfs@cobb.uk.net> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-16io_uring: only allow submit from owning taskJens Axboe
If the credentials or the mm doesn't match, don't allow the task to submit anything on behalf of this ring. The task that owns the ring can pass the file descriptor to another task, but we don't want to allow that task to submit an SQE that then assumes the ring mm and creds if it needs to go async. Cc: stable@vger.kernel.org Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-17ubifs: Fix ino_t format warnings in orphan_delete()Geert Uytterhoeven
On alpha and s390x: fs/ubifs/debug.h:158:11: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘ino_t {aka unsigned int}’ [-Wformat=] ... fs/ubifs/orphan.c:132:3: note: in expansion of macro ‘dbg_gen’ dbg_gen("deleted twice ino %lu", orph->inum); ... fs/ubifs/orphan.c:140:3: note: in expansion of macro ‘dbg_gen’ dbg_gen("delete later ino %lu", orph->inum); __kernel_ino_t is "unsigned long" on most architectures, but not on alpha and s390x, where it is "unsigned int". Hence when printing an ino_t, it should always be cast to "unsigned long" first. Fix this by re-adding the recently removed casts. Fixes: 8009ce956c3d2802 ("ubifs: Don't leak orphans on memory during commit") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-01-16ubifs: Fix deadlock in concurrent bulk-read and writepageZhihao Cheng
In ubifs, concurrent execution of writepage and bulk read on the same file may cause ABBA deadlock, for example (Reproduce method see Link): Process A(Bulk-read starts from page4) Process B(write page4 back) vfs_read wb_workfn or fsync ... ... generic_file_buffered_read write_cache_pages ubifs_readpage LOCK(page4) ubifs_bulk_read ubifs_writepage LOCK(ui->ui_mutex) ubifs_write_inode ubifs_do_bulk_read LOCK(ui->ui_mutex) find_or_create_page(alloc page4) ↑ LOCK(page4) <-- ABBA deadlock occurs! In order to ensure the serialization execution of bulk read, we can't remove the big lock 'ui->ui_mutex' in ubifs_bulk_read(). Instead, we allow ubifs_do_bulk_read() to lock page failed by replacing find_or_create_page(FGP_LOCK) with pagecache_get_page(FGP_LOCK | FGP_NOWAIT). Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Suggested-by: zhangyi (F) <yi.zhang@huawei.com> Cc: <Stable@vger.kernel.org> Fixes: 4793e7c5e1c ("UBIFS: add bulk-read facility") Link: https://bugzilla.kernel.org/show_bug.cgi?id=206153 Signed-off-by: Richard Weinberger <richard@nod.at>
2020-01-16ubifs: Fix wrong memory allocationSascha Hauer
In create_default_filesystem() when we allocate the idx node we must use the idx_node_size we calculated just one line before, not tmp, which contains completely other data. Fixes: c4de6d7e4319 ("ubifs: Refactor create_default_filesystem()") Cc: stable@vger.kernel.org # v4.20+ Reported-by: Naga Sureshkumar Relli <nagasure@xilinx.com> Tested-by: Naga Sureshkumar Relli <nagasure@xilinx.com> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-01-16ubifs: Add support for FS_ENCRYPT_FLEric Biggers
Make the FS_IOC_GETFLAGS ioctl on ubifs return the FS_ENCRYPT_FL flag on encrypted files, like ext4 and f2fs do. Also make this flag be ignored by FS_IOC_SETFLAGS, like ext4 and f2fs do, since it's a recognized flag but is not directly settable. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-01-16ubifs: Fix FS_IOC_SETFLAGS unexpectedly clearing encrypt flagEric Biggers
UBIFS's implementation of FS_IOC_SETFLAGS fails to preserve existing inode flags that aren't settable by FS_IOC_SETFLAGS, namely the encrypt flag. This causes the encrypt flag to be unexpectedly cleared. Fix it by preserving existing unsettable flags, like ext4 and f2fs do. Test case with kvm-xfstests shell: FSTYP=ubifs KEYCTL_PROG=keyctl . fs/ubifs/config . ~/xfstests/common/encrypt dev=$(__blkdev_to_ubi_volume /dev/vdc) ubiupdatevol -t $dev mount $dev /mnt -t ubifs k=$(_generate_session_encryption_key) mkdir /mnt/edir xfs_io -c "set_encpolicy $k" /mnt/edir echo contents > /mnt/edir/file chattr +i /mnt/edir/file chattr -i /mnt/edir/file With the bug, the following errors occur on the last command: [ 18.081559] fscrypt (ubifs, inode 67): Inconsistent encryption context (parent directory: 65) chattr: Operation not permitted while reading flags on /mnt/edir/file Fixes: d475a507457b ("ubifs: Add skeleton for fscrypto") Cc: <stable@vger.kernel.org> # v4.10+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-01-16xfs: check log iovec size to make sure it's plausibly a buffer log formatDarrick J. Wong
When log recovery is processing buffer log items, we should check that the incoming iovec actually describes a region of memory large enough to contain the log format and the dirty map. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16xfs: make struct xfs_buf_log_format have a consistent sizeDarrick J. Wong
Increase XFS_BLF_DATAMAP_SIZE by 1 to fill in the implied padding at the end of struct xfs_buf_log_format. This makes the size consistent so that we can check it in xfs_ondisk.h, and will be needed once we start logging attribute values. On amd64 we get the following pahole: struct xfs_buf_log_format { short unsigned int blf_type; /* 0 2 */ short unsigned int blf_size; /* 2 2 */ short unsigned int blf_flags; /* 4 2 */ short unsigned int blf_len; /* 6 2 */ long long int blf_blkno; /* 8 8 */ unsigned int blf_map_size; /* 16 4 */ unsigned int blf_data_map[16]; /* 20 64 */ /* --- cacheline 1 boundary (64 bytes) was 20 bytes ago --- */ /* size: 88, cachelines: 2, members: 7 */ /* padding: 4 */ /* last cacheline: 24 bytes */ }; But on i386 we get the following: struct xfs_buf_log_format { short unsigned int blf_type; /* 0 2 */ short unsigned int blf_size; /* 2 2 */ short unsigned int blf_flags; /* 4 2 */ short unsigned int blf_len; /* 6 2 */ long long int blf_blkno; /* 8 8 */ unsigned int blf_map_size; /* 16 4 */ unsigned int blf_data_map[16]; /* 20 64 */ /* --- cacheline 1 boundary (64 bytes) was 20 bytes ago --- */ /* size: 84, cachelines: 2, members: 7 */ /* last cacheline: 20 bytes */ }; Notice how the amd64 compiler inserts 4 bytes of padding to the end of the structure to ensure 8-byte alignment. Prior to "xfs: fix memory corruption during remote attr value buffer invalidation" we would try to write to blf_data_map[17], which is harmless on amd64 but really bad on i386. This shouldn't cause any changes in the ondisk logging formats because the log code writes out the log vectors with the appropriate size for the log item's map_size, and log recovery treats the data_map array as a VLA. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16xfs: complain if anyone tries to create a too-large buffer log itemDarrick J. Wong
Complain if someone calls xfs_buf_item_init on a buffer that is larger than the dirty bitmap can handle, or tries to log a region that's past the end of the dirty bitmap. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16xfs: clean up xfs_buf_item_get_format return valueDarrick J. Wong
The only thing that can cause a nonzero return from xfs_buf_item_get_format is if the kmem_alloc fails, which it can't. Get rid of all the unnecessary error handling. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16xfs: streamline xfs_attr3_leaf_inactiveDarrick J. Wong
Now that we know we don't have to take a transaction to stale the incore buffers for a remote value, get rid of the unnecessary memory allocation in the leaf walker and call the rmt_stale function directly. Flatten the loop while we're at it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16xfs: fix memory corruption during remote attr value buffer invalidationDarrick J. Wong
While running generic/103, I observed what looks like memory corruption and (with slub debugging turned on) a slub redzone warning on i386 when inactivating an inode with a 64k remote attr value. On a v5 filesystem, maximally sized remote attr values require one block more than 64k worth of space to hold both the remote attribute value header (64 bytes). On a 4k block filesystem this results in a 68k buffer; on a 64k block filesystem, this would be a 128k buffer. Note that even though we'll never use more than 65,600 bytes of this buffer, XFS_MAX_BLOCKSIZE is 64k. This is a problem because the definition of struct xfs_buf_log_format allows for XFS_MAX_BLOCKSIZE worth of dirty bitmap (64k). On i386 when we invalidate a remote attribute, xfs_trans_binval zeroes all 68k worth of the dirty map, writing right off the end of the log item and corrupting memory. We've gotten away with this on x86_64 for years because the compiler inserts a u32 padding on the end of struct xfs_buf_log_format. Fortunately for us, remote attribute values are written to disk with xfs_bwrite(), which is to say that they are not logged. Fix the problem by removing all places where we could end up creating a buffer log item for a remote attribute value and leave a note explaining why. Next, replace the open-coded buffer invalidation with a call to the helper we created in the previous patch that does better checking for bad metadata before marking the buffer stale. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16xfs: refactor remote attr value buffer invalidationDarrick J. Wong
Hoist the code that invalidates remote extended attribute value buffers into a separate helper function. This prepares us for a memory corruption fix in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16reiserfs: fix handling of -EOPNOTSUPP in reiserfs_for_each_xattrJeff Mahoney
Commit 60e4cf67a58 (reiserfs: fix extended attributes on the root directory) introduced a regression open_xa_root started returning -EOPNOTSUPP but it was not handled properly in reiserfs_for_each_xattr. When the reiserfs module is built without CONFIG_REISERFS_FS_XATTR, deleting an inode would result in a warning and chowning an inode would also result in a warning and then fail to complete. With CONFIG_REISERFS_FS_XATTR enabled, the xattr root would always be present for read-write operations. This commit handles -EOPNOSUPP in the same way -ENODATA is handled. Fixes: 60e4cf67a582 ("reiserfs: fix extended attributes on the root directory") CC: stable@vger.kernel.org # Commit 60e4cf67a58 was picked up by stable Link: https://lore.kernel.org/r/20200115180059.6935-1-jeffm@suse.com Reported-by: Michael Brunnbauer <brunni@netestate.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-01-16fuse: fix fuse_send_readpages() in the syncronous read caseMiklos Szeredi
Buffered read in fuse normally goes via: -> generic_file_buffered_read() -> fuse_readpages() -> fuse_send_readpages() ->fuse_simple_request() [called since v5.4] In the case of a read request, fuse_simple_request() will return a non-negative bytecount on success or a negative error value. A positive bytecount was taken to be an error and the PG_error flag set on the page. This resulted in generic_file_buffered_read() falling back to ->readpage(), which would repeat the read request and succeed. Because of the repeated read succeeding the bug was not detected with regression tests or other use cases. The FTP module in GVFS however fails the second read due to the non-seekable nature of FTP downloads. Fix by checking and ignoring positive return value from fuse_simple_request(). Reported-by: Ondrej Holy <oholy@redhat.com> Link: https://gitlab.gnome.org/GNOME/gvfs/issues/441 Fixes: 134831e36bbd ("fuse: convert readpages to simple api") Cc: <stable@vger.kernel.org> # v5.4 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-01-15xfs: fix IOCB_NOWAIT handling in xfs_file_dio_aio_readChristoph Hellwig
Direct I/O reads can also be used with RWF_NOWAIT & co. Fix the inode locking in xfs_file_dio_aio_read to take IOCB_NOWAIT into account. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-01-15io_uring: ensure workqueue offload grabs ring mutex for poll listJens Axboe
A previous commit moved the locking for the async sqthread, but didn't take into account that the io-wq workers still need it. We can't use req->in_async for this anymore as both the sqthread and io-wq workers set it, gate the need for locking on io_wq_current_is_worker() instead. Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions") Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-15io_uring: clear req->result always before issuing a read/write requestBijan Mottahedeh
req->result is cleared when io_issue_sqe() calls io_read/write_pre() routines. Those routines however are not called when the sqe argument is NULL, which is the case when io_issue_sqe() is called from io_wq_submit_work(). io_issue_sqe() may then examine a stale result if a polled request had previously failed with -EAGAIN: if (ctx->flags & IORING_SETUP_IOPOLL) { if (req->result == -EAGAIN) return -EAGAIN; io_iopoll_req_issued(req); } and in turn cause a subsequently completed request to be re-issued in io_wq_submit_work(). Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-15f2fs: free sysfs kobjectJaegeuk Kim
Detected kmemleak. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15f2fs: declare nested quota_sem and remove unnecessary semsJaegeuk Kim
1. f2fs_quota_sync -> down_read(&sbi->quota_sem) -> dquot_writeback_dquots -> f2fs_dquot_commit -> down_read(&sbi->quota_sem) 2. f2fs_quota_sync -> down_read(&sbi->quota_sem) -> f2fs_write_data_pages -> f2fs_write_single_data_page -> down_write(&F2FS_I(inode)->i_sem) f2fs_mkdir -> f2fs_do_add_link -> down_write(&F2FS_I(inode)->i_sem) -> f2fs_init_inode_metadata -> f2fs_new_node_page -> dquot_alloc_inode -> f2fs_dquot_mark_dquot_dirty -> down_read(&sbi->quota_sem) Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15f2fs: don't put new_page twice in f2fs_renameJaegeuk Kim
In f2fs_rename(), new_page is gone after f2fs_set_link(), but it tries to put again when whiteout is failed and jumped to put_out_dir. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15f2fs: set I_LINKABLE early to avoid wrong access by vfsJaegeuk Kim
This patch moves setting I_LINKABLE early in rename2(whiteout) to avoid the below warning. [ 3189.163385] WARNING: CPU: 3 PID: 59523 at fs/inode.c:358 inc_nlink+0x32/0x40 [ 3189.246979] Call Trace: [ 3189.248707] f2fs_init_inode_metadata+0x2d6/0x440 [f2fs] [ 3189.251399] f2fs_add_inline_entry+0x162/0x8c0 [f2fs] [ 3189.254010] f2fs_add_dentry+0x69/0xe0 [f2fs] [ 3189.256353] f2fs_do_add_link+0xc5/0x100 [f2fs] [ 3189.258774] f2fs_rename2+0xabf/0x1010 [f2fs] [ 3189.261079] vfs_rename+0x3f8/0xaa0 [ 3189.263056] ? tomoyo_path_rename+0x44/0x60 [ 3189.265283] ? do_renameat2+0x49b/0x550 [ 3189.267324] do_renameat2+0x49b/0x550 [ 3189.269316] __x64_sys_renameat2+0x20/0x30 [ 3189.271441] do_syscall_64+0x5a/0x230 [ 3189.273410] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 3189.275848] RIP: 0033:0x7f270b4d9a49 Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15f2fs: don't keep META_MAPPING pages used for moving verity file blocksEric Biggers
META_MAPPING is used to move blocks for both encrypted and verity files. So the META_MAPPING invalidation condition in do_checkpoint() should consider verity too, not just encrypt. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15f2fs: introduce private biosetChao Yu
In low memory scenario, we can allocate multiple bios without submitting any of them. - f2fs_write_checkpoint() - block_operations() - f2fs_sync_node_pages() step 1) flush cold nodes, allocate new bio from mempool - bio_alloc() - mempool_alloc() step 2) flush hot nodes, allocate a bio from mempool - bio_alloc() - mempool_alloc() step 3) flush warm nodes, be stuck in below call path - bio_alloc() - mempool_alloc() - loop to wait mempool element release, as we only reserved memory for two bio allocation, however above allocated two bios may never be submitted. So we need avoid using default bioset, in this patch we introduce a private bioset, in where we enlarg mempool element count to total number of log header, so that we can make sure we have enough backuped memory pool in scenario of allocating/holding multiple bios. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15f2fs: cleanup duplicate stats for atomic filesSahitya Tummala
Remove duplicate sbi->aw_cnt stats counter that tracks the number of atomic files currently opened (it also shows incorrect value sometimes). Use more relit lable sbi->atomic_files to show in the stats. Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15f2fs: Check write pointer consistency of non-open zonesShin'ichiro Kawasaki
To catch f2fs bugs in write pointer handling code for zoned block devices, check write pointers of non-open zones that current segments do not point to. Do this check at mount time, after the fsync data recovery and current segments' write pointer consistency fix. Or when fsync data recovery is disabled by mount option, do the check when there is no fsync data. Check two items comparing write pointers with valid block maps in SIT. The first item is check for zones with no valid blocks. When there is no valid blocks in a zone, the write pointer should be at the start of the zone. If not, next write operation to the zone will cause unaligned write error. If write pointer is not at the zone start, reset the write pointer to place at the zone start. The second item is check between the write pointer position and the last valid block in the zone. It is unexpected that the last valid block position is beyond the write pointer. In such a case, report as a bug. Fix is not required for such zone, because the zone is not selected for next write operation until the zone get discarded. Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15f2fs: Check write pointer consistency of open zonesShin'ichiro Kawasaki
On sudden f2fs shutdown, write pointers of zoned block devices can go further but f2fs meta data keeps current segments at positions before the write operations. After remounting the f2fs, this inconsistency causes write operations not at write pointers and "Unaligned write command" error is reported. To avoid the error, compare current segments with write pointers of open zones the current segments point to, during mount operation. If the write pointer position is not aligned with the current segment position, assign a new zone to the current segment. Also check the newly assigned zone has write pointer at zone start. If not, reset write pointer of the zone. Perform the consistency check during fsync recovery. Not to lose the fsync data, do the check after fsync data gets restored and before checkpoint commit which flushes data at current segment positions. Not to cause conflict with kworker's dirfy data/node flush, do the fix within SBI_POR_DOING protection. Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs fixes from Al Viro: "Fixes for mountpoint_last() bugs (by converting to use of lookup_last()) and an autofs regression fix from this cycle (caused by follow_managed() breakage introduced in barrier fixes series)" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix autofs regression caused by follow_managed() changes reimplement path_mountpoint() with less magic
2020-01-15xfs: Add __packed to xfs_dir2_sf_entry_t definitionVincenzo Frascino
xfs_check_ondisk_structs() verifies that the sizes of the data types used by xfs are correct via the XFS_CHECK_STRUCT_SIZE() macro. Since the structures padding can vary depending on the ABI (e.g. on ARM OABI structures are padded to multiple of 32 bits), it may happen that xfs_dir2_sf_entry_t size check breaks the compilation with the assertion below: In file included from linux/include/linux/string.h:6, from linux/include/linux/uuid.h:12, from linux/fs/xfs/xfs_linux.h:10, from linux/fs/xfs/xfs.h:22, from linux/fs/xfs/xfs_super.c:7: In function ‘xfs_check_ondisk_structs’, inlined from ‘init_xfs_fs’ at linux/fs/xfs/xfs_super.c:2025:2: linux/include/linux/compiler.h:350:38: error: call to ‘__compiletime_assert_107’ declared with attribute error: XFS: sizeof(xfs_dir2_sf_entry_t) is wrong, expected 3 _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__) Restore the correct behavior adding __packed to the structure definition. Cc: Darrick J. Wong <darrick.wong@oracle.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-01-15gfs2: Avoid access time thrashing in gfs2_inode_lookupAndreas Gruenbacher
In gfs2_inode_lookup, we initialize inode->i_atime to the lowest possibly value after gfs2_inode_refresh may already have been called. This should be the other way around, but we didn't notice because usually the inode type is known from the directory entry and so gfs2_inode_lookup won't call gfs2_inode_refresh. In addition, only initialize ip->i_no_formal_ino from no_formal_ino when actually needed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-01-15fix autofs regression caused by follow_managed() changesAl Viro
we need to reload ->d_flags after the call of ->d_manage() - the thing might've been called with dentry still negative and have the damn thing turned positive while we'd waited. Fixes: d41efb522e90 "fs/namei.c: pull positivity check into follow_managed()" Reported-by: Ian Kent <raven@themaw.net> Tested-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-15reimplement path_mountpoint() with less magicAl Viro
... and get rid of a bunch of bugs in it. Background: the reason for path_mountpoint() is that umount() really doesn't want attempts to revalidate the root of what it's trying to umount. The thing we want to avoid actually happen from complete_walk(); solution was to do something parallel to normal path_lookupat() and it both went overboard and got the boilerplate subtly (and not so subtly) wrong. A better solution is to do pretty much what the normal path_lookupat() does, but instead of complete_walk() do unlazy_walk(). All it takes to avoid that ->d_weak_revalidate() call... mountpoint_last() goes away, along with everything it got wrong, and so does the magic around LOOKUP_NO_REVAL. Another source of bugs is that when we traverse mounts at the final location (and we need to do that - umount . expects to get whatever's overmounting ., if any, out of the lookup) we really ought to take care of ->d_manage() - as it is, manual umount of autofs automount in progress can lead to unpleasant surprises for the daemon. Easily solved by using handle_lookup_down() instead of follow_mount(). Tested-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-14io_uring: be consistent in assigning next work from handlerJens Axboe
If we pass back dependent work in case of links, we need to always ensure that we call the link setup and work prep handler. If not, we might be missing some setup for the next work item. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-14io-wq: cancel work if we fail getting a mm referenceJens Axboe
If we require mm and user context, mark the request for cancellation if we fail to acquire the desired mm. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-14fs-verity: use u64_to_user_ptr()Eric Biggers
<linux/kernel.h> already provides a macro u64_to_user_ptr(). Use it instead of open-coding the two casts. No change in behavior. Link: https://lore.kernel.org/r/20191231175408.20524-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-01-14fs-verity: use mempool for hash requestsEric Biggers
When initializing an fs-verity hash algorithm, also initialize a mempool that contains a single preallocated hash request object. Then replace the direct calls to ahash_request_alloc() and ahash_request_free() with allocating and freeing from this mempool. This eliminates the possibility of the allocation failing, which is desirable for the I/O path. This doesn't cause deadlocks because there's no case where multiple hash requests are needed at a time to make forward progress. Link: https://lore.kernel.org/r/20191231175545.20709-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-01-14fs-verity: implement readahead of Merkle tree pagesEric Biggers
When fs-verity verifies data pages, currently it reads each Merkle tree page synchronously using read_mapping_page(). Therefore, when the Merkle tree pages aren't already cached, fs-verity causes an extra 4 KiB I/O request for every 512 KiB of data (assuming that the Merkle tree uses SHA-256 and 4 KiB blocks). This results in more I/O requests and performance loss than is strictly necessary. Therefore, implement readahead of the Merkle tree pages. For simplicity, we take advantage of the fact that the kernel already does readahead of the file's *data*, just like it does for any other file. Due to this, we don't really need a separate readahead state (struct file_ra_state) just for the Merkle tree, but rather we just need to piggy-back on the existing data readahead requests. We also only really need to bother with the first level of the Merkle tree, since the usual fan-out factor is 128, so normally over 99% of Merkle tree I/O requests are for the first level. Therefore, make fsverity_verify_bio() enable readahead of the first Merkle tree level, for up to 1/4 the number of pages in the bio, when it sees that the REQ_RAHEAD flag is set on the bio. The readahead size is then passed down to ->read_merkle_tree_page() for the filesystem to (optionally) implement if it sees that the requested page is uncached. While we're at it, also make build_merkle_tree_level() set the Merkle tree readahead size, since it's easy to do there. However, for now don't set the readahead size in fsverity_verify_page(), since currently it's only used to verify holes on ext4 and f2fs, and it would need parameters added to know how much to read ahead. This patch significantly improves fs-verity sequential read performance. Some quick benchmarks with 'cat'-ing a 250MB file after dropping caches: On an ARM64 phone (using sha256-ce): Before: 217 MB/s After: 263 MB/s (compare to sha256sum of non-verity file: 357 MB/s) In an x86_64 VM (using sha256-avx2): Before: 173 MB/s After: 215 MB/s (compare to sha256sum of non-verity file: 223 MB/s) Link: https://lore.kernel.org/r/20200106205533.137005-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-01-14fs-verity: implement readahead for FS_IOC_ENABLE_VERITYEric Biggers
When it builds the first level of the Merkle tree, FS_IOC_ENABLE_VERITY sequentially reads each page of the file using read_mapping_page(). This works fine if the file's data is already in pagecache, which should normally be the case, since this ioctl is normally used immediately after writing out the file. But in any other case this implementation performs very poorly, since only one page is read at a time. Fix this by implementing readahead using the functions from mm/readahead.c. This improves performance in the uncached case by about 20x, as seen in the following benchmarks done on a 250MB file (on x86_64 with SHA-NI): FS_IOC_ENABLE_VERITY uncached (before) 3.299s FS_IOC_ENABLE_VERITY uncached (after) 0.160s FS_IOC_ENABLE_VERITY cached 0.147s sha256sum uncached 0.191s sha256sum cached 0.145s Note: we could instead switch to kernel_read(). But that would mean we'd no longer be hashing the data directly from the pagecache, which is a nice optimization of its own. And using kernel_read() would require allocating another temporary buffer, hashing the data and tree pages separately, and explicitly zero-padding the last page -- so it wouldn't really be any simpler than direct pagecache access, at least for now. Link: https://lore.kernel.org/r/20200106205410.136707-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-01-14fscrypt: document gfp_flags for bounce page allocationEric Biggers
Document that fscrypt_encrypt_pagecache_blocks() allocates the bounce page from a mempool, and document what this means for the @gfp_flags argument. Link: https://lore.kernel.org/r/20191231181026.47400-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-01-14fscrypt: optimize fscrypt_zeroout_range()Eric Biggers
Currently fscrypt_zeroout_range() issues and waits on a bio for each block it writes, which makes it very slow. Optimize it to write up to 16 pages at a time instead. Also add a function comment, and improve reliability by allowing the allocations of the bio and the first ciphertext page to wait on the corresponding mempools. Link: https://lore.kernel.org/r/20191226160813.53182-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-01-14Merge branch 'dhowells' (patches from DavidH)Linus Torvalds
Merge misc fixes from David Howells. Two afs fixes and a key refcounting fix. * dhowells: afs: Fix afs_lookup() to not clobber the version on a new dentry afs: Fix use-after-loss-of-ref keys: Fix request_key() cache
2020-01-14afs: Fix afs_lookup() to not clobber the version on a new dentryDavid Howells
Fix afs_lookup() to not clobber the version set on a new dentry by afs_do_lookup() - especially as it's using the wrong version of the version (we need to use the one given to us by whatever op the dir contents correspond to rather than what's in the afs_vnode). Fixes: 9dd0b82ef530 ("afs: Fix missing dentry data version updating") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-14afs: Fix use-after-loss-of-refDavid Howells
afs_lookup() has a tracepoint to indicate the outcome of d_splice_alias(), passing it the inode to retrieve the fid from. However, the function gave up its ref on that inode when it called d_splice_alias(), which may have failed and dropped the inode. Fix this by caching the fid. Fixes: 80548b03991f ("afs: Add more tracepoints") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-14xfs: fix s_maxbytes computation on 32-bit kernelsDarrick J. Wong
I observed a hang in generic/308 while running fstests on a i686 kernel. The hang occurred when trying to purge the pagecache on a large sparse file that had a page created past MAX_LFS_FILESIZE, which caused an integer overflow in the pagecache xarray and resulted in an infinite loop. I then noticed that Linus changed the definition of MAX_LFS_FILESIZE in commit 0cc3b0ec23ce ("Clarify (and fix) MAX_LFS_FILESIZE macros") so that it is now one page short of the maximum page index on 32-bit kernels. Because the XFS function to compute max offset open-codes the 2005-era MAX_LFS_FILESIZE computation and neither the vfs nor mm perform any sanity checking of s_maxbytes, the code in generic/308 can create a page above the pagecache's limit and kaboom. Fix all this by setting s_maxbytes to MAX_LFS_FILESIZE directly and aborting the mount with a warning if our assumptions ever break. I have no answer for why this seems to have been broken for years and nobody noticed. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-14xfs: truncate should remove all blocks, not just to the end of the page cacheDarrick J. Wong
xfs_itruncate_extents_flags() is supposed to unmap every block in a file from EOF onwards. Oddly, it uses s_maxbytes as the upper limit to the bunmapi range, even though s_maxbytes reflects the highest offset the pagecache can support, not the highest offset that XFS supports. The result of this confusion is that if you create a 20T file on a 64-bit machine, mount the filesystem on a 32-bit machine, and remove the file, we leak everything above 16T. Fix this by capping the bunmapi request at the maximum possible block offset, not s_maxbytes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-14xfs: introduce XFS_MAX_FILEOFFDarrick J. Wong
Introduce a new #define for the maximum supported file block offset. We'll use this in the next patch to make it more obvious that we're doing some operation for all possible inode fork mappings after a given offset. We can't use ULLONG_MAX here because bunmapi uses that to detect when it's done. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>