summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-05-04xfs: refactor xlog_recover_buffer_pass1Christoph Hellwig
Split out a xlog_add_buffer_cancelled helper which does the low-level manipulation of the buffer cancelation table, and in that helper call xlog_find_buffer_cancelled instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: simplify xlog_recover_inode_ra_pass2Christoph Hellwig
Don't bother to allocate memory and convert the log item when we only need the block number and the length. Just extract them directly and call xlog_buf_readahead separately in each branch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: factor out a xlog_buf_readahead helperChristoph Hellwig
Add a little helper to readahead a buffer if it hasn't been cancelled. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: rename inode_list xlog_recover_reorder_transChristoph Hellwig
This list contains pretty much everything that is not a buffer. The comment calls it item_list, which is a much better name than inode list, so switch the actual variable name to that as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: refactor the buffer cancellation table helpersChristoph Hellwig
Replace the somewhat convoluted use of xlog_peek_buffer_cancelled and xlog_check_buffer_cancelled with two obvious helpers: xlog_is_buffer_cancelled, which returns true if there is a buffer in the cancellation table, and xlog_put_buffer_cancelled, which also decrements the reference count of the buffer cancellation table. Both share a little helper to look up the entry. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: define printk_once variants for xfs messagesEric Sandeen
There are a couple places where we directly call printk_once() and one of them doesn't follow the standard xfs subsystem printk format as a result. #define printk_once variants to go with our existing printk_ratelimited #defines so we can do one-shot printks in a consistent manner. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: stop CONFIG_XFS_DEBUG from changing compiler flagsArnd Bergmann
I ran into a linker warning in XFS that originates from a mismatch between libelf, binutils and objtool when certain files in the kernel are built with "gcc -g": x86_64-linux-ld: fs/xfs/xfs_trace.o: unable to initialize decompress status for section .debug_info After some discussion, nobody could identify why xfs sets this flag here. CONFIG_XFS_DEBUG used to enable lots of unrelated settings, but now its main purpose is to enable extra consistency checks and assertions that are unrelated to the debug info. Remove the Makefile logic to set the flag here. If anyone relies on the debug info, this can simply be enabled again with the global CONFIG_DEBUG_INFO option. Dave Chinner writes: I'm pretty sure it was needed for the original kgdb integration back in the early 2000s. That was when SGI used to patch their XFS dev tree with kgdb and debug symbols were needed by the custom kgdb modules that were ported across from the Irix kernel debugger. ISTR that the early kcrash kernel dump analysis tools (again, originated from the Irix "icrash" kernel dump tools) had custom XFS debug scripts that needed also the debug info to work correctly... Which is a long way of saying "we don't need it anymore" instead of "nobody knows why it was set"... :) Suggested-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/lkml/20200409074130.GD21033@infradead.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: remove unnecessary check of the variable resblks in xfs_symlinkKaixu Xia
Since the "no-allocation" reservations has been removed, the resblks value should be larger than zero, so remove the unnecessary check. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: simplify the flags setting in xfs_qm_scall_quotaonKaixu Xia
Simplify the setting of the flags value, and only consider quota enforcement stuff here. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: remove unnecessary assertion from xfs_qm_vop_create_dqattachKaixu Xia
The check XFS_IS_QUOTA_RUNNING() has been done when enter the xfs_qm_vop_create_dqattach() function, it will return directly if the result is false, so the followed XFS_IS_QUOTA_RUNNING() assertion is unnecessary. If we truly care about this, the check also can be added to the condition of next if statements. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: remove unnecessary variable udqp from xfs_ioctl_setattrKaixu Xia
The initial value of variable udqp is NULL, and we only set the flag XFS_QMOPT_PQUOTA in xfs_qm_vop_dqalloc() function, so only the pdqp value is initialized and the udqp value is still NULL. Since the udqp value is NULL in the rest part of xfs_ioctl_setattr() function, it is meaningless and do nothing. So remove it from xfs_ioctl_setattr(). Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: reserve quota inode transaction space only when neededKaixu Xia
We share an inode between gquota and pquota with the older superblock that doesn't have separate pquotino, and for the need_alloc == false case we don't need to call xfs_dir_ialloc() function, so add the check if reserved free disk blocks is needed. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: combine two if statements with same conditionKaixu Xia
The two if statements have same condition, and the mask value does not change in xfs_setattr_nonsize(), so combine them. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: trace quota allocations for all quota typesKaixu Xia
The trace event xfs_dquot_dqalloc does not depend on the value uq, so remove the condition, and trace quota allocations for all quota types. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04xfs: report unrecognized log item type codes during recoveryDarrick J. Wong
When we're sorting recovered log items ahead of recovering them and encounter a log item of unknown type, actually print the type code when we're rejecting the whole transaction to aid in debugging. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-04fs/stat: Define DAX statx attributeIra Weiny
In order for users to determine if a file is currently operating in DAX state (effective DAX). Define a statx attribute value and set that attribute if the effective DAX flag is set. To go along with this we propose the following addition to the statx man page: STATX_ATTR_DAX The file is in the DAX (cpu direct access) state. DAX state attempts to minimize software cache effects for both I/O and memory mappings of this file. It requires a file system which has been configured to support DAX. DAX generally assumes all accesses are via cpu load / store instructions which can minimize overhead for small accesses, but may adversely affect cpu utilization for large transfers. File I/O is done directly to/from user-space buffers and memory mapped I/O may be performed with direct memory mappings that bypass kernel page cache. While the DAX property tends to result in data being transferred synchronously, it does not give the same guarantees of O_SYNC where data and the necessary metadata are transferred together. A DAX file may support being mapped with the MAP_SYNC flag, which enables a program to use CPU cache flush instructions to persist CPU store operations without an explicit fsync(2). See mmap(2) for more information. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04cachefiles: Fix corruption of the return value in ↵David Howells
cachefiles_read_or_alloc_pages() The patch which changed cachefiles from calling ->bmap() to using the bmap() wrapper overwrote the running return value with the result of calling bmap(). This causes an assertion failure elsewhere in the code. Fix this by using ret2 rather than ret to hold the return value. The oops looks like: kernel BUG at fs/nfs/fscache.c:468! ... RIP: 0010:__nfs_readpages_from_fscache+0x18b/0x190 [nfs] ... Call Trace: nfs_readpages+0xbf/0x1c0 [nfs] ? __alloc_pages_nodemask+0x16c/0x320 read_pages+0x67/0x1a0 __do_page_cache_readahead+0x1cf/0x1f0 ondemand_readahead+0x172/0x2b0 page_cache_async_readahead+0xaa/0xe0 generic_file_buffered_read+0x852/0xd50 ? mem_cgroup_commit_charge+0x6e/0x140 ? nfs4_have_delegation+0x19/0x30 [nfsv4] generic_file_read_iter+0x100/0x140 ? nfs_revalidate_mapping+0x176/0x2b0 [nfs] nfs_file_read+0x6d/0xc0 [nfs] new_sync_read+0x11a/0x1c0 __vfs_read+0x29/0x40 vfs_read+0x8e/0x140 ksys_read+0x61/0xd0 __x64_sys_read+0x1a/0x20 do_syscall_64+0x60/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5d148267e0 Fixes: 10d83e11a582 ("cachefiles: drop direct usage of ->bmap method.") Reported-by: David Wysochanski <dwysocha@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: David Wysochanski <dwysocha@redhat.com> cc: Carlos Maiolino <cmaiolino@redhat.com>
2020-05-04io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files()Xiaoguang Wang
The prepare_to_wait() and finish_wait() calls in io_uring_cancel_files() are mismatched. Currently I don't see any issues related this bug, just find it by learning codes. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-03Merge tag 'for-5.7-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull more btrfs fixes from David Sterba: "A few more stability fixes, minor build warning fixes and git url fixup: - fix partial loss of prealloc extent past i_size after fsync - fix potential deadlock due to wrong transaction handle passing via journal_info - fix gcc 4.8 struct intialization warning - update git URL in MAINTAINERS entry" * tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: MAINTAINERS: btrfs: fix git repo URL btrfs: fix gcc-4.8 build warning for struct initializer btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_info btrfs: fix partial loss of prealloc extent past i_size after fsync
2020-05-02Merge tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap fix from Darrick Wong: "Hoist the check for an unrepresentable FIBMAP return value into ioctl_fibmap. The internal kernel function can handle 64-bit values (and is needed to fix a regression on ext4 + jbd2). It is only the userspace ioctl that is so old that it cannot deal" * tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: fibmap: Warn and return an error in case of block > INT_MAX
2020-05-02Merge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable fixes: - fix handling of backchannel binding in BIND_CONN_TO_SESSION Bugfixes: - Fix a credential use-after-free issue in pnfs_roc() - Fix potential posix_acl refcnt leak in nfs3_set_acl - defer slow parts of rpc_free_client() to a workqueue - Fix an Oopsable race in __nfs_list_for_each_server() - Fix trace point use-after-free race - Regression: the RDMA client no longer responds to server disconnect requests - Fix return values of xdr_stream_encode_item_{present, absent} - _pnfs_return_layout() must always wait for layoutreturn completion Cleanups: - Remove unreachable error conditions" * tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix a race in __nfs_list_for_each_server() NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION SUNRPC: defer slow parts of rpc_free_client() to a workqueue. NFSv4: Remove unreachable error condition due to rpc_run_task() SUNRPC: Remove unreachable error condition xprtrdma: Fix use of xdr_stream_encode_item_{present, absent} xprtrdma: Fix trace point use-after-free race xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler() nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc() NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion
2020-05-01readdir.c: get rid of the last __put_user(), drop now-useless access_ok()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-01readdir.c: get compat_filldir() more or less in sync with filldir()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-01switch readdir(2) to unsafe_copy_dirent_name()Al Viro
... and the same for its compat counterpart Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-01Merge tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Fix for statx not grabbing the file table, making AT_EMPTY_PATH fail - Cover a few cases where async poll can handle retry, eliminating the need for an async thread - fallback request busy/free fix (Bijan) - syzbot reported SQPOLL thread exit fix for non-preempt (Xiaoguang) - Fix extra put of req for sync_file_range (Pavel) - Always punt splice async. We'll improve this for 5.8, but wanted to eliminate the inode mutex lock from the non-blocking path for 5.7 (Pavel) * tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-block: io_uring: punt splice async because of inode mutex io_uring: check non-sync defer_list carefully io_uring: fix extra put in sync_file_range() io_uring: use cond_resched() in io_ring_ctx_wait_and_kill() io_uring: use proper references for fallback_req locking io_uring: only force async punt if poll based retry can't handle it io_uring: enable poll retry for any file with ->read_iter / ->write_iter io_uring: statx must grab the file table for valid fd
2020-05-01io_uring: punt splice async because of inode mutexPavel Begunkov
Nonblocking do_splice() still may wait for some time on an inode mutex. Let's play safe and always punt it async. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-01io_uring: check non-sync defer_list carefullyPavel Begunkov
io_req_defer() do double-checked locking. Use proper helpers for that, i.e. list_empty_careful(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-01io_uring: fix extra put in sync_file_range()Pavel Begunkov
[ 40.179474] refcount_t: underflow; use-after-free. [ 40.179499] WARNING: CPU: 6 PID: 1848 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0 ... [ 40.179612] RIP: 0010:refcount_warn_saturate+0xae/0xf0 [ 40.179617] Code: 28 44 0a 01 01 e8 d7 01 c2 ff 0f 0b 5d c3 80 3d 15 44 0a 01 00 75 91 48 c7 c7 b8 f5 75 be c6 05 05 44 0a 01 01 e8 b7 01 c2 ff <0f> 0b 5d c3 80 3d f3 43 0a 01 00 0f 85 6d ff ff ff 48 c7 c7 10 f6 [ 40.179619] RSP: 0018:ffffb252423ebe18 EFLAGS: 00010286 [ 40.179623] RAX: 0000000000000000 RBX: ffff98d65e929400 RCX: 0000000000000000 [ 40.179625] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 00000000ffffffff [ 40.179627] RBP: ffffb252423ebe18 R08: 0000000000000001 R09: 000000000000055d [ 40.179629] R10: 0000000000000c8c R11: 0000000000000001 R12: 0000000000000000 [ 40.179631] R13: ffff98d68c434400 R14: ffff98d6a9cbaa20 R15: ffff98d6a609ccb8 [ 40.179634] FS: 0000000000000000(0000) GS:ffff98d6af580000(0000) knlGS:0000000000000000 [ 40.179636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 40.179638] CR2: 00000000033e3194 CR3: 000000006480a003 CR4: 00000000003606e0 [ 40.179641] Call Trace: [ 40.179652] io_put_req+0x36/0x40 [ 40.179657] io_free_work+0x15/0x20 [ 40.179661] io_worker_handle_work+0x2f5/0x480 [ 40.179667] io_wqe_worker+0x2a9/0x360 [ 40.179674] ? _raw_spin_unlock_irqrestore+0x24/0x40 [ 40.179681] kthread+0x12c/0x170 [ 40.179685] ? io_worker_handle_work+0x480/0x480 [ 40.179690] ? kthread_park+0x90/0x90 [ 40.179695] ret_from_fork+0x35/0x40 [ 40.179702] ---[ end trace 85027405f00110aa ]--- Opcode handler must never put submission ref, but that's what io_sync_file_range_finish() do. use io_steal_work() there. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30io_uring: use cond_resched() in io_ring_ctx_wait_and_kill()Xiaoguang Wang
While working on to make io_uring sqpoll mode support syscalls that need struct files_struct, I got cpu soft lockup in io_ring_ctx_wait_and_kill(), while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait)) cpu_relax(); above loop never has an chance to exit, it's because preempt isn't enabled in the kernel, and the context calling io_ring_ctx_wait_and_kill() and io_sq_thread() run in the same cpu, if io_sq_thread calls a cond_resched() yield cpu and another context enters above loop, then io_sq_thread() will always in runqueue and never exit. Use cond_resched() can fix this issue. Reported-by: syzbot+66243bb7126c410cefe6@syzkaller.appspotmail.com Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30io_uring: use proper references for fallback_req lockingBijan Mottahedeh
Use ctx->fallback_req address for test_and_set_bit_lock() and clear_bit_unlock(). Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30io_uring: only force async punt if poll based retry can't handle itJens Axboe
We do blocking retry from our poll handler, if the file supports polled notifications. Only mark the request as needing an async worker if we can't poll for it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30io_uring: enable poll retry for any file with ->read_iter / ->write_iterJens Axboe
We can have files like eventfd where it's perfectly fine to do poll based retry on them, right now io_file_supports_async() doesn't take that into account. Pass in data direction and check the f_op instead of just always needing an async worker. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-01uaccess: Selectively open read or write user accessChristophe Leroy
When opening user access to only perform reads, only open read access. When opening user access to only perform writes, only open write access. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/2e73bc57125c2c6ab12a587586a4eed3a47105fc.1585898438.git.christophe.leroy@c-s.fr
2020-04-30NFS: Fix a race in __nfs_list_for_each_server()Trond Myklebust
The struct nfs_server gets put on the cl_superblocks list before the server->super field has been initialised, in which case the call to nfs_sb_active() will Oops. Add a check to ensure that we skip such a list entry. Fixes: 3c9e502b59fb ("NFS: Add a helper nfs_client_for_each_server()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-30fibmap: Warn and return an error in case of block > INT_MAXRitesh Harjani
We better warn the fibmap user and not return a truncated and therefore an incorrect block map address if the bmap() returned block address is greater than INT_MAX (since user supplied integer pointer). It's better to pr_warn() all user of ioctl_fibmap() and return a proper error code rather than silently letting a FS corruption happen if the user tries to fiddle around with the returned block map address. We fix this by returning an error code of -ERANGE and returning 0 as the block mapping address in case if it is > INT_MAX. Now iomap_bmap() could be called from either of these two paths. Either when a user is calling an ioctl_fibmap() interface to get the block mapping address or by some filesystem via use of bmap() internal kernel API. bmap() kernel API is well equipped with handling of u64 addresses. WARN condition in iomap_bmap_actor() was mainly added to warn all the fibmap users. But now that we have directly added this warning for all fibmap users and also made sure to return 0 as block map address in case if addr > INT_MAX. So we can now remove this logic from iomap_bmap_actor(). Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-04-30btrfs: fix gcc-4.8 build warning for struct initializerArnd Bergmann
Some older compilers like gcc-4.8 warn about mismatched curly braces in a initializer: fs/btrfs/backref.c: In function 'is_shared_data_backref': fs/btrfs/backref.c:394:9: error: missing braces around initializer [-Werror=missing-braces] struct prelim_ref target = {0}; ^ fs/btrfs/backref.c:394:9: error: (near initialization for 'target.rbnode') [-Werror=missing-braces] Use the GNU empty initializer extension to avoid this. Fixes: ed58f2e66e84 ("btrfs: backref, don't add refs from shared block when resolving normal backref") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-30ovl: clear ATTR_OPEN from attr->ia_validVivek Goyal
As of now during open(), we don't pass bunch of flags to underlying filesystem. O_TRUNC is one of these. Normally this is not a problem as VFS calls ->setattr() with zero size and underlying filesystem sets file size to 0. But when overlayfs is running on top of virtiofs, it has an optimization where it does not send setattr request to server if dectects that truncation is part of open(O_TRUNC). It assumes that server already zeroed file size as part of open(O_TRUNC). fuse_do_setattr() { if (attr->ia_valid & ATTR_OPEN) { /* * No need to send request to userspace, since actual * truncation has already been done by OPEN. But still * need to truncate page cache. */ } } IOW, fuse expects O_TRUNC to be passed to it as part of open flags. But currently overlayfs does not pass O_TRUNC to underlying filesystem hence fuse/virtiofs breaks. Setup overlayfs on top of virtiofs and following does not zero the file size of a file is either upper only or has already been copied up. fd = open(foo.txt, O_TRUNC | O_WRONLY); There are two ways to fix this. Either pass O_TRUNC to underlying filesystem or clear ATTR_OPEN from attr->ia_valid so that fuse ends up sending a SETATTR request to server. Miklos is concerned that O_TRUNC might have side affects so it is better to clear ATTR_OPEN for now. Hence this patch clears ATTR_OPEN from attr->ia_valid. I found this problem while running unionmount-testsuite. With this patch, unionmount-testsuite passes with overlayfs on top of virtiofs. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Fixes: bccece1ead36 ("ovl: allow remote upper") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-04-30ovl: clear ATTR_FILE from attr->ia_validVivek Goyal
ovl_setattr() can be passed an attr which has ATTR_FILE set and attr->ia_file is a file pointer to overlay file. This is done in open(O_TRUNC) path. We should either replace with attr->ia_file with underlying file object or clear ATTR_FILE so that underlying filesystem does not end up using overlayfs file object pointer. There are no good use cases yet so for now clear ATTR_FILE. fuse seems to be one user which can use this. But it can work even without this. So it is not mandatory to pass ATTR_FILE to fuse. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Fixes: bccece1ead36 ("ovl: allow remote upper") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-04-28exec: Remove BUG_ON(has_group_leader_pid)Eric W. Biederman
With the introduction of exchange_tids thread_group_leader and has_group_leader_pid have become equivalent. Further at this point in the code a thread group has exactly two threads, the previous thread_group_leader that is waiting to be reaped and tsk. So we know it is impossible for tsk to be the thread_group_leader. This is also the last user of has_group_leader_pid so removing this check will allow has_group_leader_pid to be removed. So remove the "BUG_ON(has_group_leader_pid)" that will never fire. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-28Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs fixes from Al Viro: "Two old bugs..." * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: propagate_one(): mnt_set_mountpoint() needs mount_lock dlmfs_file_write(): fix the bogosity in handling non-zero *ppos
2020-04-28Fix use after free in get_tree_bdev()David Howells
Commit 6fcf0c72e4b9, a fix to get_tree_bdev() put a missing blkdev_put() in the wrong place, before a warnf() that displays the bdev under consideration rather after it. This results in a silent lockup in printk("%pg") called via warnf() from get_tree_bdev() under some circumstances when there's a race with the blockdev being frozen. This can be caused by xfstests/tests/generic/085 in combination with Lukas Czerner's ext4 mount API conversion patchset. It looks like it ought to occur with other users of get_tree_bdev() such as XFS, but apparently doesn't. Fix this by switching the order of the lines. Fixes: 6fcf0c72e4b9 ("vfs: add missing blkdev_put() in get_tree_bdev()") Reported-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Ian Kent <raven@themaw.net> cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-28proc: Ensure we see the exit of each process tid exactly onceEric W. Biederman
When the thread group leader changes during exec and the old leaders thread is reaped proc_flush_pid will flush the dentries for the entire process because the leader still has it's original pid. Fix this by exchanging the pids in an rcu safe manner, and wrapping the code to do that up in a helper exchange_tids. When I removed switch_exec_pids and introduced this behavior in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there really was nothing that cared as flushing happened with the cached dentry and de_thread flushed both of them on exec. This lack of fully exchanging pids became a problem a few months later when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization"). Which overlooked the de_thread case was no longer swapping pids, and I was looking up proc dentries by task->pid. The current behavior isn't properly a bug as everything in proc will continue to work correctly just a little bit less efficiently. Fix this just so there are no little surprise corner cases waiting to bite people. -- Oleg points out this could be an issue in next_tgid in proc where has_group_leader_pid is called, and reording some of the assignments should fix that. -- Oleg points out this will break the 10 year old hack in __exit_signal.c > /* > * This can only happen if the caller is de_thread(). > * FIXME: this is the temporary hack, we should teach > * posix-cpu-timers to handle this case correctly. > */ > if (unlikely(has_group_leader_pid(tsk))) > posix_cpu_timers_exit_group(tsk); The code in next_tgid has been changed to use PIDTYPE_TGID, and the posix cpu timers code has been fixed so it does not need the 10 year old hack, so this should be safe to merge now. Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/ Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization"). Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-28NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSIONOlga Kornievskaia
Currently, if the client sends BIND_CONN_TO_SESSION with NFS4_CDFC4_FORE_OR_BOTH but only gets NFS4_CDFS4_FORE back it ignores that it wasn't able to enable a backchannel. To make sure, the client sends BIND_CONN_TO_SESSION as the first operation on the connections (ie., no other session compounds haven't been sent before), and if the client's request to bind the backchannel is not satisfied, then reset the connection and retry. Cc: stable@vger.kernel.org Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-28Merge branch 'work.sysctl' of ↵Daniel Borkmann
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler methods to take kernel pointers instead. It gets rid of the set_fs address space overrides used by BPF. As per discussion, pull in the feature branch into bpf-next as it relates to BPF sysctl progs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
2020-04-28coredump: fix crash when umh is disabledLuis Chamberlain
Commit 64e90a8acb859 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") added the optiont to disable all call_usermodehelper() calls by setting STATIC_USERMODEHELPER_PATH to an empty string. When this is done, and crashdump is triggered, it will crash on null pointer dereference, since we make assumptions over what call_usermodehelper_exec() did. This has been reported by Sergey when one triggers a a coredump with the following configuration: ``` CONFIG_STATIC_USERMODEHELPER=y CONFIG_STATIC_USERMODEHELPER_PATH="" kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e ``` The way disabling the umh was designed was that call_usermodehelper_exec() would just return early, without an error. But coredump assumes certain variables are set up for us when this happens, and calls ile_start_write(cprm.file) with a NULL file. [ 2.819676] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 2.819859] #PF: supervisor read access in kernel mode [ 2.820035] #PF: error_code(0x0000) - not-present page [ 2.820188] PGD 0 P4D 0 [ 2.820305] Oops: 0000 [#1] SMP PTI [ 2.820436] CPU: 2 PID: 89 Comm: a Not tainted 5.7.0-rc1+ #7 [ 2.820680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014 [ 2.821150] RIP: 0010:do_coredump+0xd80/0x1060 [ 2.821385] Code: e8 95 11 ed ff 48 c7 c6 cc a7 b4 81 48 8d bd 28 ff ff ff 89 c2 e8 70 f1 ff ff 41 89 c2 85 c0 0f 84 72 f7 ff ff e9 b4 fe ff ff <48> 8b 57 20 0f b7 02 66 25 00 f0 66 3d 00 8 0 0f 84 9c 01 00 00 44 [ 2.822014] RSP: 0000:ffffc9000029bcb8 EFLAGS: 00010246 [ 2.822339] RAX: 0000000000000000 RBX: ffff88803f860000 RCX: 000000000000000a [ 2.822746] RDX: 0000000000000009 RSI: 0000000000000282 RDI: 0000000000000000 [ 2.823141] RBP: ffffc9000029bde8 R08: 0000000000000000 R09: ffffc9000029bc00 [ 2.823508] R10: 0000000000000001 R11: ffff88803dec90be R12: ffffffff81c39da0 [ 2.823902] R13: ffff88803de84400 R14: 0000000000000000 R15: 0000000000000000 [ 2.824285] FS: 00007fee08183540(0000) GS:ffff88803e480000(0000) knlGS:0000000000000000 [ 2.824767] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.825111] CR2: 0000000000000020 CR3: 000000003f856005 CR4: 0000000000060ea0 [ 2.825479] Call Trace: [ 2.825790] get_signal+0x11e/0x720 [ 2.826087] do_signal+0x1d/0x670 [ 2.826361] ? force_sig_info_to_task+0xc1/0xf0 [ 2.826691] ? force_sig_fault+0x3c/0x40 [ 2.826996] ? do_trap+0xc9/0x100 [ 2.827179] exit_to_usermode_loop+0x49/0x90 [ 2.827359] prepare_exit_to_usermode+0x77/0xb0 [ 2.827559] ? invalid_op+0xa/0x30 [ 2.827747] ret_from_intr+0x20/0x20 [ 2.827921] RIP: 0033:0x55e2c76d2129 [ 2.828107] Code: 2d ff ff ff e8 68 ff ff ff 5d c6 05 18 2f 00 00 01 c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 e9 7b ff ff ff 55 48 89 e5 <0f> 0b b8 00 00 00 00 5d c3 66 2e 0f 1f 84 0 0 00 00 00 00 0f 1f 40 [ 2.828603] RSP: 002b:00007fffeba5e080 EFLAGS: 00010246 [ 2.828801] RAX: 000055e2c76d2125 RBX: 0000000000000000 RCX: 00007fee0817c718 [ 2.829034] RDX: 00007fffeba5e188 RSI: 00007fffeba5e178 RDI: 0000000000000001 [ 2.829257] RBP: 00007fffeba5e080 R08: 0000000000000000 R09: 00007fee08193c00 [ 2.829482] R10: 0000000000000009 R11: 0000000000000000 R12: 000055e2c76d2040 [ 2.829727] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 2.829964] CR2: 0000000000000020 [ 2.830149] ---[ end trace ceed83d8c68a1bf1 ]--- ``` Cc: <stable@vger.kernel.org> # v4.11+ Fixes: 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199795 Reported-by: Tony Vroon <chainsaw@gentoo.org> Reported-by: Sergey Kvachonok <ravenexp@gmail.com> Tested-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20200416162859.26518-1-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-27Merge tag 'for-5.7-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - regression fixes: - transaction leak when deleting unused block group - log cleanup after transaction abort - fix block group leak when removing fails - transaction leak if relocation recovery fails - fix SPDX header * tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix transaction leak in btrfs_recover_relocation btrfs: fix block group leak when removing fails btrfs: drop logs when we've aborted a transaction btrfs: fix memory leak of transaction when deleting unused block group btrfs: discard: Use the correct style for SPDX License Identifier
2020-04-27io_uring: statx must grab the file table for valid fdJens Axboe
Clay reports that OP_STATX fails for a test case with a valid fd and empty path: -- Test 0: statx:fd 3: SUCCEED, file mode 100755 -- Test 1: statx:path ./uring_statx: SUCCEED, file mode 100755 -- Test 2: io_uring_statx:fd 3: FAIL, errno 9: Bad file descriptor -- Test 3: io_uring_statx:path ./uring_statx: SUCCEED, file mode 100755 This is due to statx not grabbing the process file table, hence we can't lookup the fd in async context. If the fd is valid, ensure that we grab the file table so we can grab the file from async context. Cc: stable@vger.kernel.org # v5.6 Reported-by: Clay Harris <bugs@claycon.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-27btrfs: transaction: Avoid deadlock due to bad initialization timing of ↵Qu Wenruo
fs_info::journal_info [BUG] One run of btrfs/063 triggered the following lockdep warning: ============================================ WARNING: possible recursive locking detected 5.6.0-rc7-custom+ #48 Not tainted -------------------------------------------- kworker/u24:0/7 is trying to acquire lock: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs] but task is already holding lock: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(sb_internal#2); lock(sb_internal#2); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by kworker/u24:0/7: #0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80 #1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80 #2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs] #3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs] stack backtrace: CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] Call Trace: dump_stack+0xc2/0x11a __lock_acquire.cold+0xce/0x214 lock_acquire+0xe6/0x210 __sb_start_write+0x14e/0x290 start_transaction+0x66c/0x890 [btrfs] btrfs_join_transaction+0x1d/0x20 [btrfs] find_free_extent+0x1504/0x1a50 [btrfs] btrfs_reserve_extent+0xd5/0x1f0 [btrfs] btrfs_alloc_tree_block+0x1ac/0x570 [btrfs] btrfs_copy_root+0x213/0x580 [btrfs] create_reloc_root+0x3bd/0x470 [btrfs] btrfs_init_reloc_root+0x2d2/0x310 [btrfs] record_root_in_trans+0x191/0x1d0 [btrfs] btrfs_record_root_in_trans+0x90/0xd0 [btrfs] start_transaction+0x16e/0x890 [btrfs] btrfs_join_transaction+0x1d/0x20 [btrfs] btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs] finish_ordered_fn+0x15/0x20 [btrfs] btrfs_work_helper+0x116/0x9a0 [btrfs] process_one_work+0x632/0xb80 worker_thread+0x80/0x690 kthread+0x1a3/0x1f0 ret_from_fork+0x27/0x50 It's pretty hard to reproduce, only one hit so far. [CAUSE] This is because we're calling btrfs_join_transaction() without re-using the current running one: btrfs_finish_ordered_io() |- btrfs_join_transaction() <<< Call #1 |- btrfs_record_root_in_trans() |- btrfs_reserve_extent() |- btrfs_join_transaction() <<< Call #2 Normally such btrfs_join_transaction() call should re-use the existing one, without trying to re-start a transaction. But the problem is, in btrfs_join_transaction() call #1, we call btrfs_record_root_in_trans() before initializing current::journal_info. And in btrfs_join_transaction() call #2, we're relying on current::journal_info to avoid such deadlock. [FIX] Call btrfs_record_root_in_trans() after we have initialized current::journal_info. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-27btrfs: fix partial loss of prealloc extent past i_size after fsyncFilipe Manana
When we have an inode with a prealloc extent that starts at an offset lower than the i_size and there is another prealloc extent that starts at an offset beyond i_size, we can end up losing part of the first prealloc extent (the part that starts at i_size) and have an implicit hole if we fsync the file and then have a power failure. Consider the following example with comments explaining how and why it happens. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt # Create our test file with 2 consecutive prealloc extents, each with a # size of 128Kb, and covering the range from 0 to 256Kb, with a file # size of 0. $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo $ xfs_io -c "falloc -k 128K 128K" /mnt/foo # Fsync the file to record both extents in the log tree. $ xfs_io -c "fsync" /mnt/foo # Now do a redudant extent allocation for the range from 0 to 64Kb. # This will merely increase the file size from 0 to 64Kb. Instead we # could also do a truncate to set the file size to 64Kb. $ xfs_io -c "falloc 0 64K" /mnt/foo # Fsync the file, so we update the inode item in the log tree with the # new file size (64Kb). This also ends up setting the number of bytes # for the first prealloc extent to 64Kb. This is done by the truncation # at btrfs_log_prealloc_extents(). # This means that if a power failure happens after this, a write into # the file range 64Kb to 128Kb will not use the prealloc extent and # will result in allocation of a new extent. $ xfs_io -c "fsync" /mnt/foo # Now set the file size to 256K with a truncate and then fsync the file. # Since no changes happened to the extents, the fsync only updates the # i_size in the inode item at the log tree. This results in an implicit # hole for the file range from 64Kb to 128Kb, something which fsck will # complain when not using the NO_HOLES feature if we replay the log # after a power failure. $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo So instead of always truncating the log to the inode's current i_size at btrfs_log_prealloc_extents(), check first if there's a prealloc extent that starts at an offset lower than the i_size and with a length that crosses the i_size - if there is one, just make sure we truncate to a size that corresponds to the end offset of that prealloc extent, so that we don't lose the part of that extent that starts at i_size if a power failure happens. A test case for fstests follows soon. Fixes: 31d11b83b96f ("Btrfs: fix duplicate extents after fsync of file with prealloc extents") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-27propagate_one(): mnt_set_mountpoint() needs mount_lockAl Viro
... to protect the modification of mp->m_count done by it. Most of the places that modify that thing also have namespace_lock held, but not all of them can do so, so we really need mount_lock here. Kudos to Piotr Krysiuk <piotras@gmail.com>, who'd spotted a related bug in pivot_root(2) (fixed unnoticed in 5.3); search for other similar turds has caught out this one. Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>