summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-05-30pstore: Do not leave timer disabled for next backendKees Cook
The pstore.update_ms value was being disabled during pstore_unregister(), which would cause any prior value to go unnoticed on the next pstore_register(). Instead, just let del_timer() stop the timer, which was always sufficient. This additionally refactors the timer reset code and allows the timer to be enabled if the module parameter is changed away from the default. Link: https://lore.kernel.org/lkml/20200506152114.50375-10-keescook@chromium.org/ Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30pstore: Add locking around superblock changesKees Cook
Nothing was protecting changes to the pstorefs superblock. Add locking and refactor away is_pstore_mounted(), instead using a helper to add a way to safely lock the pstorefs root inode during filesystem changes. Link: https://lore.kernel.org/lkml/20200506152114.50375-9-keescook@chromium.org/ Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30f2fs: fix wrong discard spaceChao Yu
Under heavy fsstress, we may triggle panic while issuing discard, because __check_sit_bitmap() detects that discard command may earse valid data blocks, the root cause is as below race stack described, since we removed lock when flushing quota data, quota data writeback may race with write_checkpoint(), so that it causes inconsistency in between cached discard entry and segment bitmap. - f2fs_write_checkpoint - block_operations - set_sbi_flag(sbi, SBI_QUOTA_SKIP_FLUSH) - f2fs_flush_sit_entries - add_discard_addrs - __set_bit_le(i, (void *)de->discard_map); - f2fs_write_data_pages - f2fs_write_single_data_page : inode is quota one, cp_rwsem won't be locked - f2fs_do_write_data_page - f2fs_allocate_data_block - f2fs_wait_discard_bio : discard entry has not been added yet. - update_sit_entry - f2fs_clear_prefree_segments - f2fs_issue_discard : add discard entry In order to fix this, this patch uses node_write to serialize f2fs_allocate_data_block and checkpoint. Fixes: 435cbab95e39 ("f2fs: fix quota_sync failure due to f2fs_lock_op") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-30io_uring: fix overflowed reqs cancellationPavel Begunkov
Overflowed requests in io_uring_cancel_files() should be shed only of inflight and overflowed refs. All other left references are owned by someone else. If refcount_sub_and_test() fails, it will go further and put put extra ref, don't do that. Also, don't need to do io_wq_cancel_work() for overflowed reqs, they will be let go shortly anyway. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-30io_uring: off timeouts based only on completionsPavel Begunkov
Offset timeouts wait not for sqe->off non-timeout CQEs, but rather sqe->off + number of prior inflight requests. Wait exactly for sqe->off non-timeout completions Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-30io_uring: move timeouts flushing to a helperPavel Begunkov
Separate flushing offset timeouts io_commit_cqring() by moving it into a helper. Just a preparation, makes following patches clearer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29fs/xfs: Update xfs_ioctl_setattr_dax_invalidate()Ira Weiny
Because of the separation of FS_XFLAG_DAX from S_DAX and the delayed setting of S_DAX, data invalidation no longer needs to happen when FS_XFLAG_DAX is changed. Change xfs_ioctl_setattr_dax_invalidate() to be xfs_ioctl_dax_check_set_cache() and alter the code to reflect the new functionality. Furthermore, we no longer need the locking so we remove the join_flags logic. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()Ira Weiny
The functionality in xfs_diflags_to_linux() and xfs_diflags_to_iflags() are nearly identical. The only difference is that *_to_linux() is called after inode setup and disallows changing the DAX flag. Combining them can be done with a flag which indicates if this is the initial setup to allow the DAX flag to be properly set only at init time. So remove xfs_diflags_to_linux() and call the modified xfs_diflags_to_iflags() directly. While we are here simplify xfs_diflags_to_iflags() to take struct xfs_inode and use xfs_ip2xflags() to ensure future diflags are included correctly. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29fs/xfs: Create function xfs_inode_should_enable_dax()Ira Weiny
xfs_inode_supports_dax() should reflect if the inode can support DAX not that it is enabled for DAX. Change the use of xfs_inode_supports_dax() to reflect only if the inode and underlying storage support dax. Add a new function xfs_inode_should_enable_dax() which reflects if the inode should be enabled for DAX. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29fs/xfs: Make DAX mount option a tri-stateIra Weiny
As agreed upon[1]. We make the dax mount option a tri-state. '-o dax' continues to operate the same. We add 'always', 'never', and 'inode' (default). [1] https://lore.kernel.org/lkml/20200405061945.GA94792@iweiny-DESK2.sc.intel.com/ Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYSIra Weiny
In prep for the new tri-state mount option which then introduces XFS_MOUNT_DAX_NEVER. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29fs/xfs: Remove unnecessary initialization of i_rwsemIra Weiny
An earlier call of xfs_reinit_inode() from xfs_iget_cache_hit() already handles initialization of i_rwsem. Doing so again is unneeded. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29exec: Compute file based creds only onceEric W. Biederman
Move the computation of creds from prepare_binfmt into begin_new_exec so that the creds need only be computed once. This is just code reorganization no semantic changes of any kind are made. Moving the computation is safe. I have looked through the kernel and verified none of the binfmts look at bprm->cred directly, and that there are no helpers that look at bprm->cred indirectly. Which means that it is not a problem to compute the bprm->cred later in the execution flow as it is not used until it becomes current->cred. A new function bprm_creds_from_file is added to contain the work that needs to be done. bprm_creds_from_file first computes which file bprm->executable or most likely bprm->file that the bprm->creds will be computed from. The funciton bprm_fill_uid is updated to receive the file instead of accessing bprm->file. The now unnecessary work needed to reset the bprm->cred->euid, and bprm->cred->egid is removed from brpm_fill_uid. A small comment to document that bprm_fill_uid now only deals with the work to handle suid and sgid files. The default case is already heandled by prepare_exec_creds. The function security_bprm_repopulate_creds is renamed security_bprm_creds_from_file and now is explicitly passed the file from which to compute the creds. The documentation of the bprm_creds_from_file security hook is updated to explain when the hook is called and what it needs to do. The file is passed from cap_bprm_creds_from_file into get_file_caps so that the caps are computed for the appropriate file. The now unnecessary work in cap_bprm_creds_from_file to reset the ambient capabilites has been removed. A small comment to document that the work of cap_bprm_creds_from_file is to read capabilities from the files secureity attribute and derive capabilities from the fact the user had uid 0 has been added. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-29exec: Add a per bprm->file version of per_clearEric W. Biederman
There is a small bug in the code that recomputes parts of bprm->cred for every bprm->file. The code never recomputes the part of clear_dangerous_personality_flags it is responsible for. Which means that in practice if someone creates a sgid script the interpreter will not be able to use any of: READ_IMPLIES_EXEC ADDR_NO_RANDOMIZE ADDR_COMPAT_LAYOUT MMAP_PAGE_ZERO. This accentially clearing of personality flags probably does not matter in practice because no one has complained but it does make the code more difficult to understand. Further remaining bug compatible prevents the recomputation from being removed and replaced by simply computing bprm->cred once from the final bprm->file. Making this change removes the last behavior difference between computing bprm->creds from the final file and recomputing bprm->cred several times. Which allows this behavior change to be justified for it's own reasons, and for any but hunts looking into why the behavior changed to wind up here instead of in the code that will follow that computes bprm->cred from the final bprm->file. This small logic bug appears to have existed since the code started clearing dangerous personality bits. History Tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Fixes: 1bb0fa189c6a ("[PATCH] NX: clean up legacy binary support") Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-29pselect6() and friends: take handling the combined 6th/7th args into helperAl Viro
... and use unsafe_get_user(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29Merge tag 'ceph-for-5.7-rc8' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "Cache tiering and cap handling fixups, both marked for stable" * tag 'ceph-for-5.7-rc8' of git://github.com/ceph/ceph-client: ceph: flush release queue when handling caps for unknown inode libceph: ignore pool overlay and cache logic on redirects
2020-05-29Merge tag 'gfs2-v5.7-rc7.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fix from Andreas Gruenbacher: "Fix the previous, flawed gfs2_find_jhead commit" * tag 'gfs2-v5.7-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Even more gfs2_find_jhead fixes
2020-05-29orangefs: convert get_user_pages() --> pin_user_pages()John Hubbard
This code was using get_user_pages*(), in a "Case 1" scenario (Direct IO), using the categorization from [1]. That means that it's time to convert the get_user_pages*() + put_page() calls to pin_user_pages*() + unpin_user_pages() calls. There is some helpful background in [2]: basically, this is a small part of fixing a long-standing disconnect between pinning pages, and file systems' use of those pages. [1] Documentation/core-api/pin_user_pages.rst [2] "Explicit pinning of user-space pages": https://lwn.net/Articles/807108/ Cc: Mike Marshall <hubcap@omnibond.com> Cc: Martin Brandenburg <martin@omnibond.com> Cc: devel@lists.orangefs.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2020-05-29orangefs: remove redundant assignment to variable retColin Ian King
The variable ret is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2020-05-29net: add a new bind_add methodChristoph Hellwig
The SCTP protocol allows to bind multiple address to a socket. That feature is currently only exposed as a socket option. Add a bind_add method struct proto that allows to bind additional addresses, and switch the dlm code to use the method instead of going through the socket option from kernel space. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29sctp: add sctp_sock_set_nodelayChristoph Hellwig
Add a helper to directly set the SCTP_NODELAY sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29gfs2: Even more gfs2_find_jhead fixesAndreas Gruenbacher
Fix several issues in the previous gfs2_find_jhead fix: * When updating @blocks_submitted, @block refers to the first block block not submitted yet, not the last block submitted, so fix an off-by-one error. * We want to ensure that @blocks_submitted is far enough ahead of @blocks_read to guarantee that there is in-flight I/O. Otherwise, we'll eventually end up waiting for pages that haven't been submitted, yet. * It's much easier to compare the number of blocks added with the number of blocks submitted to limit the maximum bio size. * Even with bio chaining, we can keep adding blocks until we reach the maximum bio size, as long as we stop at a page boundary. This simplifies the logic. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2020-05-29fs: fix indentation in deactivate_super()Yufen Yu
Fix the breaked indent in deactive_super(). Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29vfs: Remove duplicated d_mountpoint check in __is_local_mountpointNikolay Borisov
This function acts as an out-of-line helper for is_local_mountpoint is only called after the latter verifies the dentry is not a mountpoint. There's no semantic changes and the resulting object code is smaller: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-26 (-26) Function old new delta __is_local_mountpoint 147 121 -26 Total: Before=34161, After=34135, chg -0.08% Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29erofs: suppress false positive last_block warningGao Xiang
As Andrew mentioned, some rare specific gcc versions could report last_block uninitialized warning. Actually last_block doesn't need to be uninitialized first from its implementation due to bio == NULL condition. After a bio is allocated, last_block will be assigned then. The detailed analysis is in this thread [1]. So let's silence those confusing gccs simply. [1] https://lore.kernel.org/r/20200421072839.GA13867@hsiangkao-HP-ZHAN-66-Pro-G1 Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20200528084844.23359-1-hsiangkao@redhat.com Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-05-29erofs: convert to use the new mount fs_context apiChao Yu
Convert the erofs to use new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem. See Documentation/filesystems/mount_api.txt for more information. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20200529104836.17843-1-hsiangkao@redhat.com Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-05-29reiserfs: Replace kmalloc with kcalloc in the commentLiao Pingfang
Use kcalloc instead of kmalloc in the comment according to the previous kcalloc() call. Link: https://lore.kernel.org/r/1590714150-15895-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn> Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-28nfsd: Fix svc_xprt refcnt leak when setup callback client failedXiyu Yang
nfsd4_process_cb_update() invokes svc_xprt_get(), which increases the refcount of the "c->cn_xprt". The reference counting issue happens in one exception handling path of nfsd4_process_cb_update(). When setup callback client failed, the function forgets to decrease the refcnt increased by svc_xprt_get(), causing a refcnt leak. Fix this issue by calling svc_xprt_put() when setup callback client failed. Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-05-28Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "5 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: include/asm-generic/topology.h: guard cpumask_of_node() macro argument fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info() mm: remove VM_BUG_ON(PageSlab()) from page_mapcount() mm,thp: stop leaking unreleased file pages mm/z3fold: silence kmemleak false positives of slots
2020-05-28f2fs: compress: don't compress any datas after cp stopChao Yu
While compressed data writeback, we need to drop dirty pages like we did for non-compressed pages if cp stops, however it's not needed to compress any data in such case, so let's detect cp stop condition in cluster_may_compress() to avoid redundant compressing and let following f2fs_write_raw_pages() drops dirty pages correctly. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-28f2fs: remove unneeded return value of __insert_discard_tree()Chao Yu
We never use return value of __insert_discard_tree(), so remove it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-28f2fs: fix wrong value of tracepoint parameterChao Yu
In f2fs_lookup(), we should set @err correctly before printing it in tracepoint. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-28f2fs: protect new segment allocation in expand_inode_dataDaeho Jeong
Found a new segemnt allocation without f2fs_lock_op() in expand_inode_data(). So, when we do fallocate() for a pinned file and trigger checkpoint very frequently and simultaneously. F2FS gets stuck in the below code of do_checkpoint() forever. f2fs_sync_meta_pages(sbi, META, LONG_MAX, FS_CP_META_IO); /* Wait for all dirty meta pages to be submitted for IO */ <= if fallocate() here, f2fs_wait_on_all_pages(sbi, F2FS_DIRTY_META); <= it'll wait forever. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-28fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info()Alexander Potapenko
KMSAN reported uninitialized data being written to disk when dumping core. As a result, several kilobytes of kmalloc memory may be written to the core file and then read by a non-privileged user. Reported-by: sam <sunhaoyl@outlook.com> Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200419100848.63472-1-glider@google.com Link: https://github.com/google/kmsan/issues/76 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-28rxrpc: add rxrpc_sock_set_min_security_levelChristoph Hellwig
Add a helper to directly set the RXRPC_MIN_SECURITY_LEVEL sockopt from kernel space without going through a fake uaccess. Thanks to David Howells for the documentation updates. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28tcp: add tcp_sock_set_user_timeoutChristoph Hellwig
Add a helper to directly set the TCP_USER_TIMEOUT sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28tcp: add tcp_sock_set_nodelayChristoph Hellwig
Add a helper to directly set the TCP_NODELAY sockopt from kernel space without going through a fake uaccess. Cleanup the callers to avoid pointless wrappers now that this is a simple function call. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28tcp: add tcp_sock_set_corkChristoph Hellwig
Add a helper to directly set the TCP_CORK sockopt from kernel space without going through a fake uaccess. Cleanup the callers to avoid pointless wrappers now that this is a simple function call. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28net: add sock_set_rcvbufChristoph Hellwig
Add a helper to directly set the SO_RCVBUFFORCE sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28net: add sock_set_keepaliveChristoph Hellwig
Add a helper to directly set the SO_KEEPALIVE sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28net: add sock_set_sndtimeoChristoph Hellwig
Add a helper to directly set the SO_SNDTIMEO_NEW sockopt from kernel space without going through a fake uaccess. The interface is simplified to only pass the seconds value, as that is the only thing needed at the moment. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28net: add sock_set_reuseaddrChristoph Hellwig
Add a helper to directly set the SO_REUSEADDR sockopt from kernel space without going through a fake uaccess. For this the iscsi target now has to formally depend on inet to avoid a mostly theoretical compile failure. For actual operation it already did depend on having ipv4 or ipv6 support. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28Merge branch 'for-next/scs' into for-next/coreWill Deacon
Support for Clang's Shadow Call Stack in the kernel (Sami Tolvanen and Will Deacon) * for-next/scs: arm64: entry-ftrace.S: Update comment to indicate that x18 is live scs: Move DEFINE_SCS macro into core code scs: Remove references to asm/scs.h from core code scs: Move scs_overflow_check() out of architecture code arm64: scs: Use 'scs_sp' register alias for x18 scs: Move accounting into alloc/free functions arm64: scs: Store absolute SCS stack pointer value in thread_info efi/libstub: Disable Shadow Call Stack arm64: scs: Add shadow stacks for SDEI arm64: Implement Shadow Call Stack arm64: Disable SCS for hypervisor code arm64: vdso: Disable Shadow Call Stack arm64: efi: Restore register x18 if it was corrupted arm64: Preserve register x18 when CPU is suspended arm64: Reserve register x18 from general allocation with SCS scs: Disable when function graph tracing is enabled scs: Add support for stack usage debugging scs: Add page accounting for shadow call stack allocations scs: Add support for Clang's Shadow Call Stack (SCS)
2020-05-28btrfs: fix space_info bytes_may_use underflow during space cache writeoutFilipe Manana
We always preallocate a data extent for writing a free space cache, which causes writeback to always try the nocow path first, since the free space inode has the prealloc bit set in its flags. However if the block group that contains the data extent for the space cache has been turned to RO mode due to a running scrub or balance for example, we have to fallback to the cow path. In that case once a new data extent is allocated we end up calling btrfs_add_reserved_bytes(), which decrements the counter named bytes_may_use from the data space_info object with the expection that this counter was previously incremented with the same amount (the size of the data extent). However when we started writeout of the space cache at cache_save_setup(), we incremented the value of the bytes_may_use counter through a call to btrfs_check_data_free_space() and then decremented it through a call to btrfs_prealloc_file_range_trans() immediately after. So when starting the writeback if we fallback to cow mode we have to increment the counter bytes_may_use of the data space_info again to compensate for the extent allocation done by the cow path. When this issue happens we are incorrectly decrementing the bytes_may_use counter and when its current value is smaller then the amount we try to subtract we end up with the following warning: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...) CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Workqueue: writeback wb_workfn (flush-btrfs-1591) RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] Code: ff ff 48 (...) RSP: 0000:ffffa41608f13660 EFLAGS: 00010287 RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410 RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000 R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400 R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400 FS: 0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: find_free_extent+0x4a0/0x16c0 [btrfs] btrfs_reserve_extent+0x91/0x180 [btrfs] cow_file_range+0x12d/0x490 [btrfs] btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs] ? find_lock_delalloc_range+0x221/0x250 [btrfs] writepage_delalloc+0xe8/0x150 [btrfs] __extent_writepage+0xe8/0x4c0 [btrfs] extent_write_cache_pages+0x237/0x530 [btrfs] extent_writepages+0x44/0xa0 [btrfs] do_writepages+0x23/0x80 __writeback_single_inode+0x59/0x700 writeback_sb_inodes+0x267/0x5f0 __writeback_inodes_wb+0x87/0xe0 wb_writeback+0x382/0x590 ? wb_workfn+0x4a2/0x6c0 wb_workfn+0x4a2/0x6c0 process_one_work+0x26d/0x6a0 worker_thread+0x4f/0x3e0 ? process_one_work+0x6a0/0x6a0 kthread+0x103/0x140 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x3a/0x50 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace bd7c03622e0b0a52 ]--- ------------[ cut here ]------------ So fix this by incrementing the bytes_may_use counter of the data space_info when we fallback to the cow path. If the cow path is successful the counter is decremented after extent allocation (by btrfs_add_reserved_bytes()), if it fails it ends up being decremented as well when clearing the delalloc range (extent_clear_unlock_delalloc()). This could be triggered sporadically by the test case btrfs/061 from fstests. Fixes: 82d5902d9c681b ("Btrfs: Support reading/writing on disk free ino cache") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28btrfs: fix space_info bytes_may_use underflow after nocow buffered writeFilipe Manana
When doing a buffered write we always try to reserve data space for it, even when the file has the NOCOW bit set or the write falls into a file range covered by a prealloc extent. This is done both because it is expensive to check if we can do a nocow write (checking if an extent is shared through reflinks or if there's a hole in the range for example), and because when writeback starts we might actually need to fallback to COW mode (for example the block group containing the target extents was turned into RO mode due to a scrub or balance). When we are unable to reserve data space we check if we can do a nocow write, and if we can, we proceed with dirtying the pages and setting up the range for delalloc. In this case the bytes_may_use counter of the data space_info object is not incremented, unlike in the case where we are able to reserve data space (done through btrfs_check_data_free_space() which calls btrfs_alloc_data_chunk_ondemand()). Later when running delalloc we attempt to start writeback in nocow mode but we might revert back to cow mode, for example because in the meanwhile a block group was turned into RO mode by a scrub or relocation. The cow path after successfully allocating an extent ends up calling btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of the data space_info object to have been incremented before - but we did not do it when the buffered write started, since there was not enough available data space. So btrfs_add_reserved_bytes() ends up decrementing the bytes_may_use counter anyway, and when the counter's current value is smaller then the size of the allocated extent we get a stack trace like the following: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...) CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Workqueue: writeback wb_workfn (flush-btrfs-1754) RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] Code: ff ff 48 (...) RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287 RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410 RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000 R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400 R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800 FS: 0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: find_free_extent+0x4a0/0x16c0 [btrfs] btrfs_reserve_extent+0x91/0x180 [btrfs] cow_file_range+0x12d/0x490 [btrfs] run_delalloc_nocow+0x341/0xa40 [btrfs] btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs] ? find_lock_delalloc_range+0x221/0x250 [btrfs] writepage_delalloc+0xe8/0x150 [btrfs] __extent_writepage+0xe8/0x4c0 [btrfs] extent_write_cache_pages+0x237/0x530 [btrfs] ? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs] extent_writepages+0x44/0xa0 [btrfs] do_writepages+0x23/0x80 __writeback_single_inode+0x59/0x700 writeback_sb_inodes+0x267/0x5f0 __writeback_inodes_wb+0x87/0xe0 wb_writeback+0x382/0x590 ? wb_workfn+0x4a2/0x6c0 wb_workfn+0x4a2/0x6c0 process_one_work+0x26d/0x6a0 worker_thread+0x4f/0x3e0 ? process_one_work+0x6a0/0x6a0 kthread+0x103/0x140 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x3a/0x50 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace f9f6ef8ec4cd8ec9 ]--- So to fix this, when falling back into cow mode check if space was not reserved, by testing for the bit EXTENT_NORESERVE in the respective file range, and if not, increment the bytes_may_use counter for the data space_info object. Also clear the EXTENT_NORESERVE bit from the range, so that if the cow path fails it decrements the bytes_may_use counter when clearing the delalloc range (through the btrfs_clear_delalloc_extent() callback). Fixes: 7ee9e4405f264e ("Btrfs: check if we can nocow if we don't have data space") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28btrfs: fix wrong file range cleanup after an error filling dealloc rangeFilipe Manana
If an error happens while running dellaloc in COW mode for a range, we can end up calling extent_clear_unlock_delalloc() for a range that goes beyond our range's end offset by 1 byte, which affects 1 extra page. This results in clearing bits and doing page operations (such as a page unlock) outside our target range. Fix that by calling extent_clear_unlock_delalloc() with an inclusive end offset, instead of an exclusive end offset, at cow_file_range(). Fixes: a315e68f6e8b30 ("Btrfs: fix invalid attempt to free reserved space on failure to cow range") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28btrfs: remove redundant local variable in read_block_for_searchNikolay Borisov
The local 'b' variable is only used to directly read values from passed extent buffer. So eliminate it and directly use the input parameter. Furthermore this shrinks the size of the following functions: ./scripts/bloat-o-meter ctree.orig fs/btrfs/ctree.o add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-73 (-73) Function old new delta read_block_for_search.isra 876 871 -5 push_node_left 1112 1044 -68 Total: Before=50348, After=50275, chg -0.14% Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28btrfs: open code key_searchNikolay Borisov
This function wraps the optimisation implemented by d7396f07358a ("Btrfs: optimize key searches in btrfs_search_slot") however this optimisation is really used in only one place - btrfs_search_slot. Just open code the optimisation and also add a comment explaining how it works since it's not clear just by looking at the code - the key point here is it depends on an internal invariant that BTRFS' btree provides, namely intermediate pointers always contain the key at slot0 at the child node. So in the case of exact match we can safely assume that the given key will always be in slot 0 on lower levels. Furthermore this results in a reduction of btrfs_search_slot's size: ./scripts/bloat-o-meter ctree.orig fs/btrfs/ctree.o add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-75 (-75) Function old new delta btrfs_search_slot 2783 2708 -75 Total: Before=50423, After=50348, chg -0.15% Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28btrfs: split btrfs_direct_IO to read and write partChristoph Hellwig
The read and write versions don't have anything in common except for the call to iomap_dio_rw. So split this function, and merge each half into its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28btrfs: remove BTRFS_INODE_READDIO_NEED_LOCKGoldwyn Rodrigues
Since we now perform direct reads using i_rwsem, we can remove this inode flag used to co-ordinate unlocked reads. The truncate call takes i_rwsem. This means it is correctly synchronized with concurrent direct reads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>