linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2020-07-07	xfs: unwind log item error flagging	Dave Chinner
	When an buffer IO error occurs, we want to mark all the log items attached to the buffer as failed. Open code the error handling loop so that we can modify the flagging for the different types of objects directly and independently of each other. This also allows us to remove the ->iop_error method from the log item operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07	xfs: handle buffer log item IO errors directly	Dave Chinner
	Currently when a buffer with attached log items has an IO error it called ->iop_error for each attched log item. These all call xfs_set_li_failed() to handle the error, but we are about to change the way log items manage buffers. hence we first need to remove the per-item dependency on buffer handling done by xfs_set_li_failed(). We already have specific buffer type IO completion routines, so move the log item error handling out of the generic error handling and into the log item specific functions so we can implement per-type error handling easily. This requires a more complex return value from the error handling code so that we can take the correct action the failure handling requires. This results in some repeated boilerplate in the functions, but that can be cleaned up later once all the changes cascade through this code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07	xfs: get rid of log item callbacks	Dave Chinner
	They are not used anymore, so remove them from the log item and the buffer iodone attachment interfaces. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07	xfs: clean up the buffer iodone callback functions	Dave Chinner
	Now that we've sorted inode and dquot buffers, we can apply the same cleanups to dirty buffers with buffer log items. They only have one callback, too, so we don't need the log item callback. Collapse the iodone functions and remove all the now unnecessary infrastructure around callback processing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07	btrfs: discard: add missing put when grabbing block group from unused list	Qu Wenruo
	[BUG] The following small test script can trigger ASSERT() at unmount time: mkfs.btrfs -f $dev mount $dev $mnt mount -o remount,discard=async $mnt umount $mnt The call trace: assertion failed: atomic_read(&block_group->count) == 1, in fs/btrfs/block-group.c:3431 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3204! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 4 PID: 10389 Comm: umount Tainted: G O 5.8.0-rc3-custom+ #68 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: btrfs_free_block_groups.cold+0x22/0x55 [btrfs] close_ctree+0x2cb/0x323 [btrfs] btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The code: ASSERT(atomic_read(&block_group->count) == 1); btrfs_put_block_group(block_group); [CAUSE] Obviously it's some btrfs_get_block_group() call doesn't get its put call. The offending btrfs_get_block_group() happens here: void btrfs_mark_bg_unused(struct btrfs_block_group bg) { if (list_empty(&bg->bg_list)) { btrfs_get_block_group(bg); list_add_tail(&bg->bg_list, &fs_info->unused_bgs); } } So every call sites removing the block group from unused_bgs list should reduce the ref count of that block group. However for async discard, it didn't follow the call convention: void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info fs_info) { list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs, bg_list) { list_del_init(&block_group->bg_list); btrfs_discard_queue_work(&fs_info->discard_ctl, block_group); } } And in btrfs_discard_queue_work(), it doesn't call btrfs_put_block_group() either. [FIX] Fix the problem by reducing the reference count when we grab the block group from unused_bgs list. Reported-by: Marcos Paulo de Souza <mpdesouza@suse.com> Fixes: 6e80d4f8c422 ("btrfs: handle empty block_group removal for async discard") CC: stable@vger.kernel.org # 5.6+ Tested-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-06	pstore: Fix linking when crypto API disabled	Matteo Croce
	When building a kernel with CONFIG_PSTORE=y and CONFIG_CRYPTO not set, a build error happens: ld: fs/pstore/platform.o: in function `pstore_dump': platform.c:(.text+0x3f9): undefined reference to `crypto_comp_compress' ld: fs/pstore/platform.o: in function `pstore_get_backend_records': platform.c:(.text+0x784): undefined reference to `crypto_comp_decompress' This because some pstore code uses crypto_comp_(de)compress regardless of the CONFIG_CRYPTO status. Fix it by wrapping the (de)compress usage by IS_ENABLED(CONFIG_PSTORE_COMPRESS) Signed-off-by: Matteo Croce <mcroce@linux.microsoft.com> Link: https://lore.kernel.org/lkml/20200706234045.9516-1-mcroce@linux.microsoft.com Fixes: cb3bee0369bc ("pstore: Use crypto compress API") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-06	iomap: Make sure iomap_end is called after iomap_begin	Andreas Gruenbacher
	Make sure iomap_end is always called when iomap_begin succeeds. Without this fix, iomap_end won't be called when a filesystem's iomap_begin operation returns an invalid mapping, bypassing any unlocking done in iomap_end. With this fix, the unlocking will still happen. This bug was found by Bob Peterson during code review. It's unlikely that such iomap_begin bugs will survive to affect users, so backporting this fix seems unnecessary. Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: use direct calls for dquot IO completion	Dave Chinner
	Similar to inodes, we can call the dquot IO completion functions directly from the buffer completion code, removing another user of log item callbacks for IO completion processing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: make inode IO completion buffer centric	Dave Chinner
	Having different io completion callbacks for different inode states makes things complex. We can detect if the inode is stale via the XFS_ISTALE flag in IO completion, so we don't need a special callback just for this. This means inodes only have a single iodone callback, and inode IO completion is entirely buffer centric at this point. Hence we no longer need to use a log item callback at all as we can just call xfs_iflush_done() directly from the buffer completions and walk the buffer log item list to complete the all inodes under IO. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: clean up whacky buffer log item list reinit	Dave Chinner
	When we've emptied the buffer log item list, it does a list_del_init on itself to reset it's pointers to itself. This is unnecessary as the list is already empty at this point - it was a left-over fragment from the list_head conversion of the buffer log item list. Remove them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: call xfs_buf_iodone directly	Dave Chinner
	All unmarked dirty buffers should be in the AIL and have log items attached to them. Hence when they are written, we will run a callback to remove the item from the AIL if appropriate. Now that we've handled inode and dquot buffers, all remaining calls are to xfs_buf_iodone() and so we can hard code this rather than use an indirect call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: mark log recovery buffers for completion	Dave Chinner
	Log recovery has it's own buffer write completion handler for buffers that it directly recovers. Convert these to direct calls by flagging these buffers as being log recovery buffers. The flag will get cleared by the log recovery IO completion routine, so it will never leak out of log recovery. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: mark dquot buffers in cache	Dave Chinner
	dquot buffers always have write IO callbacks, so by marking them directly we can avoid needing to attach ->b_iodone functions to them. This avoids an indirect call, and makes future modifications much simpler. This is largely a rearrangement of the code at this point - no IO completion functionality changes at this point, just how the code is run is modified. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: mark inode buffers in cache	Dave Chinner
	Inode buffers always have write IO callbacks, so by marking them directly we can avoid needing to attach ->b_iodone functions to them. This avoids an indirect call, and makes future modifications much simpler. While this is largely a refactor of existing functionality, we broaden the scope of the flag to beyond where inodes are explicitly attached because future changes need to know what type of log items are attached to the buffer. Adding this buffer flag may invoke the inode iodone callback in cases where it wouldn't have been previously, but this is not a functional change because the callback is identical to the normal buffer write iodone callback when inodes are not attached. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: add an inode item lock	Dave Chinner
	The inode log item is kind of special in that it can be aggregating new changes in memory at the same time time existing changes are being written back to disk. This means there are fields in the log item that are accessed concurrently from contexts that don't share any locking at all. e.g. updating ili_last_fields occurs at flush time under the ILOCK_EXCL and flush lock at flush time, under the flush lock at IO completion time, and is read under the ILOCK_EXCL when the inode is logged. Hence there is no actual serialisation between reading the field during logging of the inode in transactions vs clearing the field in IO completion. We currently get away with this by the fact that we are only clearing fields in IO completion, and nothing bad happens if we accidentally log more of the inode than we actually modify. Worst case is we consume a tiny bit more memory and log bandwidth. However, if we want to do more complex state manipulations on the log item that requires updates at all three of these potential locations, we need to have some mechanism of serialising those operations. To do this, introduce a spinlock into the log item to serialise internal state. This could be done via the xfs_inode i_flags_lock, but this then leads to potential lock inversion issues where inode flag updates need to occur inside locks that best nest inside the inode log item locks (e.g. marking inodes stale during inode cluster freeing). Using a separate spinlock avoids these sorts of problems and simplifies future code. This does not touch the use of ili_fields in the item formatting code - that is entirely protected by the ILOCK_EXCL at this point in time, so it remains untouched. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: remove logged flag from inode log item	Dave Chinner
	This was used to track if the item had logged fields being flushed to disk. We log everything in the inode these days, so this logic is no longer needed. Remove it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: Don't allow logging of XFS_ISTALE inodes	Dave Chinner
	In tracking down a problem in this patchset, I discovered we are reclaiming dirty stale inodes. This wasn't discovered until inodes were always attached to the cluster buffer and then the rcu callback that freed inodes was assert failing because the inode still had an active pointer to the cluster buffer after it had been reclaimed. Debugging the issue indicated that this was a pre-existing issue resulting from the way the inodes are handled in xfs_inactive_ifree. When we free a cluster buffer from xfs_ifree_cluster, all the inodes in cache are marked XFS_ISTALE. Those that are clean have nothing else done to them and so eventually get cleaned up by background reclaim. i.e. it is assumed we'll never dirty/relog an inode marked XFS_ISTALE. On journal commit dirty stale inodes as are handled by both buffer and inode log items to run though xfs_istale_done() and removed from the AIL (buffer log item commit) or the log item will simply unpin it because the buffer log item will clean it. What happens to any specific inode is entirely dependent on which log item wins the commit race, but the result is the same - stale inodes are clean, not attached to the cluster buffer, and not in the AIL. Hence inode reclaim can just free these inodes without further care. However, if the stale inode is relogged, it gets dirtied again and relogged into the CIL. Most of the time this isn't an issue, because relogging simply changes the inode's location in the current checkpoint. Problems arise, however, when the CIL checkpoints between two transactions in the xfs_inactive_ifree() deferops processing. This results in the XFS_ISTALE inode being redirtied and inserted into the CIL without any of the other stale cluster buffer infrastructure being in place. Hence on journal commit, it simply gets unpinned, so it remains dirty in memory. Everything in inode writeback avoids XFS_ISTALE inodes so it can't be written back, and it is not tracked in the AIL so there's not even a trigger to attempt to clean the inode. Hence the inode just sits dirty in memory until inode reclaim comes along, sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming of a dirty inode caused use after free, list corruptions and other nasty issues later in this patchset. Hence this patch addresses a violation of the "never log XFS_ISTALE inodes" caused by the deferops processing rolling a transaction and relogging a stale inode in xfs_inactive_free. It also adds a bunch of asserts to catch this problem in debug kernels so that we don't reintroduce this problem in future. Reproducer for this issue was generic/558 on a v4 filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: remove useless definitions in xfs_linux.h	Yafang Shao
	Remove current_pid(), current_test_flags() and current_clear_flags_nested(), because they are useless. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: use MMAPLOCK around filemap_map_pages()	Dave Chinner
	The page faultround path ->map_pages is implemented in XFS via filemap_map_pages(). This function checks that pages found in page cache lookups have not raced with truncate based invalidation by checking page->mapping is correct and page->index is within EOF. However, we've known for a long time that this is not sufficient to protect against races with invalidations done by operations that do not change EOF. e.g. hole punching and other fallocate() based direct extent manipulations. The way we protect against these races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED lock so they serialise against fallocate and truncate before calling into the filemap function that processes the fault. Do the same for XFS's ->map_pages implementation to close this potential data corruption issue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: move helpers that lock and unlock two inodes against userspace IO	Darrick J. Wong
	Move the double-inode locking helpers to xfs_inode.c since they're not specific to reflink. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06	xfs: refactor locking and unlocking two inodes against userspace IO	Darrick J. Wong
	Refactor the two functions that we use to lock and unlock two inodes to block userspace from initiating IO against a file, whether via system calls or mmap activity. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06	xfs: fix xfs_reflink_remap_prep calling conventions	Darrick J. Wong
	Fix the return value of xfs_reflink_remap_prep so that its return value conventions match the rest of xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06	xfs: reflink can skip remap existing mappings	Darrick J. Wong
	If the source and destination map are identical, we can skip the remap step to save some time. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06	xfs: only reserve quota blocks if we're mapping into a hole	Darrick J. Wong
	When logging quota block count updates during a reflink operation, we only log the /delta/ of the block count changes to the dquot. Since we now know ahead of time the extent type of both dmap and smap (and that they have the same length), we know that we only need to reserve quota blocks for dmap's blockcount if we're mapping it into a hole. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06	xfs: only reserve quota blocks for bmbt changes if we're changing the data fork	Darrick J. Wong
	Now that we've reworked xfs_reflink_remap_extent to remap only one extent per transaction, we actually know if the extent being removed is an allocated mapping. This means that we now know ahead of time if we're going to be touching the data fork. Since we only need blocks for a bmbt split if we're going to update the data fork, we only need to get quota reservation if we know we're going to touch the data fork. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06	xfs: redesign the reflink remap loop to fix blkres depletion crash	Darrick J. Wong
	The existing reflink remapping loop has some structural problems that need addressing: The biggest problem is that we create one transaction for each extent in the source file without accounting for the number of mappings there are for the same range in the destination file. In other words, we don't know the number of remap operations that will be necessary and we therefore cannot guess the block reservation required. On highly fragmented filesystems (e.g. ones with active dedupe) we guess wrong, run out of block reservation, and fail. The second problem is that we don't actually use the bmap intents to their full potential -- instead of calling bunmapi directly and having to deal with its backwards operation, we could call the deferred ops xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead. This makes the frontend loop much simpler. Solve all of these problems by refactoring the remapping loops so that we only perform one remapping operation per transaction, and each operation only tries to remap a single extent from source to dest. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reported-by: Edwin Török <edwin@etorok.net> Tested-by: Edwin Török <edwin@etorok.net>
2020-07-06	xfs: rename xfs_bmap_is_real_extent to is_written_extent	Darrick J. Wong
	The name of this predicate is a little misleading -- it decides if the extent mapping is allocated and written. Change the name to be more direct, as we're going to add a new predicate in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06	xfs: fix reflink quota reservation accounting error	Darrick J. Wong
	Quota reservations are supposed to account for the blocks that might be allocated due to a bmap btree split. Reflink doesn't do this, so fix this to make the quota accounting more accurate before we start rearranging things. Fixes: 862bb360ef56 ("xfs: reflink extents from one file to another") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06	xfs: don't eat an EIO/ENOSPC writeback error when scrubbing data fork	Darrick J. Wong
	The data fork scrubber calls filemap_write_and_wait to flush dirty pages and delalloc reservations out to disk prior to checking the data fork's extent mappings. Unfortunately, this means that scrub can consume the EIO/ENOSPC errors that would otherwise have stayed around in the address space until (we hope) the writer application calls fsync to persist data and collect errors. The end result is that programs that wrote to a file might never see the error code and proceed as if nothing were wrong. xfs_scrub is not in a position to notify file writers about the writeback failure, and it's only here to check metadata, not file contents. Therefore, if writeback fails, we should stuff the error code back into the address space so that an fsync by the writer application can pick that up. Fixes: 99d9d8d05da2 ("xfs: scrub inode block mappings") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-07-06	xfs: preserve rmapbt swapext block reservation from freed blocks	Brian Foster
	The rmapbt extent swap algorithm remaps individual extents between the source inode and the target to trigger reverse mapping metadata updates. If either inode straddles a format or other bmap allocation boundary, the individual unmap and map cycles can trigger repeated bmap block allocations and frees as the extent count bounces back and forth across the boundary. While net block usage is bound across the swap operation, this behavior can prematurely exhaust the transaction block reservation because it continuously drains as the transaction rolls. Each allocation accounts against the reservation and each free returns to global free space on transaction roll. The previous workaround to this problem attempted to detect this boundary condition and provide surplus block reservation to acommodate it. This is insufficient because more remaps can occur than implied by the extent counts; if start offset boundaries are not aligned between the two inodes, for example. To address this problem more generically and dynamically, add a transaction accounting mode that returns freed blocks to the transaction reservation instead of the superblock counters on transaction roll and use it when the rmapbt based algorithm is active. This allows the chain of remap transactions to preserve the block reservation based own its own frees and prevent premature exhaustion regardless of the remap pattern. Note that this is only safe for superblocks with lazy sb accounting, but the latter is required for v5 supers and the rmap feature depends on v5. Fixes: b3fed434822d0 ("xfs: account format bouncing into rmapbt swapext tx reservation") Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	xfs: Couple of typo fixes in comments	Keyur Patel
	./xfs/libxfs/xfs_inode_buf.c:56: unnecssary ==> unnecessary ./xfs/libxfs/xfs_inode_buf.c:59: behavour ==> behaviour ./xfs/libxfs/xfs_inode_buf.c:206: unitialized ==> uninitialized Signed-off-by: Keyur Patel <iamkeyur96@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06	io_uring: briefly loose locks while reaping events	Pavel Begunkov
	It's not nice to hold @uring_lock for too long io_iopoll_reap_events(). For instance, the lock is needed to publish requests to @poll_list, and that locks out tasks doing that for no good reason. Loose it occasionally. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-06	io_uring: fix stopping iopoll'ing too early	Pavel Begunkov
	Nobody adjusts *nr_events (number of completed requests) before calling io_iopoll_getevents(), so the passed @min shouldn't be adjusted as well. Othewise it can return less than initially asked @min without hitting need_resched(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-06	io_uring: don't delay iopoll'ed req completion	Pavel Begunkov
	->iopoll() may have completed current request, but instead of reaping it, io_do_iopoll() just continues with the next request in the list. As a result it can leave just polled and completed request in the list up until next syscall. Even outer loop in io_iopoll_getevents() doesn't help the situation. E.g. poll_list: req0 -> req1 If req0->iopoll() completed both requests, and @min<=1, then @req0 will be left behind. Check whether a req was completed after ->iopoll(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05	io_uring: fix lost cqe->flags	Pavel Begunkov
	Don't forget to fill cqe->flags properly in io_submit_flush_completions() Fixes: a1d7c393c4711 ("io_uring: enable READ/WRITE to use deferred completions") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05	io_uring: keep queue_sqe()'s fail path separately	Pavel Begunkov
	A preparation path, extracts error path into a separate block. It looks saner then calling req_set_fail_links() after io_put_req_find_next(), even though it have been working well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05	io_uring: fix mis-refcounting linked timeouts	Pavel Begunkov
	io_prep_linked_timeout() sets REQ_F_LINK_TIMEOUT altering refcounting of the following linked request. After that someone should call io_queue_linked_timeout(), otherwise a submission reference of the linked timeout won't be ever dropped. That's what happens in io_steal_work() if io-wq decides to postpone linked request with io_wqe_enqueue(). io_queue_linked_timeout() can also be potentially called twice without synchronisation during re-submission, e.g. io_rw_resubmit(). There are the rules, whoever did io_prep_linked_timeout() must also call io_queue_linked_timeout(). To not do it twice, io_prep_linked_timeout() will return non NULL only for the first call. That's controlled by REQ_F_LINK_TIMEOUT flag. Also kill REQ_F_QUEUE_TIMEOUT. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05	io_uring: use new io_req_task_work_add() helper throughout	Jens Axboe
	Since we now have that in the 5.9 branch, convert the existing users of task_work_add() to use this new helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05	io_uring: abstract out task work running	Jens Axboe
	Provide a helper to run task_work instead of checking and running manually in a bunch of different spots. While doing so, also move the task run state setting where we run the task work. Then we can move it out of the callback helpers. This also helps ensure we only do this once per task_work list run, not per task_work item. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05	Merge branch 'io_uring-5.8' into for-5.9/io_uring	Jens Axboe
	Pull in task_work changes from the 5.8 series, as we'll need to apply the same kind of changes to other parts in the 5.9 branch. * io_uring-5.8: io_uring: fix regression with always ignoring signals in io_cqring_wait() io_uring: use signal based task_work running task_work: teach task_work_add() to do signal_wake_up()
2020-07-05	Replace HTTP links with HTTPS ones: CIFS	Alexander A. Klimov
	Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `\bxmlns\b`: For each link, `\bhttp://[^# \t\r\n]*(?:\w\|/)`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Link: https://lore.kernel.org/r/20200627103125.71828-1-grandmaster@al2klimov.de Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-07-05	Merge tag 'io_uring-5.8-2020-07-05' of git://git.kernel.dk/linux-block	Linus Torvalds
	Pull io_uring fix from Jens Axboe: "Andres reported a regression with the fix that was merged earlier this week, where his setup of using signals to interrupt io_uring CQ waits no longer worked correctly. Fix this, and also limit our use of TWA_SIGNAL to the case where we need it, and continue using TWA_RESUME for task_work as before. Since the original is marked for 5.7 stable, let's flush this one out early" * tag 'io_uring-5.8-2020-07-05' of git://git.kernel.dk/linux-block: io_uring: fix regression with always ignoring signals in io_cqring_wait()
2020-07-04	io_uring: fix regression with always ignoring signals in io_cqring_wait()	Jens Axboe
	When switching to TWA_SIGNAL for task_work notifications, we also made any signal based condition in io_cqring_wait() return -ERESTARTSYS. This breaks applications that rely on using signals to abort someone waiting for events. Check if we have a signal pending because of queued task_work, and repeat the signal check once we've run the task_work. This provides a reliable way of telling the two apart. Additionally, only use TWA_SIGNAL if we are using an eventfd. If not, we don't have the dependency situation described in the original commit, and we can get by with just using TWA_RESUME like we previously did. Fixes: ce593a6c480a ("io_uring: use signal based task_work running") Cc: stable@vger.kernel.org # v5.7 Reported-by: Andres Freund <andres@anarazel.de> Tested-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-04	exec: Remove do_execve_file	Eric W. Biederman
	Now that the last callser has been removed remove this code from exec. For anyone thinking of resurrecing do_execve_file please note that the code was buggy in several fundamental ways. - It did not ensure the file it was passed was read-only and that deny_write_access had been called on it. Which subtlely breaks invaniants in exec. - The caller of do_execve_file was expected to hold and put a reference to the file, but an extra reference for use by exec was not taken so that when exec put it's reference to the file an underflow occured on the file reference count. - The point of the interface was so that a pathname did not need to exist. Which breaks pathname based LSMs. Tetsuo Handa originally reported these issues[1]. While it was clear that deny_write_access was missing the fundamental incompatibility with the passed in O_RDWR filehandle was not immediately recognized. All of these issues were fixed by modifying the usermode driver code to have a path, so it did not need this hack. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/ v1: https://lkml.kernel.org/r/871rm2f0hi.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87lfk54p0m.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-10-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-03	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	Linus Torvalds
	Pull sysctl fix from Al Viro: "Another regression fix for sysctl changes this cycle..." * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Call sysctl_head_finish on error
2020-07-03	Merge tag '5.8-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6	Linus Torvalds
	Pull cifs fixes from Steve French: "Eight cifs/smb3 fixes, most when specifying the multiuser mount flag. Five of the fixes are for stable" * tag '5.8-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: prevent truncation from long to int in wait_for_free_credits cifs: Fix the target file was deleted when rename failed. SMB3: Honor 'posix' flag for multiuser mounts SMB3: Honor 'handletimeout' flag for multiuser mounts SMB3: Honor lease disabling for multiuser mounts SMB3: Honor persistent/resilient handle flags for multiuser mounts SMB3: Honor 'seal' flag for multiuser mounts cifs: Display local UID details for SMB sessions in DebugData
2020-07-03	Merge tag 'xfs-5.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux	Linus Torvalds
	Pull xfs fix from Darrick Wong: "Fix a use-after-free bug when the fs shuts down" * tag 'xfs-5.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix use-after-free on CIL context on shutdown
2020-07-03	Merge tag 'gfs2-v5.8-rc3.fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fixes from Andreas Gruenbacher: "Various gfs2 fixes" * tag 'gfs2-v5.8-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: The freeze glock should never be frozen gfs2: When freezing gfs2, use GL_EXACT and not GL_NOCACHE gfs2: read-only mounts should grab the sd_freeze_gl glock gfs2: freeze should work on read-only mounts gfs2: eliminate GIF_ORDERED in favor of list_empty gfs2: Don't sleep during glock hash walk gfs2: fix trans slab error when withdraw occurs inside log_flush gfs2: Don't return NULL from gfs2_inode_lookup
2020-07-03	Call sysctl_head_finish on error	Matthew Wilcox (Oracle)
	This error path returned directly instead of calling sysctl_head_finish(). Fixes: ef9d965bc8b6 ("sysctl: reject gigantic reads/write to sysctl files") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-07-03	gfs2: The freeze glock should never be frozen	Bob Peterson
	Before this patch, some gfs2 code locked the freeze glock with LM_FLAG_NOEXP (Do not freeze) flag, and some did not. We never want to freeze the freeze glock, so this patch makes it consistently use LM_FLAG_NOEXP always. Signed-off-by: Bob Peterson <rpeterso@redhat.com>