path: root/fs
2020-03-02  xfs: add agf freeblocks verify in xfs_agf_verify  (Zheng Bin)
We recently used fuzz (hydra) to test XFS and automatically generate tmp.img (XFS v5 format, but some metadata is wrong).

xfs_repair information (just one AG):
  agf_freeblks 0, counted 3224 in ag 0
  agf_longest 536874136, counted 3224 in ag 0
  sb_fdblocks 613, counted 3228

Test as follows:
  mount tmp.img tmpdir
  cp file1M tmpdir
  sync

In 4.19-stable, sync gets stuck; the reason is:

  xfs_mountfs
    xfs_check_summary_counts
      if ((!xfs_sb_version_haslazysbcount(&mp->m_sb) ||
           XFS_LAST_UNMOUNT_WAS_CLEAN(mp)) &&
          !xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS))
              return 0;
      --> just return, incore sb_fdblocks is still 613
    xfs_initialize_perag_data

  cp file1M tmpdir   --> ok (write file to pagecache)
  sync               --> stuck (write pagecache to disk)

  xfs_map_blocks
    xfs_iomap_write_allocate
      while (count_fsb != 0) {
        nimaps = 0;
        while (nimaps == 0) {            --> endless loop
          nimaps = 1;
          xfs_bmapi_write(..., &nimaps)  --> nimaps becomes 0 again
  xfs_bmapi_write
    xfs_bmap_alloc
      xfs_bmap_btalloc
        xfs_alloc_vextent
          xfs_alloc_fix_freelist
            xfs_alloc_space_available    --> fail (agf_freeblks is 0)

In linux-next, sync does not get stuck, because commit c2b3164320b5 ("xfs: use the latest extent at writeback delalloc conversion time") removed the above while loop; dmesg is as follows:
  [ 55.250114] XFS (loop0): page discard on page ffffea0008bc7380, inode 0x1b0c, offset 0.

Users do not know why this page is discarded. The better solution is:
1. Like xfs_repair, make sure sb_fdblocks is equal to counted
   (xfs_initialize_perag_data did this, but it is not called at this mount)
2. Add agf verify; if it fails, tell users to repair

This patch uses the second solution.

Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Ren Xudong <renxudong1@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
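A minimal sketch of the kind of bounds check the patch describes adding to the AGF verifier (not the exact hunk; the comparison chosen here is an assumption, and the buffer-to-AGF conversion follows the usual fs/xfs conventions):

        static xfs_failaddr_t
        xfs_agf_verify(
                struct xfs_buf          *bp)
        {
                struct xfs_agf          *agf = XFS_BUF_TO_AGF(bp);

                /* ... existing magic/UUID/level checks ... */

                /*
                 * Reject an AGF whose free-space counter is obviously
                 * inconsistent with the AG size, so the user is told to run
                 * xfs_repair instead of hitting allocation failures (or the
                 * 4.19 endless loop above) at writeback time.
                 */
                if (be32_to_cpu(agf->agf_freeblks) > be32_to_cpu(agf->agf_length))
                        return __this_address;

                return NULL;
        }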
2020-03-02  xfs: fix iclog release error check race with shutdown  (Brian Foster)
Prior to commit df732b29c8 ("xfs: call xlog_state_release_iclog with l_icloglock held"), xlog_state_release_iclog() always performed a locked check of the iclog error state before proceeding into the sync state processing code. As of this commit, part of xlog_state_release_iclog() was open-coded into xfs_log_release_iclog() and as a result the locked error state check was lost.

The lockless check still exists, but this doesn't account for the possibility of a race with a shutdown being performed by another task causing the iclog state to change while the original task waits on ->l_icloglock. This has reproduced very rarely via generic/475 and manifests as an assert failure in __xlog_state_release_iclog() due to an unexpected iclog state.

Restore the locked error state check in xlog_state_release_iclog() to ensure that an iclog state update via shutdown doesn't race with the iclog release state processing code.

Fixes: df732b29c807 ("xfs: call xlog_state_release_iclog with l_icloglock held")
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
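A hedged sketch of what a locked error-state check in this area looks like (fragment only; the exact placement and error path in the real patch may differ):

        spin_lock(&log->l_icloglock);
        if (iclog->ic_state == XLOG_STATE_IOERROR) {
                /* a shutdown raced in while we waited for l_icloglock */
                spin_unlock(&log->l_icloglock);
                return -EIO;
        }
        /* ... continue with the release/sync state processing under the lock ... */
        spin_unlock(&log->l_icloglock);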
2020-03-02  io_uring: clean up io_close  (Pavel Begunkov)
Don't abuse labels for plain and straightforward code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: Ensure mask is initialized in io_arm_poll_handler  (Nathan Chancellor)
Clang warns:

  fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
          if (def->pollin)
              ^~~~~~~~~~~
  fs/io_uring.c:4182:2: note: uninitialized use occurs here
          mask |= POLLERR | POLLPRI;
          ^~~~
  fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always true
          if (def->pollin)
          ^~~~~~~~~~~~~~~~
  fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence this warning
          __poll_t mask, ret;
                       ^
                        = 0
  1 warning generated.

io_op_defs has many definitions where pollin is not set, so mask might indeed be uninitialized. Initialize it to zero and change the next assignment to |=, so that the assignment does not need changing if further masks are added in the future.

Fixes: d7718a9d25a6 ("io_uring: use poll driven retry for files that support it")
Link: https://github.com/ClangBuiltLinux/linux/issues/916
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
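A minimal sketch of the described fix, assuming the surrounding io_arm_poll_handler() code matches the warning above (the specific poll bits are illustrative):

        __poll_t mask = 0, ret;                 /* was: __poll_t mask, ret; */

        if (def->pollin)
                mask |= POLLIN | POLLRDNORM;    /* |= keeps later additions safe */
        if (def->pollout)
                mask |= POLLOUT | POLLWRNORM;

        mask |= POLLERR | POLLPRI;              /* no longer reads an indeterminate value */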
2020-03-02  io_uring: remove io_prep_next_work()  (Pavel Begunkov)
io-wq cares about the IO_WQ_WORK_UNBOUND flag only while enqueueing, so it's useless to set it for the next request of a link. Thus, remove it from io_prep_linked_timeout() and inline the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: remove extra nxt check after punt  (Pavel Begunkov)
After __io_queue_sqe() ended up in io_queue_async_work(), it's already known that there is no @nxt req, so skip the check and return from the function. Also, @nxt initialisation can now be done just before io_put_req_find_next(), as there is no jumping until it's checked.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: use poll driven retry for files that support it  (Jens Axboe)
Currently io_uring tries any request in a non-blocking manner, if it can, and then retries from a worker thread if we get -EAGAIN. Now that we have a new and fancy poll based retry backend, use that to retry requests if the file supports it.

This means that, for example, an IORING_OP_RECVMSG on a socket no longer requires an async thread to complete the IO. If we get -EAGAIN reading from the socket in a non-blocking manner, we arm a poll handler for notification on when the socket becomes readable. When it does, the pending read is executed directly by the task again, through the io_uring task work handlers. Not only is this faster and more efficient, it also means we're not generating potentially tons of async threads that just sit and block, waiting for the IO to complete.

The feature is marked with IORING_FEAT_FAST_POLL, meaning that async pollable IO is fast, and that poll<link>other_op is fast as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
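A simplified sketch of the new retry decision (function names follow this patch series; the real control flow in __io_queue_sqe() has more branches):

        ret = io_issue_sqe(req, sqe, &nxt, false);      /* non-blocking attempt */
        if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
                if (io_arm_poll_handler(req))
                        return;                   /* re-run from task work once the file is ready */
                io_queue_async_work(req);         /* otherwise fall back to an io-wq worker */
        }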
2020-03-02  io_uring: mark requests that we can do poll async in io_op_defs  (Jens Axboe)
Add a pollin/pollout field to the request table, and have commands that we can safely poll for properly marked. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: add per-task callback handler  (Jens Axboe)
For poll requests, it's not uncommon to link a read (or write) after the poll to execute immediately after the file is marked as ready. Since the poll completion is called inside the waitqueue wake up handler, we have to punt that linked request to async context. This slows down the processing, and actually means it's faster to not use a link for this use case.

We also run into problems if the completion_lock is contended, as we're doing a different lock ordering than the issue side is. Hence we have to do trylock for completion, and if that fails, go async. Poll removal needs to go async as well, for the same reason.

eventfd notification needs special case as well, to avoid stack blowing recursion or deadlocks.

These are all deficiencies that were inherited from the aio poll implementation, but I think we can do better. When a poll completes, simply queue it up in the task poll list. When the task completes the list, we can run dependent links inline as well. This means we never have to go async, and we can remove a bunch of code associated with that, and optimizations to try and make that run faster. The diffstat speaks for itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
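A condensed sketch of the queue-to-task idea (the field and callback names are assumptions based on this series, not a verbatim hunk):

        /* called from the waitqueue wakeup handler when the poll fires */
        static void io_poll_queue_tw(struct io_kiocb *req, struct task_struct *tsk)
        {
                init_task_work(&req->task_work, io_poll_task_func);
                /* true == notify the task; it runs the callback (and any
                 * dependent link) in its own context, so nothing goes async */
                if (!task_work_add(tsk, &req->task_work, true))
                        wake_up_process(tsk);
        }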
2020-03-02  io_uring: store io_kiocb in wait->private  (Jens Axboe)
Store the io_kiocb in the private field instead of the poll entry, this is in preparation for allowing multiple waitqueues. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io-wq: use BIT for ulong hash  (Pavel Begunkov)
@hash_map is unsigned long, but BIT_ULL() is used for manipulations. BIT() is a better match as it returns exactly an unsigned long value.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
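For reference, the two macros from include/linux/bits.h differ only in width, which is exactly the point being made:

        unsigned long hash_map = 0;

        hash_map |= BIT_ULL(3); /* BIT_ULL(n) is (1ULL << n): a 64-bit value even
                                 * on 32-bit targets, wider than the field */
        hash_map |= BIT(3);     /* BIT(n) is (1UL << n): exactly unsigned long,
                                 * matching @hash_map's type */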
2020-03-02  io_uring: remove IO_WQ_WORK_CB  (Pavel Begunkov)
IO_WQ_WORK_CB is used only for linked timeouts, which will be armed before the work setup (i.e. mm, override creds, etc). The setup shouldn't take long, so it's ok to arm it a bit later and get rid of IO_WQ_WORK_CB.

Make io-wq call work->func() only once; callbacks will handle the rest, i.e. the linked timeout handler will do the actual issue. And as a bonus, it removes an extra indirect call.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io-wq: remove unused IO_WQ_WORK_HAS_MM  (Pavel Begunkov)
IO_WQ_WORK_HAS_MM is set but never used, remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: extract kmsg copy helper  (Pavel Begunkov)
io_recvmsg() and io_sendmsg() duplicate the nonblock -EAGAIN finalising part, so add a helper for that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: clean io_poll_complete  (Pavel Begunkov)
Deduplicate the call to io_cqring_fill_event(); plain and easy.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: add splice(2) support  (Pavel Begunkov)
Add support for splice(2).

- the output file is specified as sqe->fd, so it's handled by generic code
- hash_reg_file is handled by generic code as well
- len is 32-bit, but should be fine
- fd_in is a registered file when SPLICE_F_FD_IN_FIXED is set, which is a splice flag (i.e. sqe->splice_flags)

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
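A userspace sketch of driving the new opcode through liburing, assuming the io_uring_prep_splice() helper that appeared alongside this feature (descriptor numbers are placeholders, and as with plain splice(2) one side must be a pipe):

        #include <liburing.h>

        /* Splice up to 64 KiB from in_fd to out_fd through the kernel;
         * -1 offsets mean "use the current file offsets". */
        static int splice_once(struct io_uring *ring, int in_fd, int out_fd)
        {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                struct io_uring_cqe *cqe;
                int ret;

                if (!sqe)
                        return -EBUSY;
                io_uring_prep_splice(sqe, in_fd, -1, out_fd, -1, 65536, 0);
                io_uring_submit(ring);

                ret = io_uring_wait_cqe(ring, &cqe);
                if (ret < 0)
                        return ret;
                ret = cqe->res;                 /* bytes spliced, or -errno */
                io_uring_cqe_seen(ring, cqe);
                return ret;
        }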
2020-03-02  io_uring: add interface for getting files  (Pavel Begunkov)
Preparation without functional changes. Adds io_get_file(), which allows grabbing files not only into req->file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  splice: make do_splice public  (Pavel Begunkov)
Make do_splice() public so that other kernel parts can reuse it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: remove req->in_async  (Pavel Begunkov)
req->in_async is not really needed, it only prevents propagation of @nxt for fast not-blocked submissions. Remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: don't do full *prep_worker() from io-wq  (Pavel Begunkov)
io_prep_async_worker(), called by io_wq_assign_next(), does many useless checks: io_req_work_grab_env() was already called during prep, and @do_hashed is never used. Add io_prep_next_work() -- a simplified version that can be called from io-wq.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: don't call work.func from sync ctx  (Pavel Begunkov)
Many operations define a custom work.func before getting into an io-wq. There are several points against this:

- it calls io_wq_assign_next() from outside io-wq, which may be confusing
- a sync context would go unnecessarily through io_req_cancelled()
- prototypes are quite different, so work != old_work looks strange
- it makes async/sync responsibilities fuzzy
- it adds extra overhead

Don't call the generic path and io-wq handlers from each other, but use helpers instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: io_accept() should hold on to submit reference on retry  (Jens Axboe)
Don't drop an early reference, hang on to it and let the caller drop it. This makes it behave more like "regular" requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io_uring: consider any io_read/write -EAGAIN as final  (Jens Axboe)
If the -EAGAIN happens because of a static condition, then a poll or later retry won't fix it. We must call it again from a blocking context. Play it safe and ensure that any -EAGAIN condition from read or write must retry from async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL  (Pavel Begunkov)
io_wq_flush() is buggy: during cancellation of a flush, the associated work may be passed to the caller's (i.e. io_uring's) @match callback. That callback expects it to be embedded in struct io_kiocb. Cancellation of internal work probably doesn't make a lot of sense to begin with.

As the flush helper is no longer used, just delete it and the associated work flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02  Documentation: nfsroot.rst: Fix references to nfsroot.rst  (Niklas Söderlund)
When converting and moving nfsroot.txt to nfsroot.rst, the references to the old text file were not updated to match the change; fix this.

Fixes: f9a9349846f92b2d ("Documentation: nfsroot.txt: convert to ReST")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20200212181332.520545-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-03-02  io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation  (Pavel Begunkov)
To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may return a next work, which will be ignored and lost. Cancel the whole link.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-01  Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4  (Linus Torvalds)
Pull ext4 fixes from Ted Ts'o:
 "Two more bug fixes (including a regression) for 5.6"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
  jbd2: fix data races at struct journal_head
2020-02-29  ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()  (Dan Carpenter)
If sbi->s_flex_groups_allocated is zero and the first allocation fails then this code will crash. The problem is that "i--" will set "i" to -1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1 is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX is more than zero, the condition is true so we call kvfree(new_groups[-1]). The loop will carry on freeing invalid memory until it crashes. Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access") Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>
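The promotion pitfall is easy to reproduce outside the kernel; this standalone example (mirroring the types named in the message, not the ext4 code itself) shows why the comparison never turns false:

        #include <stdio.h>

        int main(void)
        {
                unsigned int allocated = 0;     /* stands in for s_flex_groups_allocated */
                int i = 0;

                i--;                            /* i is now -1 */

                /* -1 is converted to unsigned for the comparison and becomes
                 * UINT_MAX, so the "loop" condition stays true. */
                if (i >= allocated)
                        printf("still looping: i compares as %u\n", (unsigned int)i);
                return 0;
        }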
2020-02-29  jbd2: fix data races at struct journal_head  (Qian Cai)
journal_head::b_transaction and journal_head::b_next_transaction could be accessed concurrently as noticed by KCSAN:

 LTP: starting fsync04
 /dev/zero: Can't open blockdev
 EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
 EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
 ==================================================================
 BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]

 write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
  __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
  __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
  jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
  (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
  kjournald2+0x13b/0x450 [jbd2]
  kthread+0x1cd/0x1f0
  ret_from_fork+0x27/0x50

 read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
  jbd2_write_access_granted+0x1b2/0x250 [jbd2]
  jbd2_write_access_granted at fs/jbd2/transaction.c:1155
  jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
  __ext4_journal_get_write_access+0x50/0x90 [ext4]
  ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
  ext4_mb_new_blocks+0x54f/0xca0 [ext4]
  ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
  ext4_map_blocks+0x3b4/0x950 [ext4]
  _ext4_get_block+0xfc/0x270 [ext4]
  ext4_get_block+0x3b/0x50 [ext4]
  __block_write_begin_int+0x22e/0xae0
  __block_write_begin+0x39/0x50
  ext4_write_begin+0x388/0xb50 [ext4]
  generic_perform_write+0x15d/0x290
  ext4_buffered_write_iter+0x11f/0x210 [ext4]
  ext4_file_write_iter+0xce/0x9e0 [ext4]
  new_sync_write+0x29c/0x3b0
  __vfs_write+0x92/0xa0
  vfs_write+0x103/0x260
  ksys_write+0x9d/0x130
  __x64_sys_write+0x4c/0x60
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 5 locks held by fsync04/25724:
  #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
  #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
  #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
  #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
  #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
 irq event stamp: 1407125
 hardirqs last enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790
 hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790
 softirqs last enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c
 softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

The plain reads are outside of jh->b_state_lock critical section which result in data races. Fix them by adding pairs of READ|WRITE_ONCE().

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
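A minimal illustration of the READ_ONCE()/WRITE_ONCE() pairing the fix describes (field names come from the message; the real call sites in fs/jbd2 differ):

        /* writer, under jh->b_state_lock */
        WRITE_ONCE(jh->b_transaction, jh->b_next_transaction);
        WRITE_ONCE(jh->b_next_transaction, NULL);

        /* lockless reader on the fast path */
        if (READ_ONCE(jh->b_transaction) == handle->h_transaction ||
            READ_ONCE(jh->b_next_transaction) == handle->h_transaction)
                return true;    /* loads can no longer be torn, fused or refetched */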
2020-02-28  Merge tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull io_uring fixes from Jens Axboe:

 - Fix for a race with IOPOLL used with SQPOLL (Xiaoguang)
 - Only show ->fdinfo if procfs is enabled (Tobias)
 - Fix for a chain with multiple personalities in the SQEs
 - Fix for a missing free of personality idr on exit
 - Removal of the spin-for-work optimization
 - Fix for next work lookup on request completion
 - Fix for non-vec read/write result propagation in case of links
 - Fix for fileset references on switch
 - Fix for recvmsg/sendmsg 32-bit compatibility mode

* tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
  io_uring: fix 32-bit compatibility with sendmsg/recvmsg
  io_uring: define and set show_fdinfo only if procfs is enabled
  io_uring: drop file set ref put/get on switch
  io_uring: import_single_range() returns 0/-ERROR
  io_uring: pick up link work on submit reference drop
  io-wq: ensure work->task_pid is cleared on init
  io-wq: remove spin-for-work optimization
  io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL
  io_uring: fix personality idr leak
  io_uring: handle multiple personalities in link chains
2020-02-28  proc: Remove the now unnecessary internal mount of proc  (Eric W. Biederman)
There remains no more code in the kernel using pids_ns->proc_mnt, therefore remove it from the kernel. The big benefit of this change is that one of the most error prone and tricky parts of the pid namespace implementation, maintaining kernel mounts of proc is removed. In addition removing the unnecessary complexity of the kernel mount fixes a regression that caused the proc mount options to be ignored. Now that the initial mount of proc comes from userspace, those mount options are again honored. This fixes Android's usage of the proc hidepid option. Reported-by: Alistair Strachan <astrachan@google.com> Fixes: e94591d0d90c ("proc: Convert proc_mount to use mount_ns.") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-02-28  Merge tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs  (Linus Torvalds)
Pull zonefs fixes from Damien Le Moal:
 "Two fixes in here:

  - Revert the initial decision to silently ignore IOCB_NOWAIT for asynchronous direct IOs to sequential zone files. Instead, return an error to the user to signal that the feature is not supported (from Christoph)

  - A fix to zonefs Kconfig to select FS_IOMAP to avoid build failures if no other file system already selected this option (from Johannes)"

* tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: select FS_IOMAP
  zonefs: fix IOCB_NOWAIT handling
2020-02-27  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller)
The mptcp conflict was overlapping additions. The SMC conflict was an addition and a removal happening at the same time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27  io_uring: fix 32-bit compatibility with sendmsg/recvmsg  (Jens Axboe)
We must set MSG_CMSG_COMPAT if we're in compatibility mode, otherwise the iovec import for these commands will not do the right thing and will fail the command with -EINVAL.

Found by running the test suite compiled as 32-bit.

Cc: stable@vger.kernel.org
Fixes: aa1fa28fc73e ("io_uring: add support for recvmsg()")
Fixes: 0fa03c624d8f ("io_uring: add support for sendmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-27  atomic_open(): saner calling conventions (return dentry on success)  (Al Viro)
Currently it either returns -E... or puts (nd->path.mnt,dentry) into *path and returns 0. Make it return ERR_PTR(-E...) or dentry; adjust the caller. Fewer arguments and it's easier to keep track of *path contents that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
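The same ERR_PTR-style convention, shown on a hypothetical helper (not the real fs/namei.c code): the result and the error travel in one pointer, so there is no out-parameter to keep track of.

        static struct dentry *lookup_widget(struct dentry *parent, const struct qstr *name)
        {
                struct dentry *dentry = d_lookup(parent, name);

                if (!dentry)
                        return ERR_PTR(-ENOENT);        /* error encoded in the pointer */
                return dentry;                          /* success: hand back the dentry */
        }

        /* caller */
        dentry = lookup_widget(parent, &this);
        if (IS_ERR(dentry))
                return PTR_ERR(dentry);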
2020-02-27  handle_mounts(): start building a sane wrapper for follow_managed()  (Al Viro)
All callers of follow_managed() follow it on success with the same steps - d_backing_inode(path->dentry) is calculated and stored into some struct inode * variable and, in all but one case, an unsigned variable (nd->seq to be) is zeroed. The single exception is lookup_fast(), and there zeroing is the correct thing to do - not doing it is a pointless microoptimization.

Add a wrapper for follow_managed() that would do that combination. It's mostly a vehicle for code massage - it will be changing quite a bit, and the current calling conventions are by no means final. Right now it takes path, nameidata and (as out params) inode and seq, similar to __follow_mount_rcu(). Which will soon get folded into it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27  make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW  (Al Viro)
O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink". As it is, we might or might not have LOOKUP_FOLLOW in op->intent in that case - that depends upon having O_NOFOLLOW in open flags. It doesn't matter, since we won't be checking it in that case - do_last() bails out earlier.

However, making sure it's not set (i.e. acting as if we had an explicit O_NOFOLLOW) makes the behaviour more explicit and allows reordering the check for O_CREAT | O_EXCL in do_last() with the call of step_into() immediately following it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
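A sketch of the kind of tweak described, in build_open_flags() terms (the exact hunk may express it differently; treating it as adding O_NOFOLLOW is an assumption based on the text above):

        /*
         * O_CREAT | O_EXCL fails on a trailing symlink anyway, so behave as
         * if the caller had also passed an explicit O_NOFOLLOW.
         */
        if ((flags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))
                flags |= O_NOFOLLOW;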
2020-02-27  follow_automount() doesn't need the entire nameidata  (Al Viro)
Only the address of ->total_link_count and the flags. And fix an off-by-one in ELOOP detection - make it consistent with symlink following, where we check if the pre-increment value has reached 40, rather than checking the post-increment one.

[kudos to Christian Brauner for spotting the braino]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
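The off-by-one in question, with generic names rather than the actual fs/namei.c hunk: checking the post-increment value trips one crossing too early.

        /* before: bump first, then test the new value -- returns -ELOOP on the
         * 40th crossing, i.e. only 39 automount traversals are allowed */
        (*count)++;
        if (*count >= MAXSYMLINKS)
                return -ELOOP;

        /* after: test the pre-increment value, as symlink following does, so
         * exactly MAXSYMLINKS (40) crossings are allowed */
        if ((*count)++ >= MAXSYMLINKS)
                return -ELOOP;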
2020-02-27  follow_automount(): get rid of dead^Wstillborn code  (Al Viro)
1) no instances of ->d_automount() have ever made use of the "return ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's a rudiment of plans that got superseded before the thing went into the tree. Despite the comment in follow_automount(), autofs has never done that.

2) if there's no ->d_automount() in dentry_operations, filesystems should not set DCACHE_NEED_AUTOMOUNT in the first place. None have ever done so...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27  fix automount/automount race properly  (Al Viro)
Protection against automount/automount races (two threads hitting the same referral point at the same time) is based upon do_add_mount() prevention of identical overmounts - trying to overmount the root of the mounted tree with the same tree fails with -EBUSY. It's unreliable (the other thread might've mounted something on top of the automount it has triggered) *and* causes no end of headache for follow_automount() and its caller, since finish_automount() behaves like do_new_mount() - if the mountpoint to be is overmounted, it mounts on top of what's overmounting it. It's not only wrong (we want to go into what's overmounting the automount point and quietly discard what we planned to mount there), it introduces the possibility of the original parent mount getting dropped.

That's what 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount) deals with, but it can't do anything about the reliability of conflict detection - if something had been mounted over the other thread's automount (e.g. that other thread having stepped into the automount in mount(2)), we don't get that -EBUSY and the result is

        referral point under automounted NFS under explicit overmount under another copy of automounted NFS

What we need is finish_automount() *NOT* digging into overmounts - if it finds one, it should just quietly discard the thing it was asked to mount. And don't bother with actually crossing into the results of finish_automount() - the same loop that calls follow_automount() will do that just fine on the next iteration.

IOW, instead of calling lock_mount() have finish_automount() do it manually, _without_ the "move into overmount and retry" part. And leave crossing into the results to the caller of follow_automount(), which simplifies it a lot.

Moral: if you end up with a lot of glue working around the calling conventions of something, perhaps these calling conventions are simply wrong...

Fixes: 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27  f2fs: fix to avoid potential deadlock  (Chao Yu)
Use f2fs_trylock_op() in f2fs_write_compressed_pages() to avoid a potential deadlock, like we did in f2fs_write_single_data_page().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27  f2fs: add missing function name in kernel message  (Chao Yu)
Otherwise, we can not distinguish the exact location of messages when there is more than one place printing the same message.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27  f2fs: recycle unused compress_data.chksum field  (Chao Yu)
In struct compress_data, the chksum field was never used; remove it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27  f2fs: fix to avoid NULL pointer dereference  (Chao Yu)
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 PC is at f2fs_free_dic+0x60/0x2c8
 LR is at f2fs_decompress_pages+0x3c4/0x3e8
 f2fs_free_dic+0x60/0x2c8
 f2fs_decompress_pages+0x3c4/0x3e8
 __read_end_io+0x78/0x19c
 f2fs_post_read_work+0x6c/0x94
 process_one_work+0x210/0x48c
 worker_thread+0x2e8/0x44c
 kthread+0x110/0x120
 ret_from_fork+0x10/0x18

In f2fs_free_dic(), we can not use f2fs_put_page(,1) to release dic->tpages[i], as the page's mapping is NULL.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27  f2fs: fix leaking uninitialized memory in compressed clusters  (Eric Biggers)
When the compressed data of a cluster doesn't end on a page boundary, the remainder of the last page must be zeroed in order to avoid leaking uninitialized memory to disk. Fixes: 4c8ff7095bef ("f2fs: support data compression") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
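An illustrative fragment of the zero-the-tail idea (buffer and length names are hypothetical, not the actual f2fs hunk):

        /*
         * The compressed stream ends somewhere inside the last page; zero the
         * remainder of that page so stale kernel memory never reaches disk.
         */
        size_t tail = round_up(compressed_len, PAGE_SIZE) - compressed_len;

        if (tail)
                memset(cbuf + compressed_len, 0, tail);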
2020-02-27  f2fs: fix the panic in do_checkpoint()  (Sahitya Tummala)
There could be a scenario where f2fs_sync_meta_pages() will not ensure that all F2FS_DIRTY_META pages are submitted for IO, thus resulting in the below panic in do_checkpoint():

        f2fs_bug_on(sbi, get_pages(sbi, F2FS_DIRTY_META) &&
                !f2fs_cp_error(sbi));

This can happen in a low-memory condition, where the shrinker could also be doing the writepage operation (stack shown below) at the same time that checkpoint is running on another core.

        schedule
        down_write
        f2fs_submit_page_write  -> by this time, this page in page cache is tagged
                                   as PAGECACHE_TAG_WRITEBACK and PAGECACHE_TAG_DIRTY
                                   is cleared, due to which f2fs_sync_meta_pages()
                                   cannot sync this page in the do_checkpoint() path.
        f2fs_do_write_meta_page
        __f2fs_write_meta_page
        f2fs_write_meta_page
        shrink_page_list
        shrink_inactive_list
        shrink_node_memcg
        shrink_node
        kswapd

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27  f2fs: fix to wait all node page writeback  (Chao Yu)
There is a race condition where we may miss waiting for all node pages' writeback; fix it.

        - fsync()                               - shrink
         - f2fs_do_sync_file
                                                 - __write_node_page
                                                  - set_page_writeback(page #0)
                                                    : removes DIRTY/TOWRITE flag
         - f2fs_fsync_node_pages
           : won't find page #0 as the TOWRITE flag was removed
         - f2fs_wait_on_node_pages_writeback
           : won't wait for page #0 writeback as it
             was not in the fsync_node_list
                                                 - f2fs_add_fsync_node_entry

Fixes: 50fa53eccf9f ("f2fs: fix to avoid broken of dnode block list")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-02-27  pstore: pstore_ftrace_seq_next should increase position index  (Vasily Averin)
In Aug 2018 NeilBrown noticed, in commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface"):

 "Some ->next functions do not increment *pos when they return NULL...
  Note that such ->next functions are buggy and should be fixed.
  A simple demonstration is

    dd if=/proc/swaps bs=1000 skip=1

  Choose any block size larger than the size of /proc/swaps. This will
  always show the whole last line of /proc/swaps."

/proc/swaps output was fixed recently, however there are a lot of other affected files, and one of them is related to the pstore subsystem.

If the .next function does not change the position index, the following .show function will repeat output related to the current position index. There are at least 2 related problems:

 - read after lseek beyond the end of file: described above by NeilBrown, "dd if=<AFFECTED_FILE> bs=1000 skip=1" will generate the whole last list
 - read after lseek into the middle of the last line will output the expected rest of the last line, but then repeat the whole last line once again

If the .show() function generates multi-line output (like pstore_ftrace_seq_show() does?), the following bash script cycles endlessly:

 $ q=;while read -r r;do echo "$((++q)) $r";done < AFFECTED_FILE

Unfortunately I'm not familiar enough with the pstore subsystem and was unable to find an affected pstore-related file on my test node.

Cc: stable@vger.kernel.org
Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Link: https://lore.kernel.org/r/4e49830d-4c88-0171-ee24-1ee540028dad@virtuozzo.com
[kees: with robustness tweak from Joel Fernandes <joelaf@google.com>]
Signed-off-by: Kees Cook <keescook@chromium.org>
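The rule being enforced is the general seq_file one: ->next must advance *pos even when it ends the iteration. A generic sketch with a hypothetical iterator (not the actual pstore code):

        static void *example_seq_next(struct seq_file *s, void *v, loff_t *pos)
        {
                struct example_iter *it = s->private;

                (*pos)++;                       /* always advance, even on the last record */
                if (*pos >= it->nr_records)
                        return NULL;            /* ->show() is not replayed for the old index */
                return &it->records[*pos];
        }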
2020-02-27  io_uring: define and set show_fdinfo only if procfs is enabled  (Tobias Klauser)
Follow the pattern used with other *_show_fdinfo functions and only define and use io_uring_show_fdinfo and its helper functions if CONFIG_PROC_FS is set. Fixes: 87ce955b24c9 ("io_uring: add ->show_fdinfo() for the io_uring file descriptor") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Jens Axboe <axboe@kernel.dk>
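The pattern referred to looks roughly like this (a sketch; the helper body and the rest of the fops table are elided):

        #ifdef CONFIG_PROC_FS
        static void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
        {
                /* ... dump SQ/CQ state and registered files for this ring ... */
        }
        #endif

        static const struct file_operations io_uring_fops = {
                .release        = io_uring_release,
        #ifdef CONFIG_PROC_FS
                .show_fdinfo    = io_uring_show_fdinfo, /* compiled out without procfs */
        #endif
        };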
2020-02-27  gfs2: leaf_dealloc needs to allocate one more revoke  (Bob Peterson)
Function leaf_dealloc was not allocating enough journal space for revokes. Before, it allocated 'l_blocks' revokes, but it needs one more for the revoke of the dinode that is modified. This patch adds the needed revoke entry to function leaf_dealloc. Signed-off-by: Bob Peterson <rpeterso@redhat.com>