summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-05-02dlm: remove unnecessary error assignAlexander Aring
This patch removes unnecessary error assigns to 0 at places we know that error is zero because it was checked on non-zero before. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-05-02dlm: fix missing lkb refcount handlingAlexander Aring
We always call hold_lkb(lkb) if we increment lkb->lkb_wait_count. So, we always need to call unhold_lkb(lkb) if we decrement lkb->lkb_wait_count. This patch will add missing unhold_lkb(lkb) if we decrement lkb->lkb_wait_count. In case of setting lkb->lkb_wait_count to zero we need to countdown until reaching zero and call unhold_lkb(lkb). The waiters list unhold_lkb(lkb) can be removed because it's done for the last lkb_wait_count decrement iteration as it's done in _remove_from_waiters(). This issue was discovered by a dlm gfs2 test case which use excessively dlm_unlock(LKF_CANCEL) feature. Probably the lkb->lkb_wait_count value never reached above 1 if this feature isn't used and so it was not discovered before. The testcase ended in a rsb on the rsb keep data structure with a refcount of 1 but no lkb was associated with it, which is itself an invalid behaviour. A side effect of that was a condition in which the dlm was sending remove messages in a looping behaviour. With this patch that has not been reproduced. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-05-02block: ignore RWF_HIPRI hint for sync dioMing Lei
So far bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed from userspace sync io interface, then block layer tries to poll until the bio is completed. But the current implementation calls blk_io_schedule() if bio_poll() returns 0, and this way causes io hang or timeout easily. But looks no one reports this kind of issue, which should have been triggered in normal io poll sanity test or blktests block/007 as observed by Changhui, that means it is very likely that no one uses it or no one cares it. Also after io_uring is invented, io poll for sync dio becomes legacy interface. So ignore RWF_HIPRI hint for sync dio. CC: linux-mm@kvack.org Cc: linux-xfs@vger.kernel.org Reported-by: Changhui Zhong <czhong@redhat.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Changhui Zhong <czhong@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220420143110.2679002-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02io_uring: assign non-fixed early for async workJens Axboe
We defer file assignment to ensure that fixed files work with links between a direct accept/open and the links that follow it. But this has the side effect that normal file assignment is then not complete by the time that request submission has been done. For deferred execution, if the file is a regular file, assign it when we do the async prep anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02Merge 5.18-rc5 into driver-core-nextGreg Kroah-Hartman
We need the kernfs/driver core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-01fs-verity: define a function to return the integrity protected file digestMimi Zohar
Define a function named fsverity_get_digest() to return the verity file digest and the associated hash algorithm (enum hash_algo). This assumes that before calling fsverity_get_digest() the file must have been opened, which is even true for the IMA measure/appraise on file open policy rule use case (func=FILE_CHECK). do_open() calls vfs_open() immediately prior to ima_file_check(). Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2022-04-30Merge tag 'driver-core-5.18-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are some small driver core and kernfs fixes for some reported problems. They include: - kernfs regression that is causing oopses in 5.17 and newer releases - topology sysfs fixes for a few small reported problems. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kernfs: fix NULL dereferencing in kernfs_remove topology: Fix up build warning in topology_is_visible() arch_topology: Do not set llc_sibling if llc_id is invalid topology: make core_mask include at least cluster_siblings topology/sysfs: Hide PPIN on systems that do not support it.
2022-04-30io_uring: replace smp_mb() with smp_mb__after_atomic() in io_sq_thread()Almog Khaikin
The IORING_SQ_NEED_WAKEUP flag is now set using atomic_or() which implies a full barrier on some architectures but it is not required to do so. Use the more appropriate smp_mb__after_atomic() which avoids the extra barrier on those architectures. Signed-off-by: Almog Khaikin <almogkh@gmail.com> Link: https://lore.kernel.org/r/20220426163403.112692-1-almogkh@gmail.com Fixes: 8018823e6987 ("io_uring: serialize ctx->rings->sq_flags with atomic_or/and") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-30io_uring: add IORING_SETUP_TASKRUN_FLAGJens Axboe
If IORING_SETUP_COOP_TASKRUN is set to use cooperative scheduling for running task_work, then IORING_SETUP_TASKRUN_FLAG can be set so the application can tell if task_work is pending in the kernel for this ring. This allows use cases like io_uring_peek_cqe() to still function appropriately, or for the task to know when it would be useful to call io_uring_wait_cqe() to run pending events. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20220426014904.60384-7-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-30io_uring: use TWA_SIGNAL_NO_IPI if IORING_SETUP_COOP_TASKRUN is usedJens Axboe
If this is set, io_uring will never use an IPI to deliver a task_work notification. This can be used in the common case where a single task or thread communicates with the ring, and doesn't rely on io_uring_cqe_peek(). This provides a noticeable win in performance, both from eliminating the IPI itself, but also from avoiding interrupting the submitting task unnecessarily. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20220426014904.60384-6-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-30io_uring: set task_work notify method at init timeJens Axboe
While doing so, switch SQPOLL to TWA_SIGNAL_NO_IPI as well, as that just does a task wakeup and then we can remove the special wakeup we have in task_work_add. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20220426014904.60384-5-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-30io-wq: use __set_notify_signal() to wake workersJens Axboe
The only difference between set_notify_signal() and __set_notify_signal() is that the former checks if it needs to deliver an IPI to force a reschedule. As the io-wq workers never leave the kernel, and IPI is never needed, they simply need a wakeup. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20220426014904.60384-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-30io_uring: serialize ctx->rings->sq_flags with atomic_or/andJens Axboe
Rather than require ctx->completion_lock for ensuring that we don't clobber the flags, use the atomic bitop helpers instead. This removes the need to grab the completion_lock, in preparation for needing to set or clear sq_flags when we don't know the status of this lock. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20220426014904.60384-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-29io_uring: return hint on whether more data is available after receiveJens Axboe
For now just use a CQE flag for this, with big CQE support we could return the actual number of bytes left. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-29Merge branch 'for-5.19/io_uring-socket' into for-5.19/io_uring-netJens Axboe
* for-5.19/io_uring-socket: (73 commits) io_uring: use the text representation of ops in trace io_uring: rename op -> opcode io_uring: add io_uring_get_opcode io_uring: add type to op enum io_uring: add socket(2) support net: add __sys_socket_file() io_uring: fix trace for reduced sqe padding io_uring: add fgetxattr and getxattr support io_uring: add fsetxattr and setxattr support fs: split off do_getxattr from getxattr fs: split off setxattr_copy and do_setxattr function from setxattr io_uring: return an error when cqe is dropped io_uring: use constants for cq_overflow bitfield io_uring: rework io_uring_enter to simplify return value io_uring: trace cqe overflows io_uring: add trace support for CQE overflow io_uring: allow re-poll if we made progress io_uring: support MSG_WAITALL for IORING_OP_SEND(MSG) io_uring: add support for IORING_ASYNC_CANCEL_ANY io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key ...
2022-04-29Merge tag 'io_uring-5.18-2022-04-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "Pretty boring: - three patches just adding reserved field checks (me, Eugene) - Fixing a potential regression with IOPOLL caused by a block change (Joseph)" Boring is good. * tag 'io_uring-5.18-2022-04-29' of git://git.kernel.dk/linux-block: io_uring: check that data field is 0 in ringfd unregister io_uring: fix uninitialized field in rw io_kiocb io_uring: check reserved fields for recv/recvmsg io_uring: check reserved fields for send/sendmsg
2022-04-29fs: sysv: check sbi->s_firstdatazone in complete_read_superLiu Shixin
sbi->s_firstinodezone is initialized to 2 and sbi->s_firstdatazone is read from sbd. There's no guarantee that sbi->s_firstdatazone must bigger than sbi->s_firstinodezone. If sbi->s_firstdatazone less than 2, the filesystem can still be mounted unexpetly. At this point, sbi->s_ninodes flip to very large value and this filesystem is broken. We can observe this by executing 'df' command. When we execute, we will get an error message: "sysv_count_free_inodes: unable to read inode table" Link: https://lkml.kernel.org/r/20220330104215.530223-1-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29fat: add ratelimit to fat*_ent_bread()OGAWA Hirofumi
fat*_ent_bread() can be the cause of too many report on I/O error path. So use fat_msg_ratelimit() instead. Link: https://lkml.kernel.org/r/87bkxogfeq.fsf@mail.parknet.co.jp Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: qianfan <qianfanguijin@163.com> Tested-by: qianfan <qianfanguijin@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29fatfs: add FAT messages to printk indexJonathan Lassoff
In order for end users to quickly react to new issues that come up in production, it is proving useful to leverage the printk indexing system. This printk index enables kernel developers to use calls to printk() with changeable ad-hoc format strings (as they always have; no change of expectations), while enabling end users to examine format strings to detect changes. Since end users are using regular expressions to match messages printed through printk(), being able to detect changes in chosen format strings from release to release provides a useful signal to review printk()-matching regular expressions for any necessary updates. So that detailed FAT messages are captured by this printk index, this patch wraps fat_msg with a macro. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/8aaa2dd7995e820292bb40d2120ab69756662c65.1648688136.git.jof@thejof.com Signed-off-by: Jonathan Lassoff <jof@thejof.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29fatfs: remove redundant judgmentYubo Feng
iput() has already judged the incoming parameter, so there is no need to repeat outside. Link: https://lkml.kernel.org/r/1648265418-76563-1-git-send-email-fengyubo3@huawei.com Signed-off-by: Yubo Feng <fengyubo3@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29pipe: make poll_usage boolean and annotate its accessKuniyuki Iwashima
Patch series "Fix data-races around epoll reported by KCSAN." This series suppresses a false positive KCSAN's message and fixes a real data-race. This patch (of 2): pipe_poll() runs locklessly and assigns 1 to poll_usage. Once poll_usage is set to 1, it never changes in other places. However, concurrent writes of a value trigger KCSAN, so let's make KCSAN happy. BUG: KCSAN: data-race in pipe_poll / pipe_poll write to 0xffff8880042f6678 of 4 bytes by task 174 on cpu 3: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) write to 0xffff8880042f6678 of 4 bytes by task 177 on cpu 1: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 177 Comm: epoll_race Not tainted 5.17.0-58927-gf443e374ae13 #6 Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014 Link: https://lkml.kernel.org/r/20220322002653.33865-1-kuniyu@amazon.co.jp Link: https://lkml.kernel.org/r/20220322002653.33865-2-kuniyu@amazon.co.jp Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kuniyuki Iwashima <kuni1840@gmail.com> Cc: "Soheil Hassas Yeganeh" <soheil@google.com> Cc: "Sridhar Samudrala" <sridhar.samudrala@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29vmcore: convert read_from_oldmem() to take an iov_iterMatthew Wilcox (Oracle)
Remove the read_from_oldmem() wrapper introduced earlier and convert all the remaining callers to pass an iov_iter. Link: https://lkml.kernel.org/r/20220408090636.560886-4-bhe@redhat.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Amit Daniel Kachhap <amit.kachhap@arm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29vmcore: convert __read_vmcore to use an iov_iterMatthew Wilcox (Oracle)
This gets rid of copy_to() and let us use proc_read_iter() instead of proc_read(). Link: https://lkml.kernel.org/r/20220408090636.560886-3-bhe@redhat.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29vmcore: convert copy_oldmem_page() to take an iov_iterMatthew Wilcox (Oracle)
Patch series "Convert vmcore to use an iov_iter", v5. For some reason several people have been sending bad patches to fix compiler warnings in vmcore recently. Here's how it should be done. Compile-tested only on x86. As noted in the first patch, s390 should take this conversion a bit further, but I'm not inclined to do that work myself. This patch (of 3): Instead of passing in a 'buf' and 'userbuf' argument, pass in an iov_iter. s390 needs more work to pass the iov_iter down further, or refactor, but I'd be more comfortable if someone who can test on s390 did that work. It's more convenient to convert the whole of read_from_oldmem() to take an iov_iter at the same time, so rename it to read_from_oldmem_iter() and add a temporary read_from_oldmem() wrapper that creates an iov_iter. Link: https://lkml.kernel.org/r/20220408090636.560886-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20220408090636.560886-2-bhe@redhat.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29fs/proc/kcore.c: remove check of list iterator against head past the loop bodyJakob Koschel
When list_for_each_entry() completes the iteration over the whole list without breaking the loop, the iterator value will be a bogus pointer computed based on the head element. While it is safe to use the pointer to determine if it was computed based on the head element, either with list_entry_is_head() or &pos->member == head, using the iterator variable after the loop should be avoided. In preparation to limit the scope of a list iterator to the list traversal loop, use a dedicated pointer to point to the found element [1]. [akpm@linux-foundation.org: reduce scope of `iter'] Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20220331223700.902556-1-jakobkoschel@gmail.com Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Brian Johannesmeyer" <bjohannesmeyer@gmail.com> Cc: Cristiano Giuffrida <c.giuffrida@vu.nl> Cc: "Bos, H.J." <h.j.bos@vu.nl> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29ocfs2: rewrite error handling of ocfs2_fill_superHeming Zhao via Ocfs2-devel
Current ocfs2_fill_super() uses one goto label "read_super_error" to handle all error cases. And with previous serial patches, the error handling should fork more branches to handle different error cases. This patch rewrite the error handling of ocfs2_fill_super. Link: https://lkml.kernel.org/r/20220424130952.2436-6-heming.zhao@suse.com Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29ocfs2: ocfs2_mount_volume does cleanup job before return errorHeming Zhao via Ocfs2-devel
After this patch, when error, ocfs2_fill_super doesn't take care to release resources which are allocated in ocfs2_mount_volume. Link: https://lkml.kernel.org/r/20220424130952.2436-5-heming.zhao@suse.com Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29ocfs2: ocfs2_initialize_super does cleanup job before return errorHeming Zhao via Ocfs2-devel
After this patch, when error, ocfs2_fill_super doesn't take care to release resources which are allocated in ocfs2_initialize_super. Link: https://lkml.kernel.org/r/20220424130952.2436-4-heming.zhao@suse.com Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29ocfs2: change return type of ocfs2_resmap_initHeming Zhao via Ocfs2-devel
Since ocfs2_resmap_init() always return 0, change it to void. Link: https://lkml.kernel.org/r/20220424130952.2436-3-heming.zhao@suse.com Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29ocfs2: fix mounting crash if journal is not allocedHeming Zhao via Ocfs2-devel
Patch series "rewrite error handling during mounting stage". This patch (of 5): After commit da5e7c87827e8 ("ocfs2: cleanup journal init and shutdown"), journal init later than before, it makes NULL pointer access in free routine. Crash flow: ocfs2_fill_super + ocfs2_mount_volume | + ocfs2_dlm_init //fail & return, osb->journal is NULL. | + ... | + ocfs2_check_volume //no chance to init osb->journal | + ... + ocfs2_dismount_volume ocfs2_release_system_inodes ... evict ... ocfs2_clear_inode ocfs2_checkpoint_inode ocfs2_ci_fully_checkpointed time_after(journal->j_trans_id, ci->ci_last_trans) + journal is empty, crash! For fixing, there are three solutions: 1> Partly revert commit da5e7c87827e8 For avoiding kernel crash, this make sense for us. We only concerned whether there has any non-system inode access before dlm init. The answer is NO. And all journal replay/recovery handling happen after dlm & journal init done. So this method is not graceful but workable. 2> Add osb->journal check in free inode routine (eg ocfs2_clear_inode) The fix code is special for mounting phase, but it will continue working after mounting stage. In another word, this method adds useless code in normal inode free flow. 3> Do directly free inode in mounting phase This method is brutal/complex and may introduce unsafe code, currently maintainer didn't like. At last, we chose method <1> and did partly reverted job. We reverted journal init codes, and kept cleanup codes flow. Link: https://lkml.kernel.org/r/20220424130952.2436-1-heming.zhao@suse.com Link: https://lkml.kernel.org/r/20220424130952.2436-2-heming.zhao@suse.com Fixes: da5e7c87827e8 ("ocfs2: cleanup journal init and shutdown") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29ocfs2: remove usage of list iterator variable after the loop bodyJakob Koschel
To move the list iterator variable into the list_for_each_entry_*() macro in the future it should be avoided to use the list iterator variable after the loop body. To *never* use the list iterator variable after the loop it was concluded to use a separate iterator variable [1]. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220322105014.3626194-1-jakobkoschel@gmail.com Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29ocfs2: replace usage of found with dedicated list iterator variableJakob Koschel
To move the list iterator variable into the list_for_each_entry_*() macro in the future it should be avoided to use the list iterator variable after the loop body. To *never* use the list iterator variable after the loop it was concluded to use a separate iterator variable instead of a found boolean [1]. This removes the need to use a found variable and simply checking if the variable was set, can determine if the break/goto was hit. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220324071650.61168-1-jakobkoschel@gmail.com Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29Merge tag 'ceph-for-5.18-rc5' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph client fixes from Ilya Dryomov: "A fix for a NULL dereference that turns out to be easily triggerable by fsync (marked for stable) and a false positive WARN and snap_rwsem locking fixups" * tag 'ceph-for-5.18-rc5' of https://github.com/ceph/ceph-client: ceph: fix possible NULL pointer dereference for req->r_session ceph: remove incorrect session state check ceph: get snap_rwsem read lock in handle_cap_export for ceph_add_cap libceph: disambiguate cluster/pool full log message
2022-04-29io_uring: check that data field is 0 in ringfd unregisterEugene Syromiatnikov
Only allow data field to be 0 in struct io_uring_rsrc_update user arguments to allow for future possible usage. Fixes: e7a6c00dc77a ("io_uring: add support for registering ring file descriptors") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Link: https://lore.kernel.org/r/20220429142218.GA28696@asgard.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-28ksm: count ksm merging pages for each processxu xin
Some applications or containers want to use KSM by calling madvise() to advise areas of address space to be MERGEABLE. But they may not know which applications are more likely to cause real merges in the deployment. If this patch is applied, it helps them know their corresponding number of merged pages, and then optimize their app code. As current KSM only counts the number of KSM merging pages(e.g. ksm_pages_sharing and ksm_pages_shared) of the whole system, we cannot see the more fine-grained KSM merging, for the upper application optimization, the merging area cannot be set easily according to the KSM page merging probability of each process. Therefore, it is necessary to add extra statistical means so that the upper level users can know the detailed KSM merging information of each process. We add a new proc file named as ksm_merging_pages under /proc/<pid>/ to indicate the involved ksm merging pages of this process. [akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s] Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reported-by: Zeal Robot <zealci@zte.com.cn> Cc: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Ohhoon Kwon <ohoono.kwon@samsung.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Feng Tang <feng.tang@intel.com> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*Muchun Song
The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize". In this patch , cheanup configs to make code more expressive. Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28dax: fix missing writeprotect the pte entryMuchun Song
Currently dax_mapping_entry_mkclean() fails to clean and write protect the pte entry within a DAX PMD entry during an *sync operation. This can result in data loss in the following sequence: 1) process A mmap write to DAX PMD, dirtying PMD radix tree entry and making the pmd entry dirty and writeable. 2) process B mmap with the @offset (e.g. 4K) and @length (e.g. 4K) write to the same file, dirtying PMD radix tree entry (already done in 1)) and making the pte entry dirty and writeable. 3) fsync, flushing out PMD data and cleaning the radix tree entry. We currently fail to mark the pte entry as clean and write protected since the vma of process B is not covered in dax_entry_mkclean(). 4) process B writes to the pte. These don't cause any page faults since the pte entry is dirty and writeable. The radix tree entry remains clean. 5) fsync, which fails to flush the dirty PMD data because the radix tree entry was clean. 6) crash - dirty data that should have been fsync'd as part of 5) could still have been in the processor cache, and is lost. Just to use pfn_mkclean_range() to clean the pfns to fix this issue. Link: https://lkml.kernel.org/r/20220403053957.10770-6-songmuchun@bytedance.com Fixes: 4b4bb46d00b3 ("dax: clear dirty entry tags on cache flush") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28dax: fix cache flush on PMD-mapped pagesMuchun Song
The flush_cache_page() only remove a PAGE_SIZE sized range from the cache. However, it does not cover the full pages in a THP except a head page. Replace it with flush_cache_range() to fix this issue. This is just a documentation issue with the respect to properly documenting the expected usage of cache flushing before modifying the pmd. However, in practice this is not a problem due to the fact that DAX is not available on architectures with virtually indexed caches per: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Link: https://lkml.kernel.org/r/20220403053957.10770-3-songmuchun@bytedance.com Fixes: f729c8c9b24f ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28fs/proc/task_mmu.c: remove redundant page validation of pte_pageXianting Tian
pte_page() always returns a valid page, so remove the redundant page validation, as we did in many other places. Link: https://lkml.kernel.org/r/20220316025947.328276-1-xianting.tian@linux.alibaba.com Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Sasha Levin <sashal@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm: hugetlb_vmemmap: introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAPMuchun Song
The feature of minimizing overhead of struct page associated with each HugeTLB page is implemented on x86_64, however, the infrastructure of this feature is already there, we could easily enable it for other architectures. Introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP for other architectures to be easily enabled. Just select this config if they want to enable this feature. Link: https://lkml.kernel.org/r/20220331065640.5777-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Barry Song <baohua@kernel.org> Tested-by: Barry Song <baohua@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: James Morse <james.morse@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28io_uring: fix uninitialized field in rw io_kiocbJoseph Ravichandran
io_rw_init_file does not initialize kiocb->private, so when iocb_bio_iopoll reads kiocb->private it can contain uninitialized data. Fixes: 3e08773c3841 ("block: switch polling to be bio based") Signed-off-by: Joseph Ravichandran <jravi@mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-28xfs: rename xfs_*alloc*_log_count to _block_countDarrick J. Wong
These functions return the maximum number of blocks that could be logged in a particular transaction. "log count" is confusing since there's a separate concept of a log (operation) count in the reservation code, so let's change it to "block count" to be less confusing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-28xfs: rewrite xfs_reflink_end_cow to use intentsDarrick J. Wong
Currently, the code that performs CoW remapping after a write has this odd behavior where it walks /backwards/ through the data fork to remap extents in reverse order. Earlier, we rewrote the reflink remap function to use deferred bmap log items instead of trying to cram as much into the first transaction that we could. Now do the same for the CoW remap code. There doesn't seem to be any performance impact; we're just making better use of code that we added for the benefit of reflink. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-28xfs: reduce transaction reservations with reflinkDarrick J. Wong
Before to the introduction of deferred refcount operations, reflink would try to cram refcount btree updates into the same transaction as an allocation or a free event. Mainline XFS has never actually done that, but we never refactored the transaction reservations to reflect that we now do all refcount updates in separate transactions. Fix this to reduce the transaction reservation size even farther, so that between this patch and the previous one, we reduce the tr_write and tr_itruncate sizes by 66%. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-28xfs: reduce the absurdly large log operation countDarrick J. Wong
Back in the early days of reflink and rmap development I set the transaction reservation sizes to be overly generous for rmap+reflink filesystems, and a little under-generous for rmap-only filesystems. Since we don't need *eight* transaction rolls to handle three new log intent items, decrease the logcounts to what we actually need, and amend the shadow reservation computation function to reflect what we used to do so that the minimum log size doesn't change. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-28xfs: report "max_resp" used for min log size computationDarrick J. Wong
Move the tracepoint that computes the size of the transaction used to compute the minimum log size into xfs_log_get_max_trans_res so that we only have to compute this stuff once. Leave xfs_log_get_max_trans_res as a non-static function so that xfs_db can call it to report the results of the userspace computation of the same value to diagnose mkfs/kernel misinteractions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-28xfs: create shadow transaction reservations for computing minimum log sizeDarrick J. Wong
Every time someone changes the transaction reservation sizes, they introduce potential compatibility problems if the changes affect the minimum log size that we validate at mount time. If the minimum log size gets larger (which should be avoided because doing so presents a serious risk of log livelock), filesystems created with old mkfs will not mount on a newer kernel; if the minimum size shrinks, filesystems created with newer mkfs will not mount on older kernels. Therefore, enable the creation of a shadow log reservation structure where we can "undo" the effects of tweaks when computing minimum log sizes. These shadow reservations should never be used in practice, but they insulate us from perturbations in minimum log size. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-28xfs: remove a __xfs_bunmapi call from reflinkDarrick J. Wong
This raw call isn't necessary since we can always remove a full delalloc extent. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-28xfs: stop artificially limiting the length of bunmap callsDarrick J. Wong
In commit e1a4e37cc7b6, we clamped the length of bunmapi calls on the data forks of shared files to avoid two failure scenarios: one where the extent being unmapped is so sparsely shared that we exceed the transaction reservation with the sheer number of refcount btree updates and EFI intent items; and the other where we attach so many deferred updates to the transaction that we pin the log tail and later the log head meets the tail, causing the log to livelock. We avoid triggering the first problem by tracking the number of ops in the refcount btree cursor and forcing a requeue of the refcount intent item any time we think that we might be close to overflowing. This has been baked into XFS since before the original e1a4 patch. A recent patchset fixed the second problem by changing the deferred ops code to finish all the work items created by each round of trying to complete a refcount intent item, which eliminates the long chains of deferred items (27dad); and causing long-running transactions to relog their intent log items when space in the log gets low (74f4d). Because this clamp affects /any/ unmapping request regardless of the sharing factors of the component blocks, it degrades the performance of all large unmapping requests -- whereas with an unshared file we can unmap millions of blocks in one go, shared files are limited to unmapping a few thousand blocks at a time, which causes the upper level code to spin in a bunmapi loop even if it wasn't needed. This also eliminates one more place where log recovery behavior can differ from online behavior, because bunmapi operations no longer need to requeue. The fstest generic/447 was created to test the old fix, and it still passes with this applied. Partial-revert-of: e1a4e37cc7b6 ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent") Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished") Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-28xfs: count EFIs when deciding to ask for a continuation of a refcount updateDarrick J. Wong
A long time ago, I added to XFS the ability to use deferred reference count operations as part of a transaction chain. This enabled us to avoid blowing out the transaction reservation when the blocks in a physical extent all had different reference counts because we could ask the deferred operation manager for a continuation, which would get us a clean transaction. The refcount code asks for a continuation when the number of refcount record updates reaches the point where we think that the transaction has logged enough full btree blocks due to refcount (and free space) btree shape changes and refcount record updates that we're in danger of overflowing the transaction. We did not previously count the EFIs logged to the refcount update transaction because the clamps on the length of a bunmap operation were sufficient to avoid overflowing the transaction reservation even in the worst case situation where every other block of the unmapped extent is shared. Unfortunately, the restrictions on bunmap length avoid failure in the worst case by imposing a maximum unmap length of ~3000 blocks, even for non-pathological cases. This seriously limits performance when freeing large extents. Therefore, track EFIs with the same counter as refcount record updates, and use that information as input into when we should ask for a continuation. This enables the next patch to drop the clumsy bunmap limitation. Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished") Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>