Age | Commit message (Collapse) | Author |
|
This patch removes unnecessary error assigns to 0 at places we know that
error is zero because it was checked on non-zero before.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
We always call hold_lkb(lkb) if we increment lkb->lkb_wait_count.
So, we always need to call unhold_lkb(lkb) if we decrement
lkb->lkb_wait_count. This patch will add missing unhold_lkb(lkb) if we
decrement lkb->lkb_wait_count. In case of setting lkb->lkb_wait_count to
zero we need to countdown until reaching zero and call unhold_lkb(lkb).
The waiters list unhold_lkb(lkb) can be removed because it's done for
the last lkb_wait_count decrement iteration as it's done in
_remove_from_waiters().
This issue was discovered by a dlm gfs2 test case which use excessively
dlm_unlock(LKF_CANCEL) feature. Probably the lkb->lkb_wait_count value
never reached above 1 if this feature isn't used and so it was not
discovered before.
The testcase ended in a rsb on the rsb keep data structure with a
refcount of 1 but no lkb was associated with it, which is itself
an invalid behaviour. A side effect of that was a condition in which
the dlm was sending remove messages in a looping behaviour. With this
patch that has not been reproduced.
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
So far bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed
from userspace sync io interface, then block layer tries to poll until
the bio is completed. But the current implementation calls
blk_io_schedule() if bio_poll() returns 0, and this way causes io hang or
timeout easily.
But looks no one reports this kind of issue, which should have been
triggered in normal io poll sanity test or blktests block/007 as
observed by Changhui, that means it is very likely that no one uses it
or no one cares it.
Also after io_uring is invented, io poll for sync dio becomes legacy
interface.
So ignore RWF_HIPRI hint for sync dio.
CC: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Reported-by: Changhui Zhong <czhong@redhat.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Changhui Zhong <czhong@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220420143110.2679002-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We defer file assignment to ensure that fixed files work with links
between a direct accept/open and the links that follow it. But this has
the side effect that normal file assignment is then not complete by the
time that request submission has been done.
For deferred execution, if the file is a regular file, assign it when
we do the async prep anyway.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We need the kernfs/driver core fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Define a function named fsverity_get_digest() to return the verity file
digest and the associated hash algorithm (enum hash_algo).
This assumes that before calling fsverity_get_digest() the file must have
been opened, which is even true for the IMA measure/appraise on file
open policy rule use case (func=FILE_CHECK). do_open() calls vfs_open()
immediately prior to ima_file_check().
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some small driver core and kernfs fixes for some reported
problems. They include:
- kernfs regression that is causing oopses in 5.17 and newer releases
- topology sysfs fixes for a few small reported problems.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: fix NULL dereferencing in kernfs_remove
topology: Fix up build warning in topology_is_visible()
arch_topology: Do not set llc_sibling if llc_id is invalid
topology: make core_mask include at least cluster_siblings
topology/sysfs: Hide PPIN on systems that do not support it.
|
|
The IORING_SQ_NEED_WAKEUP flag is now set using atomic_or() which
implies a full barrier on some architectures but it is not required to
do so. Use the more appropriate smp_mb__after_atomic() which avoids the
extra barrier on those architectures.
Signed-off-by: Almog Khaikin <almogkh@gmail.com>
Link: https://lore.kernel.org/r/20220426163403.112692-1-almogkh@gmail.com
Fixes: 8018823e6987 ("io_uring: serialize ctx->rings->sq_flags with atomic_or/and")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If IORING_SETUP_COOP_TASKRUN is set to use cooperative scheduling for
running task_work, then IORING_SETUP_TASKRUN_FLAG can be set so the
application can tell if task_work is pending in the kernel for this
ring. This allows use cases like io_uring_peek_cqe() to still function
appropriately, or for the task to know when it would be useful to
call io_uring_wait_cqe() to run pending events.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-7-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If this is set, io_uring will never use an IPI to deliver a task_work
notification. This can be used in the common case where a single task or
thread communicates with the ring, and doesn't rely on
io_uring_cqe_peek().
This provides a noticeable win in performance, both from eliminating
the IPI itself, but also from avoiding interrupting the submitting
task unnecessarily.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
While doing so, switch SQPOLL to TWA_SIGNAL_NO_IPI as well, as that
just does a task wakeup and then we can remove the special wakeup we
have in task_work_add.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The only difference between set_notify_signal() and __set_notify_signal()
is that the former checks if it needs to deliver an IPI to force a
reschedule. As the io-wq workers never leave the kernel, and IPI is never
needed, they simply need a wakeup.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Rather than require ctx->completion_lock for ensuring that we don't
clobber the flags, use the atomic bitop helpers instead. This removes
the need to grab the completion_lock, in preparation for needing to set
or clear sq_flags when we don't know the status of this lock.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For now just use a CQE flag for this, with big CQE support we could
return the actual number of bytes left.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
* for-5.19/io_uring-socket: (73 commits)
io_uring: use the text representation of ops in trace
io_uring: rename op -> opcode
io_uring: add io_uring_get_opcode
io_uring: add type to op enum
io_uring: add socket(2) support
net: add __sys_socket_file()
io_uring: fix trace for reduced sqe padding
io_uring: add fgetxattr and getxattr support
io_uring: add fsetxattr and setxattr support
fs: split off do_getxattr from getxattr
fs: split off setxattr_copy and do_setxattr function from setxattr
io_uring: return an error when cqe is dropped
io_uring: use constants for cq_overflow bitfield
io_uring: rework io_uring_enter to simplify return value
io_uring: trace cqe overflows
io_uring: add trace support for CQE overflow
io_uring: allow re-poll if we made progress
io_uring: support MSG_WAITALL for IORING_OP_SEND(MSG)
io_uring: add support for IORING_ASYNC_CANCEL_ANY
io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key
...
|
|
Pull io_uring fixes from Jens Axboe:
"Pretty boring:
- three patches just adding reserved field checks (me, Eugene)
- Fixing a potential regression with IOPOLL caused by a block change
(Joseph)"
Boring is good.
* tag 'io_uring-5.18-2022-04-29' of git://git.kernel.dk/linux-block:
io_uring: check that data field is 0 in ringfd unregister
io_uring: fix uninitialized field in rw io_kiocb
io_uring: check reserved fields for recv/recvmsg
io_uring: check reserved fields for send/sendmsg
|
|
sbi->s_firstinodezone is initialized to 2 and sbi->s_firstdatazone is read
from sbd. There's no guarantee that sbi->s_firstdatazone must bigger than
sbi->s_firstinodezone. If sbi->s_firstdatazone less than 2, the
filesystem can still be mounted unexpetly. At this point, sbi->s_ninodes
flip to very large value and this filesystem is broken. We can observe
this by executing 'df' command. When we execute, we will get an error
message:
"sysv_count_free_inodes: unable to read inode table"
Link: https://lkml.kernel.org/r/20220330104215.530223-1-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
fat*_ent_bread() can be the cause of too many report on I/O error path.
So use fat_msg_ratelimit() instead.
Link: https://lkml.kernel.org/r/87bkxogfeq.fsf@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: qianfan <qianfanguijin@163.com>
Tested-by: qianfan <qianfanguijin@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In order for end users to quickly react to new issues that come up in
production, it is proving useful to leverage the printk indexing system.
This printk index enables kernel developers to use calls to printk() with
changeable ad-hoc format strings (as they always have; no change of
expectations), while enabling end users to examine format strings to
detect changes.
Since end users are using regular expressions to match messages printed
through printk(), being able to detect changes in chosen format strings
from release to release provides a useful signal to review
printk()-matching regular expressions for any necessary updates.
So that detailed FAT messages are captured by this printk index, this
patch wraps fat_msg with a macro.
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/8aaa2dd7995e820292bb40d2120ab69756662c65.1648688136.git.jof@thejof.com
Signed-off-by: Jonathan Lassoff <jof@thejof.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
iput() has already judged the incoming parameter, so there is no need to
repeat outside.
Link: https://lkml.kernel.org/r/1648265418-76563-1-git-send-email-fengyubo3@huawei.com
Signed-off-by: Yubo Feng <fengyubo3@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Fix data-races around epoll reported by KCSAN."
This series suppresses a false positive KCSAN's message and fixes a real
data-race.
This patch (of 2):
pipe_poll() runs locklessly and assigns 1 to poll_usage. Once poll_usage
is set to 1, it never changes in other places. However, concurrent writes
of a value trigger KCSAN, so let's make KCSAN happy.
BUG: KCSAN: data-race in pipe_poll / pipe_poll
write to 0xffff8880042f6678 of 4 bytes by task 174 on cpu 3:
pipe_poll (fs/pipe.c:656)
ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853)
do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234)
__x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
write to 0xffff8880042f6678 of 4 bytes by task 177 on cpu 1:
pipe_poll (fs/pipe.c:656)
ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853)
do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234)
__x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 177 Comm: epoll_race Not tainted 5.17.0-58927-gf443e374ae13 #6
Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
Link: https://lkml.kernel.org/r/20220322002653.33865-1-kuniyu@amazon.co.jp
Link: https://lkml.kernel.org/r/20220322002653.33865-2-kuniyu@amazon.co.jp
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kuniyuki Iwashima <kuni1840@gmail.com>
Cc: "Soheil Hassas Yeganeh" <soheil@google.com>
Cc: "Sridhar Samudrala" <sridhar.samudrala@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove the read_from_oldmem() wrapper introduced earlier and convert all
the remaining callers to pass an iov_iter.
Link: https://lkml.kernel.org/r/20220408090636.560886-4-bhe@redhat.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Amit Daniel Kachhap <amit.kachhap@arm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This gets rid of copy_to() and let us use proc_read_iter() instead of
proc_read().
Link: https://lkml.kernel.org/r/20220408090636.560886-3-bhe@redhat.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Convert vmcore to use an iov_iter", v5.
For some reason several people have been sending bad patches to fix
compiler warnings in vmcore recently. Here's how it should be done.
Compile-tested only on x86. As noted in the first patch, s390 should take
this conversion a bit further, but I'm not inclined to do that work
myself.
This patch (of 3):
Instead of passing in a 'buf' and 'userbuf' argument, pass in an iov_iter.
s390 needs more work to pass the iov_iter down further, or refactor, but
I'd be more comfortable if someone who can test on s390 did that work.
It's more convenient to convert the whole of read_from_oldmem() to take an
iov_iter at the same time, so rename it to read_from_oldmem_iter() and add
a temporary read_from_oldmem() wrapper that creates an iov_iter.
Link: https://lkml.kernel.org/r/20220408090636.560886-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20220408090636.560886-2-bhe@redhat.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When list_for_each_entry() completes the iteration over the whole list
without breaking the loop, the iterator value will be a bogus pointer
computed based on the head element.
While it is safe to use the pointer to determine if it was computed based
on the head element, either with list_entry_is_head() or &pos->member ==
head, using the iterator variable after the loop should be avoided.
In preparation to limit the scope of a list iterator to the list traversal
loop, use a dedicated pointer to point to the found element [1].
[akpm@linux-foundation.org: reduce scope of `iter']
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/20220331223700.902556-1-jakobkoschel@gmail.com
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Brian Johannesmeyer" <bjohannesmeyer@gmail.com>
Cc: Cristiano Giuffrida <c.giuffrida@vu.nl>
Cc: "Bos, H.J." <h.j.bos@vu.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Current ocfs2_fill_super() uses one goto label "read_super_error" to
handle all error cases. And with previous serial patches, the error
handling should fork more branches to handle different error cases. This
patch rewrite the error handling of ocfs2_fill_super.
Link: https://lkml.kernel.org/r/20220424130952.2436-6-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After this patch, when error, ocfs2_fill_super doesn't take care to
release resources which are allocated in ocfs2_mount_volume.
Link: https://lkml.kernel.org/r/20220424130952.2436-5-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After this patch, when error, ocfs2_fill_super doesn't take care to
release resources which are allocated in ocfs2_initialize_super.
Link: https://lkml.kernel.org/r/20220424130952.2436-4-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since ocfs2_resmap_init() always return 0, change it to void.
Link: https://lkml.kernel.org/r/20220424130952.2436-3-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "rewrite error handling during mounting stage".
This patch (of 5):
After commit da5e7c87827e8 ("ocfs2: cleanup journal init and shutdown"),
journal init later than before, it makes NULL pointer access in free
routine.
Crash flow:
ocfs2_fill_super
+ ocfs2_mount_volume
| + ocfs2_dlm_init //fail & return, osb->journal is NULL.
| + ...
| + ocfs2_check_volume //no chance to init osb->journal
|
+ ...
+ ocfs2_dismount_volume
ocfs2_release_system_inodes
...
evict
...
ocfs2_clear_inode
ocfs2_checkpoint_inode
ocfs2_ci_fully_checkpointed
time_after(journal->j_trans_id, ci->ci_last_trans)
+ journal is empty, crash!
For fixing, there are three solutions:
1> Partly revert commit da5e7c87827e8
For avoiding kernel crash, this make sense for us. We only
concerned whether there has any non-system inode access before dlm
init. The answer is NO. And all journal replay/recovery handling
happen after dlm & journal init done. So this method is not graceful
but workable.
2> Add osb->journal check in free inode routine (eg ocfs2_clear_inode)
The fix code is special for mounting phase, but it will continue
working after mounting stage. In another word, this method adds
useless code in normal inode free flow.
3> Do directly free inode in mounting phase
This method is brutal/complex and may introduce unsafe code,
currently maintainer didn't like.
At last, we chose method <1> and did partly reverted job. We reverted
journal init codes, and kept cleanup codes flow.
Link: https://lkml.kernel.org/r/20220424130952.2436-1-heming.zhao@suse.com
Link: https://lkml.kernel.org/r/20220424130952.2436-2-heming.zhao@suse.com
Fixes: da5e7c87827e8 ("ocfs2: cleanup journal init and shutdown")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
To move the list iterator variable into the list_for_each_entry_*() macro
in the future it should be avoided to use the list iterator variable after
the loop body.
To *never* use the list iterator variable after the loop it was concluded
to use a separate iterator variable [1].
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220322105014.3626194-1-jakobkoschel@gmail.com
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
To move the list iterator variable into the list_for_each_entry_*() macro
in the future it should be avoided to use the list iterator variable after
the loop body.
To *never* use the list iterator variable after the loop it was concluded
to use a separate iterator variable instead of a found boolean [1].
This removes the need to use a found variable and simply checking if the
variable was set, can determine if the break/goto was hit.
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220324071650.61168-1-jakobkoschel@gmail.com
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull ceph client fixes from Ilya Dryomov:
"A fix for a NULL dereference that turns out to be easily triggerable
by fsync (marked for stable) and a false positive WARN and snap_rwsem
locking fixups"
* tag 'ceph-for-5.18-rc5' of https://github.com/ceph/ceph-client:
ceph: fix possible NULL pointer dereference for req->r_session
ceph: remove incorrect session state check
ceph: get snap_rwsem read lock in handle_cap_export for ceph_add_cap
libceph: disambiguate cluster/pool full log message
|
|
Only allow data field to be 0 in struct io_uring_rsrc_update user
arguments to allow for future possible usage.
Fixes: e7a6c00dc77a ("io_uring: add support for registering ring file descriptors")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Link: https://lore.kernel.org/r/20220429142218.GA28696@asgard.redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Some applications or containers want to use KSM by calling madvise() to
advise areas of address space to be MERGEABLE. But they may not know
which applications are more likely to cause real merges in the
deployment. If this patch is applied, it helps them know their
corresponding number of merged pages, and then optimize their app code.
As current KSM only counts the number of KSM merging pages(e.g.
ksm_pages_sharing and ksm_pages_shared) of the whole system, we cannot see
the more fine-grained KSM merging, for the upper application optimization,
the merging area cannot be set easily according to the KSM page merging
probability of each process. Therefore, it is necessary to add extra
statistical means so that the upper level users can know the detailed KSM
merging information of each process.
We add a new proc file named as ksm_merging_pages under /proc/<pid>/ to
indicate the involved ksm merging pages of this process.
[akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s]
Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Ohhoon Kwon <ohoono.kwon@samsung.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The word of "free" is not expressive enough to express the feature of
optimizing vmemmap pages associated with each HugeTLB, rename this keywork
to "optimize". In this patch , cheanup configs to make code more
expressive.
Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently dax_mapping_entry_mkclean() fails to clean and write protect the
pte entry within a DAX PMD entry during an *sync operation. This can
result in data loss in the following sequence:
1) process A mmap write to DAX PMD, dirtying PMD radix tree entry and
making the pmd entry dirty and writeable.
2) process B mmap with the @offset (e.g. 4K) and @length (e.g. 4K)
write to the same file, dirtying PMD radix tree entry (already
done in 1)) and making the pte entry dirty and writeable.
3) fsync, flushing out PMD data and cleaning the radix tree entry. We
currently fail to mark the pte entry as clean and write protected
since the vma of process B is not covered in dax_entry_mkclean().
4) process B writes to the pte. These don't cause any page faults since
the pte entry is dirty and writeable. The radix tree entry remains
clean.
5) fsync, which fails to flush the dirty PMD data because the radix tree
entry was clean.
6) crash - dirty data that should have been fsync'd as part of 5) could
still have been in the processor cache, and is lost.
Just to use pfn_mkclean_range() to clean the pfns to fix this issue.
Link: https://lkml.kernel.org/r/20220403053957.10770-6-songmuchun@bytedance.com
Fixes: 4b4bb46d00b3 ("dax: clear dirty entry tags on cache flush")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The flush_cache_page() only remove a PAGE_SIZE sized range from the cache.
However, it does not cover the full pages in a THP except a head page.
Replace it with flush_cache_range() to fix this issue. This is just a
documentation issue with the respect to properly documenting the expected
usage of cache flushing before modifying the pmd. However, in practice
this is not a problem due to the fact that DAX is not available on
architectures with virtually indexed caches per:
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Link: https://lkml.kernel.org/r/20220403053957.10770-3-songmuchun@bytedance.com
Fixes: f729c8c9b24f ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
pte_page() always returns a valid page, so remove the redundant page
validation, as we did in many other places.
Link: https://lkml.kernel.org/r/20220316025947.328276-1-xianting.tian@linux.alibaba.com
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The feature of minimizing overhead of struct page associated with each
HugeTLB page is implemented on x86_64, however, the infrastructure of this
feature is already there, we could easily enable it for other
architectures. Introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP for other
architectures to be easily enabled. Just select this config if they want
to enable this feature.
Link: https://lkml.kernel.org/r/20220331065640.5777-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Tested-by: Barry Song <baohua@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
io_rw_init_file does not initialize kiocb->private, so when iocb_bio_iopoll
reads kiocb->private it can contain uninitialized data.
Fixes: 3e08773c3841 ("block: switch polling to be bio based")
Signed-off-by: Joseph Ravichandran <jravi@mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
These functions return the maximum number of blocks that could be logged
in a particular transaction. "log count" is confusing since there's a
separate concept of a log (operation) count in the reservation code, so
let's change it to "block count" to be less confusing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Currently, the code that performs CoW remapping after a write has this
odd behavior where it walks /backwards/ through the data fork to remap
extents in reverse order. Earlier, we rewrote the reflink remap
function to use deferred bmap log items instead of trying to cram as
much into the first transaction that we could. Now do the same for the
CoW remap code. There doesn't seem to be any performance impact; we're
just making better use of code that we added for the benefit of reflink.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Before to the introduction of deferred refcount operations, reflink
would try to cram refcount btree updates into the same transaction as an
allocation or a free event. Mainline XFS has never actually done that,
but we never refactored the transaction reservations to reflect that we
now do all refcount updates in separate transactions. Fix this to
reduce the transaction reservation size even farther, so that between
this patch and the previous one, we reduce the tr_write and tr_itruncate
sizes by 66%.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Back in the early days of reflink and rmap development I set the
transaction reservation sizes to be overly generous for rmap+reflink
filesystems, and a little under-generous for rmap-only filesystems.
Since we don't need *eight* transaction rolls to handle three new log
intent items, decrease the logcounts to what we actually need, and amend
the shadow reservation computation function to reflect what we used to
do so that the minimum log size doesn't change.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Move the tracepoint that computes the size of the transaction used to
compute the minimum log size into xfs_log_get_max_trans_res so that we
only have to compute this stuff once.
Leave xfs_log_get_max_trans_res as a non-static function so that xfs_db
can call it to report the results of the userspace computation of the
same value to diagnose mkfs/kernel misinteractions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Every time someone changes the transaction reservation sizes, they
introduce potential compatibility problems if the changes affect the
minimum log size that we validate at mount time. If the minimum log
size gets larger (which should be avoided because doing so presents a
serious risk of log livelock), filesystems created with old mkfs will
not mount on a newer kernel; if the minimum size shrinks, filesystems
created with newer mkfs will not mount on older kernels.
Therefore, enable the creation of a shadow log reservation structure
where we can "undo" the effects of tweaks when computing minimum log
sizes. These shadow reservations should never be used in practice, but
they insulate us from perturbations in minimum log size.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
This raw call isn't necessary since we can always remove a full delalloc
extent.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In commit e1a4e37cc7b6, we clamped the length of bunmapi calls on the
data forks of shared files to avoid two failure scenarios: one where the
extent being unmapped is so sparsely shared that we exceed the
transaction reservation with the sheer number of refcount btree updates
and EFI intent items; and the other where we attach so many deferred
updates to the transaction that we pin the log tail and later the log
head meets the tail, causing the log to livelock.
We avoid triggering the first problem by tracking the number of ops in
the refcount btree cursor and forcing a requeue of the refcount intent
item any time we think that we might be close to overflowing. This has
been baked into XFS since before the original e1a4 patch.
A recent patchset fixed the second problem by changing the deferred ops
code to finish all the work items created by each round of trying to
complete a refcount intent item, which eliminates the long chains of
deferred items (27dad); and causing long-running transactions to relog
their intent log items when space in the log gets low (74f4d).
Because this clamp affects /any/ unmapping request regardless of the
sharing factors of the component blocks, it degrades the performance of
all large unmapping requests -- whereas with an unshared file we can
unmap millions of blocks in one go, shared files are limited to
unmapping a few thousand blocks at a time, which causes the upper level
code to spin in a bunmapi loop even if it wasn't needed.
This also eliminates one more place where log recovery behavior can
differ from online behavior, because bunmapi operations no longer need
to requeue. The fstest generic/447 was created to test the old fix, and
it still passes with this applied.
Partial-revert-of: e1a4e37cc7b6 ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent")
Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
A long time ago, I added to XFS the ability to use deferred reference
count operations as part of a transaction chain. This enabled us to
avoid blowing out the transaction reservation when the blocks in a
physical extent all had different reference counts because we could ask
the deferred operation manager for a continuation, which would get us a
clean transaction.
The refcount code asks for a continuation when the number of refcount
record updates reaches the point where we think that the transaction has
logged enough full btree blocks due to refcount (and free space) btree
shape changes and refcount record updates that we're in danger of
overflowing the transaction.
We did not previously count the EFIs logged to the refcount update
transaction because the clamps on the length of a bunmap operation were
sufficient to avoid overflowing the transaction reservation even in the
worst case situation where every other block of the unmapped extent is
shared.
Unfortunately, the restrictions on bunmap length avoid failure in the
worst case by imposing a maximum unmap length of ~3000 blocks, even for
non-pathological cases. This seriously limits performance when freeing
large extents.
Therefore, track EFIs with the same counter as refcount record updates,
and use that information as input into when we should ask for a
continuation. This enables the next patch to drop the clumsy bunmap
limitation.
Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|