This patch fixes a race in nodeid2con() in the case that two lookups run
in parallel and both create a connection structure for the same nodeid.
Creating a new connection structure is a rare case. To keep the reader
side lockless, we simply do the lookup again inside the protected area
and drop the previous work if this race happens.
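Roughly, the pattern looks like this sketch (__find_con(), connections_lock
and nodeid_hash() are illustrative names, not necessarily the real ones):

    static struct connection *nodeid2con(int nodeid, gfp_t alloc)
    {
        struct connection *con, *other;

        con = __find_con(nodeid);      /* lockless reader-side lookup */
        if (con || !alloc)
            return con;

        con = kzalloc(sizeof(*con), alloc);
        if (!con)
            return NULL;

        spin_lock(&connections_lock);
        other = __find_con(nodeid);    /* look up again inside the protection area */
        if (other) {
            kfree(con);                /* lost the race, drop our work */
            con = other;
        } else {
            hlist_add_head_rcu(&con->list,
                               &connection_hash[nodeid_hash(nodeid)]);
        }
        spin_unlock(&connections_lock);

        return con;
    }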
Fixes: a47666eb763cc ("fs: dlm: make connection hash lockless")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Calling pipe2() with O_NOTIFICATION_PIPE can result in memory leaks
when watch_queue_init() fails.
In case of watch_queue_init() failure in pipe2() we are left
with inode and pipe_inode_info instances that need to be freed. That
failure exit has been introduced in commit c73be61cede5 ("pipe: Add
general notification queue support") and its handling should've been
identical to nearby treatment of alloc_file_pseudo() failures - it
is dealing with the same situation. As it is, the mainline kernel
leaks in that case.
Another problem is that the CONFIG_WATCH_QUEUE and !CONFIG_WATCH_QUEUE
cases are treated differently (the former leaks just the pipe_inode_info,
the latter both the pipe_inode_info and the inode).
Fixed by providing a dummy watch_queue_init() in the !CONFIG_WATCH_QUEUE
case and by having failures of watch_queue_init() handled the same way
we handle alloc_file_pseudo() ones.
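A sketch of the shape of the fix; the exact stub return value and the
error label are assumptions:

    /* in a header, when watch queues are compiled out */
    #ifndef CONFIG_WATCH_QUEUE
    static inline int watch_queue_init(struct pipe_inode_info *pipe)
    {
        return -ENOPKG;  /* assumed; the point is both configs share one path */
    }
    #endif

    /* in the pipe2() path, mirror the alloc_file_pseudo() error handling */
    if (watch_queue_init(pipe) < 0)
        goto err_free_pipe_and_inode;  /* hypothetical label freeing both */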
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Signed-off-by: Qian Cai <cai@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
With a suitably crafted reiserfs image and mount command, reiserfs will
crash when trying to verify that the XATTR_ROOT directory can be looked up
in / as that recurses back into the xattr code like:
xattr_lookup+0x24/0x280 fs/reiserfs/xattr.c:395
reiserfs_xattr_get+0x89/0x540 fs/reiserfs/xattr.c:677
reiserfs_get_acl+0x63/0x690 fs/reiserfs/xattr_acl.c:209
get_acl+0x152/0x2e0 fs/posix_acl.c:141
check_acl fs/namei.c:277 [inline]
acl_permission_check fs/namei.c:309 [inline]
generic_permission+0x2ba/0x550 fs/namei.c:353
do_inode_permission fs/namei.c:398 [inline]
inode_permission+0x234/0x4a0 fs/namei.c:463
lookup_one_len+0xa6/0x200 fs/namei.c:2557
reiserfs_lookup_privroot+0x85/0x1e0 fs/reiserfs/xattr.c:972
reiserfs_fill_super+0x2b51/0x3240 fs/reiserfs/super.c:2176
mount_bdev+0x24f/0x360 fs/super.c:1417
Fix the problem by bailing from reiserfs_xattr_get() when xattrs are not
yet initialized.
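The shape of the bail-out, as a hedged sketch (the exact predicate used to
detect uninitialized xattrs is an assumption):

    int reiserfs_xattr_get(struct inode *inode, const char *name,
                           void *buffer, size_t size)
    {
        /* xattr machinery not set up yet (early in mount): bail out
         * instead of recursing back into lookup */
        if (!xattrs_initialized(inode->i_sb))  /* hypothetical helper */
            return -EOPNOTSUPP;

        /* ... normal xattr lookup continues here ... */
    }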
CC: stable@vger.kernel.org
Reported-by: syzbot+9b33c9b118d77ff59b6f@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
We always use &req->task_work anyway, no point in passing it in.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All request preparations are done only during submission, reflect it in
the code by moving io_req_prep() much earlier into io_queue_sqe().
That's much cleaner, because it doesn't expose bits to async code which
it won't ever use. It also makes the interface harder to misuse and
removes potential places for bugs.
For instance, __io_queue_sqe() doesn't clear @sqe before proceeding to
the next linked request, which could have been disastrous, but fortunately
there are linked requests IFF sqe == NULL, so it's not actually a bug.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_issue_sqe() does two things at once: preparing requests and issuing
them. Split it in two and deduplicate with io_defer_prep().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All io_*_prep() functions including io_{read,write}_prep() are called
only during submission where @force_nonblock is always true. Don't keep
propagating it and instead remove the @force_nonblock argument
from prep() altogether.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so
it's set/cleared in a single place. Also remove @force_nonblock
parameter from io_prep_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
REQ_F_NEED_CLEANUP is set only by the io_*_prep() handlers, and they're
guaranteed to be called only once, so nothing can have set the flag
beforehand. Kill the REQ_F_NEED_CLEANUP check in these *prep() handlers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Put brackets around bitwise ops in a complex expression.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Extract the common code from the if/else branches. That is cleaner, and
the compiler optimises it even better.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This flag is no longer used, remove it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The smart syzbot has found a reproducer for the following issue:
==================================================================
BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771
CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
check_memory_region_inline mm/kasan/generic.c:186 [inline]
check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
instrument_atomic_write include/linux/instrumented.h:71 [inline]
atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
io_wqe_inc_running fs/io-wq.c:301 [inline]
io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
Allocated by task 7768:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
kmalloc_node include/linux/slab.h:572 [inline]
kzalloc_node include/linux/slab.h:677 [inline]
io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
io_init_wq_offload fs/io_uring.c:7432 [inline]
io_sq_offload_start fs/io_uring.c:7504 [inline]
io_uring_create fs/io_uring.c:8625 [inline]
io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 21:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
__cache_free mm/slab.c:3418 [inline]
kfree+0x10e/0x2b0 mm/slab.c:3756
__io_wq_destroy fs/io-wq.c:1138 [inline]
io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
io_finish_async fs/io_uring.c:6836 [inline]
io_ring_ctx_free fs/io_uring.c:7870 [inline]
io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
The buggy address belongs to the object at ffff8882183db000
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 140 bytes inside of
1024-byte region [ffff8882183db000, ffff8882183db400)
The buggy address belongs to the page:
page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
flags: 0x57ffe0000000200(slab)
raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
which comes down to the snippet below,
/* all workers gone, wq exit can proceed */
if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
complete(&wqe->wq->done);
because there may be multiple wqe instances in a wq, and we would wait
for every worker in every wqe to go home before releasing the wq's
resources on destroy.
To that end, rework the wq's refcount by making it independent of worker
tracking (after all, they are two different things), and keep it balanced
as workers come and go. Note that the manager kthread, like other workers,
now holds a reference to the wq during its lifetime.
Finally, to help destroy the wq, check IO_WQ_BIT_EXIT upon creating a worker
and do nothing for an exiting wq.
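A hedged sketch of the reworked lifetime rules (helper names only loosely
follow fs/io-wq.c):

    /* creating a worker: do nothing for an exiting wq, else pin the wq */
    static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe)
    {
        if (test_bit(IO_WQ_BIT_EXIT, &wq->state))
            return false;
        refcount_inc(&wq->refs);    /* each worker holds a wq reference */
        /* ... allocate and start the worker ... */
        return true;
    }

    /* worker exit: drop the reference; the last one gates wq destruction */
    static void io_worker_exit(struct io_worker *worker)
    {
        struct io_wq *wq = worker->wqe->wq;

        /* ... detach from wqe lists ... */
        if (refcount_dec_and_test(&wq->refs))
            complete(&wq->done);
    }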
Cc: stable@vger.kernel.org # v5.5+
Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
io_uring instances on a host. Since all sqthreads are named
"io_uring-sq", it's hard to distinguish the relation between an
application process and its io_uring sqthread.
With this patch, an application can get its corresponding sqthread pid
and cpu through show_fdinfo.
Steps:
1. Get io_uring fd first.
$ ls -l /proc/<pid>/fd | grep -w io_uring
2. Then get io_uring instance related info, including corresponding
sqthread pid and cpu.
$ cat /proc/<pid>/fdinfo/<io_uring_fd>
pos: 0
flags: 02000002
mnt_id: 13
SqThread: 6929
SqThreadCpu: 2
UserFiles: 1
0: testfile
UserBufs: 0
PollList:
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
[axboe: fixed for new shared SQPOLL]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We do this for the CQ ring wait, in case task_work completions come in. We
should do the same in io_uring_register(), to avoid spurious -EINTR
if the ring quiescing ends up having to process task_work to complete
the operation.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are a few operations that are offloaded to the worker threads. In
this case, we lose process context and end up in kthread context. This
results in IOs not being accounted to the issuing cgroup, and
consequently they end up attributed to root. Just like the others, adopt
the personality of the blkcg too when issuing via the workqueues.
The SQPOLL thread will live and attach in the context of the cgroup it
was inited in.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_uring accounts any registered buffer as pinned/locked memory, checks
the limit, and fails if the given user doesn't have a big enough limit
to register the ranges specified. However, if huge pages are used, we
are potentially under-accounting the memory in terms of what gets pinned
on the vm side.
This patch rectifies that by ensuring that we account the full size of
a compound page, regardless of how much of it is being registered. Huge
pages are not accounted multiple times: if multiple sections of a huge
page are registered, the page is only accounted once.
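A sketch of the accounting logic; the helper name is hypothetical, but
compound_head() and page_size() are the standard kernel primitives:

    /* account each compound (huge) page once, at its full size */
    static size_t io_account_buf_pages(struct page **pages, int nr_pages)
    {
        struct page *last_hpage = NULL;
        size_t acct = 0;
        int i;

        for (i = 0; i < nr_pages; i++) {
            struct page *hpage = compound_head(pages[i]);

            if (hpage == last_hpage)
                continue;               /* section of an already-counted page */
            last_hpage = hpage;
            acct += page_size(hpage);   /* full size, however much is registered */
        }
        return acct;
    }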
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fixes coccicheck warning:
fs/io_uring.c:4242:13-14: Unneeded semicolon
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In the spirit of fairness, cap the max number of SQ entries we'll submit
for SQPOLL if we have multiple rings. If we don't do that, we could be
submitting tons of entries for one ring, while others are waiting to get
service.
The value of 8 is somewhat arbitrarily chosen as something that allows
a fair bit of batching, without spending excessive time on any one ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There's really no point in having this union, it just means that we're
always allocating enough room to cater to any command. But that's
pointless, as the ->io field is request type private anyway.
This gets rid of the io_async_ctx structure, and fills in the required
size in the io_op_defs[] instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Testing ctx->user_bufs for NULL in io_import_fixed() is not necessary,
because in that case ctx->nr_user_bufs would be zero, and the following
check would fail.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()s
iorw->iter into itself. Even though that doesn't lead to an error, such
an aliasing-rules violation in memcpy() is considered bad practice.
Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
remapping there, so it's much simpler than the generic implementation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Set rw->free_iovec to @iovec; that gives an identical result and stresses
that the @iovec param and rw->free_iovec play the same role.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Don't touch iter->iov and iov in between __io_import_iovec() and
io_req_map_rw(); the former function already sets it correctly, and
touching it creates one more NULL'ed-iov case to consider in
io_req_map_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When using SQPOLL, applications can run into the issue of running out of
SQ ring entries because the thread hasn't consumed them yet. The only
option for dealing with that is checking later, or busy checking for the
condition.
Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
condition.
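A hedged userspace sketch using the raw syscall (ring_fd is an assumed,
already-set-up SQPOLL ring; constants come from the uapi headers):

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/io_uring.h>

    /* block until the kernel side has consumed entries and there is
       room in the SQ ring again, instead of busy checking */
    int ret = syscall(__NR_io_uring_enter, ring_fd, 0, 0,
                      IORING_ENTER_SQ_WAIT, NULL, 0);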
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
These structures are never written, move them appropriately.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We support using IORING_SETUP_ATTACH_WQ to share async backends between
rings created by the same process, this now also allows the same to
happen with SQPOLL. The setup procedure remains the same, the caller
sets io_uring_params->wq_fd to the 'parent' context, and then the newly
created ring will attach to that async backend.
This means that multiple rings can share the same SQPOLL thread, saving
resources.
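Setup might look like this sketch (parent_fd is an assumed, already
created SQPOLL ring):

    struct io_uring_params p = {
        .flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ,
        .wq_fd = parent_fd,   /* the 'parent' context to share with */
    };
    int fd = syscall(__NR_io_uring_setup, entries, &p);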
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the SQPOLL thread from the ctx, and use the io_sq_data as the
data structure we pass in. io_sq_data has a list of ctx's that we can
then iterate over and handle.
As of now we're ready to handle multiple ctx's, though we're still just
handling a single one after this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move all the necessary state out of io_ring_ctx, and into a new
structure, io_sq_data. The latter now deals with any state or
variables associated with the SQPOLL thread itself.
In preparation for supporting more than one io_ring_ctx per SQPOLL
thread.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is done in preparation for handling more than one ctx, but it also
cleans up the code a bit since io_sq_thread() was a bit too unwieldy to
get a good overview of.
__io_sq_thread() is now the main handler, and it returns an enum sq_ret
that tells io_sq_thread() what it ended up doing. The parent then makes
a decision on idle, spinning, or work handling based on that.
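As a sketch (the enum value names are assumptions based on the description):

    enum sq_ret {
        SQT_IDLE,      /* nothing found to do */
        SQT_SPIN,      /* keep busy-polling for a while */
        SQT_DID_WORK,  /* submitted work, go around again */
    };

    /* parent loop: idle, spin, or handle work based on the child's answer */
    while (!kthread_should_stop()) {
        switch (__io_sq_thread(ctx)) {
        case SQT_SPIN:
        case SQT_DID_WORK:
            continue;
        case SQT_IDLE:
            break;      /* fall through to sleeping below */
        }
        schedule();     /* prepare-to-sleep/wakeup arming elided */
    }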
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We need to decouple the clearing on wakeup from the inline schedule,
as that is going to be required for handling multiple rings in one
thread.
Wrap our wakeup handler so we can clear it when we get the wakeup, by
definition that is when we no longer need the flag set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is in preparation to sharing the poller thread between rings. For
that we need per-ring wait_queue_entry storage, and we can't easily put
that on the stack if one thread is managing multiple rings.
We'll also be sharing the wait_queue_head across rings for the purposes
of wakeups, provide the usual private ring wait_queue_head for now but
make it a pointer so we can easily override it when sharing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We're not handling signals by default in kernel threads, and we never
use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
have a signal pending, and we don't need to check for it (nor flush it).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
During a context switch the scheduler invokes wq_worker_sleeping() with
disabled preemption. Disabling preemption is needed because it protects
access to `worker->sleeping'. As an optimisation it avoids invoking
schedule() within the schedule path as part of possible wake up (thus
preempt_enable_no_resched() afterwards).
The io-wq has been added to the mix in the same section with disabled
preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
acquires a spinlock_t. Also within the schedule() the spinlock_t must be
acquired after tsk_is_pi_blocked() otherwise it will block on the
sleeping lock again while scheduling out.
While playing with `io_uring-bench' I didn't notice a significant
latency spike after converting io_wqe::lock to a raw_spinlock_t. The
latency was more or less the same.
In order to keep the spinlock_t it would have to be moved after the
tsk_is_pi_blocked() check which would introduce a branch instruction
into the hot path.
The lock is used to maintain the `work_list' and wakes one task up at
most.
Should io_wqe_cancel_pending_work() cause latency spikes, while
searching for a specific item, then it would need to drop the lock
during iterations.
revert_creds() is also invoked under the lock. According to debug
cred::non_rcu is 0. Otherwise it should be moved outside of the locked
section because put_cred_rcu()->free_uid() acquires a sleeping lock.
Convert io_wqe::lock to a raw_spinlock_t.
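The conversion itself is mechanical, as in this sketch:

    struct io_wqe {
        raw_spinlock_t lock;              /* was: spinlock_t lock; */
        struct io_wq_work_list work_list;
        /* ... */
    };

    /* callers change in kind, e.g. raw_spin_lock_irq(&wqe->lock); */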
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch adds a new IORING_SETUP_R_DISABLED flag to start the
rings disabled, allowing the user to register restrictions,
buffers, and files before SQE processing starts.
When IORING_SETUP_R_DISABLED is set, SQEs are not processed and the
SQPOLL kthread is not started.
Registering restrictions is only allowed while the rings are
disabled, to prevent concurrency issues while processing SQEs.
The rings can be enabled using IORING_REGISTER_ENABLE_RINGS
opcode with io_uring_register(2).
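A possible userspace flow, sketched with raw syscalls (error handling
omitted):

    struct io_uring_params p = { .flags = IORING_SETUP_R_DISABLED };
    int fd = syscall(__NR_io_uring_setup, entries, &p);

    /* rings are disabled: register restrictions, buffers, files here */

    /* start SQE processing (and the SQPOLL kthread, if requested) */
    syscall(__NR_io_uring_register, fd, IORING_REGISTER_ENABLE_RINGS,
            NULL, 0);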
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The new io_uring_register(2) IORING_REGISTER_RESTRICTIONS opcode
permanently installs a feature allowlist on an io_ring_ctx.
The io_ring_ctx can then be passed to untrusted code with the
knowledge that only operations present in the allowlist can be
executed.
The allowlist approach ensures that new features added to io_uring
do not accidentally become available when an existing application
is launched on a newer kernel version.
Currently it is possible to restrict sqe opcodes, sqe flags, and
register opcodes.
IORING_REGISTER_RESTRICTIONS can only be invoked once; afterwards
it is not possible to change the restrictions anymore.
This prevents untrusted code from removing restrictions.
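For example, an allowlist permitting only readv and the enable-rings
register opcode could be built like this sketch (struct and constant
names from the io_uring uapi headers):

    struct io_uring_restriction res[] = {
        { .opcode = IORING_RESTRICTION_SQE_OP,
          .sqe_op = IORING_OP_READV },
        { .opcode = IORING_RESTRICTION_REGISTER_OP,
          .register_op = IORING_REGISTER_ENABLE_RINGS },
    };

    syscall(__NR_io_uring_register, fd, IORING_REGISTER_RESTRICTIONS,
            res, sizeof(res) / sizeof(res[0]));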
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If we don't get and assign the namespace for the async work, then certain
paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
Anything that references the current namespace of the given task should
be assigned for async work on behalf of that task.
Cc: stable@vger.kernel.org # v5.5+
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Grab actual references to the files_struct. To avoid circular reference
issues due to this, we add a per-task note that keeps track of what
io_uring contexts a task has used. When the task execs or exits its
assigned files, we cancel requests based on this tracking.
With that, we can grab proper references to the files table, and no
longer need to rely on stashing away ring_fd and ring_file to check
if the ring_fd may have been closed.
Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This allows us to selectively flush out pending overflows, depending on
the task and/or files_struct being passed in.
No intended functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Return whether we found and canceled requests or not. This is in
preparation for using this information, no functional changes in this
patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Sometimes we assign a weak reference to it, sometimes we grab a
reference to it. Clean this up and make it unconditional, and drop the
flag related to tracking this state.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We can grab a reference to the task instead of stashing away the task
files_struct. This is doable without creating a circular reference
between the ring fd and the task itself.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No functional changes in this patch, prep patch for grabbing references
to the files_struct.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We currently cancel these when the ring exits, and we cancel all of
them. This is in preparation for killing only the ones associated
with a given task.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
* io_uring-5.9:
io_uring: fix async buffered reads when readahead is disabled
io_uring: fix potential ABBA deadlock in ->show_fdinfo()
io_uring: always delete double poll wait entry on match
|
|
We use a device's allocation state tree to track ranges in a device used
for allocated chunks, and we set ranges in this tree when allocating a new
chunk. However after a device replace operation, we were not setting the
allocated ranges in the new device's allocation state tree, so that tree
is empty after a device replace.
This means that a fitrim operation after a device replace will trim the
device ranges that have allocated chunks and extents, as we trim every
range for which there is not a range marked in the device's allocation
state tree. It is also important during chunk allocation, since the
device's allocation state is used to determine if a range is already
allocated when allocating a new chunk.
This is trivial to reproduce and the following script triggers the bug:
$ cat reproducer.sh
#!/bin/bash
DEV1="/dev/sdg"
DEV2="/dev/sdh"
DEV3="/dev/sdi"
wipefs -a $DEV1 $DEV2 $DEV3 &> /dev/null
# Create a raid1 test fs on 2 devices.
mkfs.btrfs -f -m raid1 -d raid1 $DEV1 $DEV2 > /dev/null
mount $DEV1 /mnt/btrfs
xfs_io -f -c "pwrite -S 0xab 0 10M" /mnt/btrfs/foo
echo "Starting to replace $DEV1 with $DEV3"
btrfs replace start -B $DEV1 $DEV3 /mnt/btrfs
echo
echo "Running fstrim"
fstrim /mnt/btrfs
echo
echo "Unmounting filesystem"
umount /mnt/btrfs
echo "Mounting filesystem in degraded mode using $DEV3 only"
wipefs -a $DEV1 $DEV2 &> /dev/null
mount -o degraded $DEV3 /mnt/btrfs
if [ $? -ne 0 ]; then
dmesg | tail
echo
echo "Failed to mount in degraded mode"
exit 1
fi
echo
echo "File foo data (expected all bytes = 0xab):"
od -A d -t x1 /mnt/btrfs/foo
umount /mnt/btrfs
When running the reproducer:
$ ./reproducer.sh
wrote 10485760/10485760 bytes at offset 0
10 MiB, 2560 ops; 0.0901 sec (110.877 MiB/sec and 28384.5216 ops/sec)
Starting to replace /dev/sdg with /dev/sdi
Running fstrim
Unmounting filesystem
Mounting filesystem in degraded mode using /dev/sdi only
mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/sdi, missing codepage or helper program, or other error.
[19581.748641] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi started
[19581.803842] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi finished
[19582.208293] BTRFS info (device sdi): allowing degraded mounts
[19582.208298] BTRFS info (device sdi): disk space caching is enabled
[19582.208301] BTRFS info (device sdi): has skinny extents
[19582.212853] BTRFS warning (device sdi): devid 2 uuid 1f731f47-e1bb-4f00-bfbb-9e5a0cb4ba9f is missing
[19582.213904] btree_readpage_end_io_hook: 25839 callbacks suppressed
[19582.213907] BTRFS error (device sdi): bad tree block start, want 30490624 have 0
[19582.214780] BTRFS warning (device sdi): failed to read root (objectid=7): -5
[19582.231576] BTRFS error (device sdi): open_ctree failed
Failed to mount in degraded mode
So fix this by setting all allocated ranges in the replace target device
when the replace operation is finishing, while we are holding the chunk
mutex and can not race with new chunk allocations.
A test case for fstests follows soon.
Fixes: 1c11b63eff2a67 ("btrfs: replace pending/pinned chunks lists with io tree")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When closing and freeing the source device we could end up doing our
final blkdev_put() on the bdev, which will grab the bd_mutex. As such
we want to be holding as few locks as possible, so move this call
outside of the dev_replace->lock_finishing_cancel_unmount lock. Since
we're modifying the fs_devices we need to make sure we're holding the
uuid_mutex here, so take that as well.
There's a report from syzbot probably hitting one of the cases where
the bd_mutex and device_list_mutex are taken in the wrong order, however
it's not the device replace case that this patch fixes. As there's no
reproducer available so far, we can't verify the fix.
https://lore.kernel.org/lkml/000000000000fc04d105afcf86d7@google.com/
dashboard link: https://syzkaller.appspot.com/bug?extid=84a0634dc5d21d488419
WARNING: possible circular locking dependency detected
5.9.0-rc5-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.0/6878 is trying to acquire lock:
ffff88804c17d780 (&bdev->bd_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x520 fs/block_dev.c:1804
but task is already holding lock:
ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
btrfs_finish_chunk_alloc+0x281/0xf90 fs/btrfs/volumes.c:5255
btrfs_create_pending_block_groups+0x2f3/0x700 fs/btrfs/block-group.c:2109
__btrfs_end_transaction+0xf5/0x690 fs/btrfs/transaction.c:916
find_free_extent_update_loop fs/btrfs/extent-tree.c:3807 [inline]
find_free_extent+0x23b7/0x2e60 fs/btrfs/extent-tree.c:4127
btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
__extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
do_writepages+0xec/0x290 mm/page-writeback.c:2352
__writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
wb_do_writeback fs/fs-writeback.c:2039 [inline]
wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
-> #3 (sb_internal#2){.+.+}-{0:0}:
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write+0x234/0x470 fs/super.c:1672
sb_start_intwrite include/linux/fs.h:1690 [inline]
start_transaction+0xbe7/0x1170 fs/btrfs/transaction.c:624
find_free_extent_update_loop fs/btrfs/extent-tree.c:3789 [inline]
find_free_extent+0x25e1/0x2e60 fs/btrfs/extent-tree.c:4127
btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
__extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
do_writepages+0xec/0x290 mm/page-writeback.c:2352
__writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
wb_do_writeback fs/fs-writeback.c:2039 [inline]
wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
-> #2 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
__flush_work+0x60e/0xac0 kernel/workqueue.c:3041
wb_shutdown+0x180/0x220 mm/backing-dev.c:355
bdi_unregister+0x174/0x590 mm/backing-dev.c:872
del_gendisk+0x820/0xa10 block/genhd.c:933
loop_remove drivers/block/loop.c:2192 [inline]
loop_control_ioctl drivers/block/loop.c:2291 [inline]
loop_control_ioctl+0x3b1/0x480 drivers/block/loop.c:2257
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (loop_ctl_mutex){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
lo_open+0x19/0xd0 drivers/block/loop.c:1893
__blkdev_get+0x759/0x1aa0 fs/block_dev.c:1507
blkdev_get fs/block_dev.c:1639 [inline]
blkdev_open+0x227/0x300 fs/block_dev.c:1753
do_dentry_open+0x4b9/0x11b0 fs/open.c:817
do_open fs/namei.c:3251 [inline]
path_openat+0x1b9a/0x2730 fs/namei.c:3368
do_filp_open+0x17e/0x3c0 fs/namei.c:3395
do_sys_openat2+0x16d/0x420 fs/open.c:1168
do_sys_open fs/open.c:1184 [inline]
__do_sys_open fs/open.c:1192 [inline]
__se_sys_open fs/open.c:1188 [inline]
__x64_sys_open+0x119/0x1c0 fs/open.c:1188
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&bdev->bd_mutex){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
blkdev_put+0x30/0x520 fs/block_dev.c:1804
btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
generic_shutdown_super+0x144/0x370 fs/super.c:464
kill_anon_super+0x36/0x60 fs/super.c:1108
btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
deactivate_locked_super+0x94/0x160 fs/super.c:335
deactivate_super+0xad/0xd0 fs/super.c:366
cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
task_work_run+0xdd/0x190 kernel/task_work.c:141
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&bdev->bd_mutex --> sb_internal#2 --> &fs_devs->device_list_mutex
Possible unsafe locking scenario:
       CPU0                              CPU1
       ----                              ----
  lock(&fs_devs->device_list_mutex);
                                         lock(sb_internal#2);
                                         lock(&fs_devs->device_list_mutex);
  lock(&bdev->bd_mutex);
*** DEADLOCK ***
3 locks held by syz-executor.0/6878:
#0: ffff88809070c0e0 (&type->s_umount_key#70){++++}-{3:3}, at: deactivate_super+0xa5/0xd0 fs/super.c:365
#1: ffffffff8a5b37a8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_close_devices+0x23/0x1f0 fs/btrfs/volumes.c:1178
#2: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
stack backtrace:
CPU: 0 PID: 6878 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
blkdev_put+0x30/0x520 fs/block_dev.c:1804
btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
generic_shutdown_super+0x144/0x370 fs/super.c:464
kill_anon_super+0x36/0x60 fs/super.c:1108
btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
deactivate_locked_super+0x94/0x160 fs/super.c:335
deactivate_super+0xad/0xd0 fs/super.c:366
cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
task_work_run+0xdd/0x190 kernel/task_work.c:141
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x460027
RSP: 002b:00007fff59216328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000000000076035 RCX: 0000000000460027
RDX: 0000000000403188 RSI: 0000000000000002 RDI: 00007fff592163d0
RBP: 0000000000000333 R08: 0000000000000000 R09: 000000000000000b
R10: 0000000000000005 R11: 0000000000000246 R12: 00007fff59217460
R13: 0000000002df2a60 R14: 0000000000000000 R15: 00007fff59217460
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add syzbot reference ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Nathan popped up on #xfs and pointed out that we fail to handle
finobt btree blocks in xlog_recover_get_buf_lsn(). This means they
always fall through the entire magic number matching code to "recover
immediately". Whilst most of the time this is the correct behaviour,
occasionally it will be incorrect and could potentially overwrite
more recent metadata because we don't check the LSN in the on disk
metadata at all.
This bug has been present since the finobt was first introduced, and
is a potential cause of the occasional xfs_iget_check_free_state()
failures we see that indicate that the inode btree state does not
match the on disk inode state.
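The fix is a matter of adding the finobt magic numbers to the existing
inobt cases, roughly (a sketch; exact case grouping is an assumption):

    /* in xlog_recover_get_buf_lsn(): recognise finobt blocks too */
    case XFS_IBT_CRC_MAGIC:
    case XFS_IBT_MAGIC:
    case XFS_FIBT_CRC_MAGIC:   /* newly handled */
    case XFS_FIBT_MAGIC:       /* newly handled */
        lsn = be64_to_cpu(((struct xfs_btree_block *)blk)->bb_u.s.bb_lsn);
        break;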
Fixes: aafc3c246529 ("xfs: support the XFS_BTNUM_FINOBT free inode btree type")
Reported-by: Nathan Scott <nathans@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
autofs got broken in some configurations by commit 13c164b1a186
("autofs: switch to kernel_write") because there is now an extra LSM
permission check done by security_file_permission() in rw_verify_area().
autofs is one if the few places that really does want the much more
limited __kernel_write(), because the write is an internal kernel one
that shouldn't do any user permission checks (it also doesn't need the
file_start_write/file_end_write logic, since it's just a pipe).
There are a couple of other cases like that - accounting, core dumping,
and splice - but autofs stands out because it can be built as a module.
As a result, we need to export this internal __kernel_write() function
again.
We really don't want any other module to use this, but we don't have a
"EXPORT_SYMBOL_FOR_AUTOFS_ONLY()". But we can mark it GPL-only to at
least approximate that "internal use only" for licensing.
While in this area, make autofs pass in NULL for the file position
pointer, since it's always a pipe, and we now use a NULL file pointer
for streaming file descriptors (see file_ppos() and commit 438ab720c675:
"vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files")
This effectively reverts commits 9db977522449 ("fs: unexport
__kernel_write") and 13c164b1a186 ("autofs: switch to kernel_write").
Fixes: 13c164b1a186 ("autofs: switch to kernel_write")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch reworks the current receive handling of dlm. While trying to
change the send handling to fix reorder issues, I took a look at the
receive handling and simplified it. It works as follows:
Each connection has a preallocated receive buffer with a minimum length of
4096. On receive, the upper layer protocol will process all dlm messages
until there is not enough data anymore. If there is "leftover" data at
the end of the receive buffer because a dlm message wasn't fully received,
it is copied to the beginning of the preallocated receive buffer. On the
next receive, more data is appended to the previous "leftover" data and
processing begins again.
This removes a lot of code from the current mechanism. Inside the
processing functionality we ensure with a memmove() that each dlm
message is memory aligned. Having a dlm message always start at the
beginning of the buffer avoids some memmove() work, because the src and
dest pointers are then the same.
The cluster attribute "buffer_size" takes on a new meaning: it is now the
size of the application layer receive buffer. If it is changed at
runtime, the receive buffer is reallocated. It is important that the
receive buffer is at least as large as the maximum possible dlm message
size, otherwise a received message cannot fit into the receive buffer.
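A hedged sketch of the receive path described above (names are
illustrative, and a userspace-style recv() stands in for the kernel
socket call; dlm_process_incoming_buffer() returns bytes consumed):

    static void receive_from_sock(struct connection *con)
    {
        int received, processed;

        /* append new data after any leftover from the previous receive */
        received = recv(con->sock, con->rx_buf + con->rx_leftover,
                        con->rx_buflen - con->rx_leftover, MSG_DONTWAIT);
        if (received <= 0)
            return;

        /* process all complete dlm messages in the buffer */
        processed = dlm_process_incoming_buffer(con->nodeid, con->rx_buf,
                                                con->rx_leftover + received);
        if (processed < 0)
            return;

        /* copy the partial trailing message ("leftover") to the start,
           so the next message again begins at the buffer's aligned base */
        con->rx_leftover += received - processed;
        if (con->rx_leftover > 0)
            memmove(con->rx_buf, con->rx_buf + processed,
                    con->rx_leftover);
    }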
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|