Bring in the fixes for __pollwait() and waitqueue_active() interactions.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that poll_wait() provides a full barrier we can remove smp_rmb() from
io_uring_poll().
In fact I don't think smp_rmb() was correct; it can't order LOADs
against STOREs.
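As a sketch, the resulting handler (abbreviated; CQ-side readiness
checks elided):

    static __poll_t io_uring_poll(struct file *file, poll_table *wait)
    {
        struct io_ring_ctx *ctx = file->private_data;
        __poll_t mask = 0;

        /*
         * poll_wait() now implies a full barrier, pairing with the
         * waitqueue_active() checks on the wakeup side, so no explicit
         * smp_rmb() is needed before inspecting the rings.
         */
        poll_wait(file, &ctx->poll_wq, wait);

        if (!io_sqring_full(ctx))
            mask |= EPOLLOUT | EPOLLWRNORM;
        /* ... CQ readiness checks elided ... */
        return mask;
    }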
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250107162730.GA18940@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Merge tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes.
Besides the one-liners in Btrfs, there's a fix to the io_uring and
encoded read integration (added in this development cycle). The update
to io_uring provides more space for the ongoing command, which is then
used in Btrfs to handle some cases.
- io_uring and encoded read:
  - provide stable storage for io_uring command data
  - make a copy of the encoded read ioctl call and reuse it in case
    the call would block and is called again
- properly initialize zlib context for hardware compression on s390
- fix max extent size calculation on filesystems with non-zoned
devices
- fix crash in scrub on crafted image due to invalid extent tree"
* tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zlib: fix avail_in bytes for s390 zlib HW compression path
btrfs: zoned: calculate max_extent_size properly on non-zoned setup
btrfs: avoid NULL pointer dereference if no valid extent tree
btrfs: don't read from userspace twice in btrfs_uring_encoded_read()
io_uring: add io_uring_cmd_get_async_data helper
io_uring/cmd: add per-op data to struct io_uring_cmd_data
io_uring/cmd: rename struct uring_cache to io_uring_cmd_data
io_eventfd_do_signal() is invoked from an RCU callback, but when
dropping the reference to the io_ev_fd, it calls io_eventfd_free()
directly if the refcount drops to zero. This isn't correct, as any
potential freeing of the io_ev_fd should be deferred another RCU grace
period.
Just call io_eventfd_put() rather than open-code the dec-and-test and
free, which will correctly defer it another RCU grace period.
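A sketch of the resulting callback (shapes taken from
io_uring/eventfd.c, abbreviated):

    static void io_eventfd_do_signal(struct rcu_head *rcu)
    {
        struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);

        eventfd_signal_mask(ev_fd->cq_ev_fd, EPOLL_URING_WAKE);

        /*
         * Previously: open-coded refcount_dec_and_test() plus a direct
         * io_eventfd_free(). io_eventfd_put() instead uses call_rcu(),
         * deferring the free by another grace period.
         */
        io_eventfd_put(ev_fd);
    }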
Fixes: 21a091b970cd ("io_uring: signal registered eventfd to process deferred task work")
Reported-by: Jann Horn <jannh@google.com>
Cc: stable@vger.kernel.org
Tested-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we kill a ring and then immediately exit the task, we'll get
cancellation run both by the exiting task and by a kthread in
io_ring_exit_work. For DEFER_TASKRUN we do want to limit it to only one
entity executing it; however, it's currently not an issue as it's
protected by uring_lock. Silence lockdep assertions for now; we'll
return to it later.
Reported-by: syzbot+1bcb75613069ad4957fc@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e5f68281acb0f081f65fde435833c68a3b7e02f.1736257837.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case an op handler for ->uring_cmd() needs stable storage for user
data, it can allocate io_uring_cmd_data->op_data and use it for the
duration of the request. When the request gets cleaned up, uring_cmd
will free it automatically.
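A hedged sketch of an op handler using this; struct foo_args and the
handler are hypothetical, only io_uring_cmd_get_async_data() and
->op_data come from this series:

    struct foo_args {
        __u64 addr;
        __u32 len;
    };

    static int foo_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
    {
        struct io_uring_cmd_data *data = io_uring_cmd_get_async_data(cmd);

        if (!data->op_data) {
            /* stable copy of the user arguments, survives -EAGAIN retries */
            data->op_data = kmalloc(sizeof(struct foo_args), GFP_KERNEL);
            if (!data->op_data)
                return -ENOMEM;
            if (copy_from_user(data->op_data,
                               u64_to_user_ptr(READ_ONCE(cmd->sqe->addr)),
                               sizeof(struct foo_args)))
                return -EFAULT;
        }
        /* issue using data->op_data; uring_cmd frees it at cleanup time */
        return 0;
    }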
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for making this more generically available for
->uring_cmd() usage that needs stable command data, rename it and move
it to io_uring/cmd.h instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Sterba <dsterba@suse.com>
After an update, only the first shot of a multishot timeout request
adheres to the new timeout value; all subsequent retries continue to
use the old value. Don't forget to update the timeout stored in struct
the old value. Don't forget to update the timeout stored in struct
io_timeout_data.
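The gist of the fix, sketched inside the update path (field names from
struct io_timeout_data; surrounding code elided):

    /* also refresh the stored value that multishot repeats re-arm
     * from, not just the hrtimer being restarted for the first shot */
    data->ts = *ts;
    hrtimer_start(&data->timer, timespec64_to_ktime(*ts), data->mode);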
Cc: stable@vger.kernel.org
Fixes: ea97f6c8558e8 ("io_uring: add support for multishot timeouts")
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e6516c3304eb654ec234cfa65c88a9579861e597.1736015288.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As we don't use iov_iter_advance() but our own logic in
io_import_fixed(), we can remove the logic that over-sets the iter's
count to len + offset and then adjusts it back down to len. This makes
the code cleaner.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://lore.kernel.org/r/20250103150412.12549-1-minhquangbui99@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For non-pollable files, buffer ring consumption will commit upfront.
This is fine, but io_ring_buffer_select() will return the address of the
buffer after having committed it. For incrementally consumed buffers,
this is incorrect as it will modify the buffer address.
Store the pre-committed value and return that. If that isn't done,
then the initial part of the buffer goes unused, and the application
will rightly assume the content arrived at the start of the userspace
buffer while the kernel will have placed it later in the buffer. Or it
can cause a spurious -EFAULT returned in the CQE, depending on the
buffer size. As bounds are suitably checked for doing the actual IO,
no adverse side effects are possible - it's just a data misplacement
within the existing buffer.
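An abbreviated sketch of the fix in io_ring_buffer_select() (variable
names illustrative):

    /* save the address before committing: for incrementally consumed
     * buffers, the commit advances buf->addr */
    void __user *ret = u64_to_user_ptr(buf->addr);

    if (*len == 0 || *len > buf->len)
        *len = buf->len;
    /* non-pollable file: commit upfront, then return the saved address */
    io_kbuf_commit(req, bl, *len, 1);
    return ret;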
Reported-by: Gwendal Fernet <gwendalfernet@gmail.com>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reports that ->msg_inq may get used uninitialized from the
following path:
BUG: KMSAN: uninit-value in io_recv_buf_select io_uring/net.c:1094 [inline]
BUG: KMSAN: uninit-value in io_recv+0x930/0x1f90 io_uring/net.c:1158
io_recv_buf_select io_uring/net.c:1094 [inline]
io_recv+0x930/0x1f90 io_uring/net.c:1158
io_issue_sqe+0x420/0x2130 io_uring/io_uring.c:1740
io_queue_sqe io_uring/io_uring.c:1950 [inline]
io_req_task_submit+0xfa/0x1d0 io_uring/io_uring.c:1374
io_handle_tw_list+0x55f/0x5c0 io_uring/io_uring.c:1057
tctx_task_work_run+0x109/0x3e0 io_uring/io_uring.c:1121
tctx_task_work+0x6d/0xc0 io_uring/io_uring.c:1139
task_work_run+0x268/0x310 kernel/task_work.c:239
io_run_task_work+0x43a/0x4a0 io_uring/io_uring.h:343
io_cqring_wait io_uring/io_uring.c:2527 [inline]
__do_sys_io_uring_enter io_uring/io_uring.c:3439 [inline]
__se_sys_io_uring_enter+0x204f/0x4ce0 io_uring/io_uring.c:3330
__x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3330
x64_sys_call+0xce5/0x3c30 arch/x86/include/generated/asm/syscalls_64.h:427
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
and it is correct, as it's never initialized upfront. Hence the first
submission can end up using it uninitialized, if the recv wasn't
successful and the networking stack didn't honor ->msg_get_inq being set
and filling in the output value of ->msg_inq as requested.
Set it to 0 upfront when it's allocated, just to silence this KMSAN
warning. There's no side effect to using it uninitialized; it'll just
potentially cause the next receive to use a recv value hint that's not
accurate.
Fixes: c6f32c7d9e09 ("io_uring/net: get rid of ->prep_async() for receive side")
Reported-by: syzbot+068ff190354d2f74892f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is not the hot path; it's a slow path. Yet the locking for it
sits in the hot path, and __cold does not prevent it from being
inlined. Move the locking into the function itself, and mark it
noinline as well to avoid it polluting the icache of the hot path.
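Illustrative shape of the change (function and lock names
hypothetical):

    /* __cold only hints at placement; it does not stop the compiler
     * from inlining a small function into its hot-path caller. */
    static noinline __cold void io_slow_path(struct io_ring_ctx *ctx)
    {
        /* the lock is now taken here, in the slow path itself */
        spin_lock(&ctx->completion_lock);
        /* ... slow-path handling ... */
        spin_unlock(&ctx->completion_lock);
    }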
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reports that a recent fix causes nesting issues between the
(now) raw timeout lock and the eventfd locking:
=============================
[ BUG: Invalid wait context ]
6.13.0-rc4-00080-g9828a4c0901f #29 Not tainted
-----------------------------
kworker/u32:0/68094 is trying to lock:
ffff000014d7a520 (&ctx->wqh#2){..-.}-{3:3}, at: eventfd_signal_mask+0x64/0x180
other info that might help us debug this:
context-{5:5}
6 locks held by kworker/u32:0/68094:
#0: ffff0000c1d98148 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work+0x4e8/0xfc0
#1: ffff80008d927c78 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x53c/0xfc0
#2: ffff0000c59bc3d8 (&ctx->completion_lock){+.+.}-{3:3}, at: io_kill_timeouts+0x40/0x180
#3: ffff0000c59bc358 (&ctx->timeout_lock){-.-.}-{2:2}, at: io_kill_timeouts+0x48/0x180
#4: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
#5: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
stack backtrace:
CPU: 7 UID: 0 PID: 68094 Comm: kworker/u32:0 Not tainted 6.13.0-rc4-00080-g9828a4c0901f #29
Hardware name: linux,dummy-virt (DT)
Workqueue: iou_exit io_ring_exit_work
Call trace:
show_stack+0x1c/0x30 (C)
__dump_stack+0x24/0x30
dump_stack_lvl+0x60/0x80
dump_stack+0x14/0x20
__lock_acquire+0x19f8/0x60c8
lock_acquire+0x1a4/0x540
_raw_spin_lock_irqsave+0x90/0xd0
eventfd_signal_mask+0x64/0x180
io_eventfd_signal+0x64/0x108
io_req_local_work_add+0x294/0x430
__io_req_task_work_add+0x1c0/0x270
io_kill_timeout+0x1f0/0x288
io_kill_timeouts+0xd4/0x180
io_uring_try_cancel_requests+0x2e8/0x388
io_ring_exit_work+0x150/0x550
process_one_work+0x5e8/0xfc0
worker_thread+0x7ec/0xc80
kthread+0x24c/0x300
ret_from_fork+0x10/0x20
because after the preempt-rt fix for the timeout lock nesting inside
the io-wq lock, we now have the eventfd spinlock nesting inside the
raw timeout spinlock.
Rather than play whack-a-mole with other nesting on the timeout lock,
split the deletion and killing of timeouts so queueing the task_work
for the timeout cancelations can get done outside of the timeout lock.
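A hedged sketch of the two-pass shape (types and helpers abbreviated,
task-matching logic elided; not the literal diff):

    struct io_timeout *timeout, *tmp;
    LIST_HEAD(list);

    /* pass 1: only detach timeouts while holding the raw lock */
    raw_spin_lock_irq(&ctx->timeout_lock);
    list_for_each_entry_safe(timeout, tmp, &ctx->timeout_list, list)
        list_move(&timeout->list, &list);
    raw_spin_unlock_irq(&ctx->timeout_lock);

    /* pass 2: queue cancelation task_work with the lock dropped, so
     * signaling an eventfd can take its own non-raw spinlock */
    list_for_each_entry_safe(timeout, tmp, &list, list) {
        list_del_init(&timeout->list);
        io_kill_timeout(cmd_to_io_kiocb(timeout), -ECANCELED);
    }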
Reported-by: syzbot+b1fc199a40b65d601b65@syzkaller.appspotmail.com
Fixes: 020b40f35624 ("io_uring: make ctx->timeout_lock a raw spinlock")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The io-wq path can downgrade a multishot request to oneshot mode,
however io_read_mshot() doesn't handle that and would still post
multiple CQEs. That's not allowed, because io_req_post_cqe() has
stricter context requirements.
The described issue can only happen with pollable files that don't
support FMODE_NOWAIT, which is an odd combination, so even if allowed
it should be fairly rare.
Cc: stable@vger.kernel.org
Reported-by: chase xd <sl1589472800@gmail.com>
Fixes: bee1d5becdf5b ("io_uring: disable io-wq execution of multishot NOWAIT requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c5c8c4a50a882fd581257b81bf52eee260ac29fd.1735407848.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit mistakenly moved the clearing of the in-progress byte
count into the section that's dependent on having a cached iovec or not,
but it should be cleared for any IO. If not, then extra bytes may be
added at IO completion time, causing potentially weird behavior like
over-reporting the amount of IO done.
Fixes: d7f11616edf5 ("io_uring/rw: Allocate async data through helper")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202412271132.a09c3500-lkp@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's a pointer, so don't use 0 for it. sparse throws a warning for
that, as the kernel test robot noticed.
Fixes: d7f11616edf5 ("io_uring/rw: Allocate async data through helper")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412180253.YML3qN4d-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit changed overwriting kiocb->ki_flags with
->f_iocb_flags to masking it in instead. This breaks retry situations,
where we don't necessarily want to retain previously set flags, like
IOCB_NOWAIT.
The use case needs IOCB_HAS_METADATA to be persistent, but the change
makes all flags persistent, which is an issue. Add a request flag to
track whether the request has metadata or not, as that is persistent
across issues.
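The idea, sketched (REQ_F_HAS_METADATA is the flag added here;
surrounding code abbreviated):

    /* at prep time, when PI metadata is attached to the request: */
    req->flags |= REQ_F_HAS_METADATA;

    /* at (re)issue time, ki_flags is rebuilt from ->f_iocb_flags, so
     * transient flags like IOCB_NOWAIT are not retained, while the
     * metadata bit is re-applied from the persistent request flag: */
    rw->kiocb.ki_flags = req->file->f_iocb_flags;
    if (req->flags & REQ_F_HAS_METADATA)
        rw->kiocb.ki_flags |= IOCB_HAS_METADATA;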
Fixes: 59a7d12a7fb5 ("io_uring: introduce attributes for read/write and PI support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kfree can handle slab objects nowadays. Drop the extra callback and just
use kfree.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-10-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are two remaining uses of the old async data allocator that do
not rely on the alloc cache. I don't want to make them use the new
allocator helper because that would require an if (cache) check, which
would result in dead code for the cached case (for callers passing a
cache, gcc can't prove the cache isn't NULL and will therefore preserve
the check). Since this is an inline function and just a few lines long,
keep a second helper to deal with cases where we don't have an async
data cache.
No functional change intended here. This is just moving the helper
around and making it inline.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-9-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This abstracts away the cache details.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-8-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This abstracts away the cache details.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-7-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This abstracts away the cache details and simplifies the code.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-6-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This abstracts away the cache details to simplify the code.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-5-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of open-coding the allocation, use the generic alloc_cache
helper.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-4-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This helper replaces io_alloc_async_data by using the folded allocation.
Do it in a header to allow the compiler to decide whether to inline.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-3-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BUG: KASAN: slab-use-after-free in __lock_acquire+0x370b/0x4a10 kernel/locking/lockdep.c:5089
Call Trace:
<TASK>
...
_raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
try_to_wake_up+0xb5/0x23c0 kernel/sched/core.c:4205
io_sq_thread_park+0xac/0xe0 io_uring/sqpoll.c:55
io_sq_thread_finish+0x6b/0x310 io_uring/sqpoll.c:96
io_sq_offload_create+0x162/0x11d0 io_uring/sqpoll.c:497
io_uring_create io_uring/io_uring.c:3724 [inline]
io_uring_setup+0x1728/0x3230 io_uring/io_uring.c:3806
...
Kun Hu reports that the SQPOLL creation error path has a UAF, which
happens if io_uring_alloc_task_context() fails and then io_sq_thread()
manages to run and complete before the rest of the error handling code,
which means io_sq_thread_finish() is looking at an already killed task.
Note that this is mostly theoretical, requiring fault injection on
the allocation side to trigger in practice.
Cc: stable@vger.kernel.org
Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0f2f1aa5729332612bd01fe0f2f385fd1f06ce7c.1735231717.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The allocation paths that use alloc_cache duplicate the same code
pattern, sometimes in a quite convoluted way. Fold the allocation into
the cache code itself, making it just an allocator function, and keeping
the cache policy invisible to callers. Another justification for doing
this, beyond code simplicity, is that it makes it trivial to test the
impact of disabling the cache and using slab directly, which I've used
for slab improvement experiments.
One relevant detail is that we provide a callback to optionally
initialize memory only when we actually reach slab. This allows us to
avoid blindly executing the allocation with GFP_ZERO and only clean
fields when they matter.
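A sketch of the folded allocator (field and helper names as in
io_uring/alloc_cache.h, slightly abbreviated):

    static inline void *io_cache_alloc(struct io_alloc_cache *cache, gfp_t gfp,
                                       void (*init_once)(void *obj))
    {
        if (unlikely(!cache->nr_cached)) {
            void *obj = kmalloc(cache->elem_size, gfp);

            /* only a fresh slab object needs initializing; cached
             * objects keep their cleaned-up state from release time */
            if (obj && init_once)
                init_once(obj);
            return obj;
        }
        return io_alloc_cache_get(cache);
    }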
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With *ENTER_EXT_ARG_REG, instead of passing a user pointer with
arguments for the waiting loop, the user can specify an offset into a
pre-mapped region of memory, in which case the range
[offset, offset + sizeof(io_uring_reg_wait)) will be interpreted as
the argument.
As we address a kernel array using a user-given index, it would be
subject to speculation-type exploits. Use array_index_nospec() to
prevent that. Make sure to pass not the full region size but to
truncate by the maximum offset allowed, considering the structure size.
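A hedged sketch of the pattern (helper name illustrative;
array_index_nospec() from <linux/nospec.h> clamps the index even under
speculation):

    static struct io_uring_reg_wait *io_reg_wait_ptr(void *base, size_t size,
                                                     size_t offset)
    {
        /* truncate by the max offset a whole struct still fits at */
        size_t max_off = size - sizeof(struct io_uring_reg_wait);

        if (offset > max_off)
            return ERR_PTR(-EFAULT);
        /* no speculative out-of-bounds access via the user index */
        offset = array_index_nospec(offset, max_off + 1);
        return base + offset;
    }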
Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments")
Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e3d9da7c43d619de7bcf41d1cd277ab2688c443.1733694126.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When io_check_coalesce_buffer() meets a single page buffer it bails
out and reports that the buffer can be coalesced. That's fine for
registered buffers as io_coalesce_buffer() wouldn't change anything,
but the region code now uses the function to decide whether to vmap
the buffer or not.
Report that a single page buffer is trivially coalescable and let
io_sqe_buffer_register() filter them.
Fixes: c4d0ac1c1567 ("io_uring/memmap: optimise single folio regions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cb83e053f318857068447d40c95becebcd8aeced.1733689833.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove unnecessary call to iov_iter_save_state() in io_prep_rw_setup()
as io_import_iovec() already does this. Then the result from
io_import_iovec() can be returned directly.
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Tested-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20241207004144.783631-1-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shifting reg.bgid << IORING_OFF_PBUF_SHIFT results in a promotion
from __u16 to a 32 bit signed integer, this is then sign extended
to a 64 bit unsigned long on 64 bit architectures. If reg.bgid is
greater than 0x7fff then this leads to a sign extended result where
all the upper 32 bits of mmap_offset are set to 1. Fix this by
casting reg.bgid to the same type as mmap_offset before performing
the shift.
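The promotion is easy to reproduce in userspace; a standalone demo
(IORING_OFF_PBUF_SHIFT is 16 in the uapi header):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t bgid = 0x8000;    /* any bgid > 0x7fff triggers it */

        /* bgid promotes to signed int before the shift; bit 31 ends
         * up set, so widening to 64 bits sign-extends the value */
        unsigned long bad  = bgid << 16;
        /* casting to the destination type first avoids the promotion */
        unsigned long good = (unsigned long)bgid << 16;

        printf("bad:  %#lx\n", bad);    /* 0xffffffff80000000 on LP64 */
        printf("good: %#lx\n", good);   /* 0x80000000 */
        return 0;
    }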
Fixes: ef62de3c4ad5 ("io_uring/kbuf: use region api for pbuf rings")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20241204153923.401674-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the ability to pass additional attributes along with read/write.
The application can prepare attribute-specific information and pass its
address using the SQE field:
    __u64 attr_ptr;
along with setting a mask indicating which attributes are being passed:
    __u64 attr_type_mask;
Overall, 64 attributes are allowed and currently one attribute,
'IORING_RW_ATTR_FLAG_PI', is supported.
With the PI attribute, userspace can pass the following information:
- flags: integrity check flags IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG}
- len: length of the PI/metadata buffer
- addr: address of the metadata buffer
- seed: seed value for reftag remapping
- app_tag: application-defined 16-bit value
Process this information to prepare a uio_meta descriptor and pass it
down using kiocb->private.
The PI attribute is supported only for direct IO.
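A hedged userspace sketch of attaching a PI attribute to a write (uapi
names as introduced by this series; fd, buffers, seed, and the liburing
prep call are assumed context):

    struct io_uring_attr_pi pi = {
        .flags   = IO_INTEGRITY_CHK_GUARD | IO_INTEGRITY_CHK_REFTAG,
        .addr    = (__u64)(uintptr_t)pi_buf,    /* metadata buffer */
        .len     = pi_buf_len,                  /* its length */
        .seed    = reftag_seed,                 /* reftag remapping */
        .app_tag = 0,                           /* app-defined 16 bits */
    };
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_write(sqe, fd, buf, len, offset);
    sqe->attr_type_mask = IORING_RW_ATTR_FLAG_PI;   /* which attrs follow */
    sqe->attr_ptr = (__u64)(uintptr_t)&pi;          /* where they live */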
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20241128112240.8867-7-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All mapped memory is now backed by regions, and we can unify and clean
up io_region_validate_mmap() and io_uring_mmap(). Extract a function
that looks up a region; the rest of the handling should be generic and
just needs the region.
There is one more piece of ring-type-specific code, i.e. the mmap size
truncation quirk for IORING_OFF_[S,C]Q_RING, which is left as is.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f5e1eda1562bfd34276de07465525ae5f10e1e84.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert internal parts of the provided buffer ring management to the
region API. It's the last non-region mapped ring we have, so it also
kills a bunch of now unused memmap.c helpers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6c40cf7beaa648558acd4d84bc0fb3279a35d74b.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct io_buffer_list refcounting was needed for RCU based sync with
mmap, now we can kill it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4a9cc54bf0077bb2bf2f3daf917549ddd41080da.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A preparation / cleanup patch simplifying the buf ring - mmap
synchronisation. Instead of relying on RCU, which is trickier, do it by
grabbing the mmap_lock whenever anyone tries to publish or remove a
registered buffer to / from ->io_bl_xa.
Modifications of the xarray should always be protected by both
->uring_lock and ->mmap_lock, while lookups should hold either of them.
While a struct io_buffer_list is in the xarray, the mmap related fields
like ->flags and ->buf_pages should stay stable.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/af13bde56ee1a26bcaefaa9aad37a9ea318a590e.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert internal parts of the CQ/SQ array management to the region API.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46fc3c801290d6b1ac16023d78f6b8e685c87fd6.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert internal parts of the SQ management to the region API.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1fb73ced6b835cb319ab0fe1dc0b2e982a9a5650.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A preparation patch: pass the context to io_register_free_rings().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c1865fd2b3d4db22d1a1aac7dd06ea22cb990834.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The patch implements mmap for the param region and enables the kernel
allocation mode. Internally it uses a fixed mmap offset, however the
user has to use the offset returned in
struct io_uring_region_desc::mmap_offset.
Note, mmap doesn't and can't take ->uring_lock and the region / ring
lookup is protected by ->mmap_lock, and it's directly peeking at
ctx->param_region. We can't protect io_create_region() with the
mmap_lock as it'd deadlock, which is why io_create_region_mmap_safe()
initialises it for us in a temporary variable and then publishes it
with the lock taken. It's intentionally decoupled from main region
helpers, and in the future we might want to have a list of active
regions, which then could be protected by the ->mmap_lock.
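A sketch of the publish-under-lock pattern (shapes from this series,
abbreviated):

    int io_create_region_mmap_safe(struct io_ring_ctx *ctx,
                                   struct io_mapped_region *region,
                                   struct io_uring_region_desc *reg,
                                   unsigned long mmap_offset)
    {
        struct io_mapped_region tmp_region = {};
        int ret;

        /* build the region without holding the mmap_lock... */
        ret = io_create_region(ctx, &tmp_region, reg, mmap_offset);
        if (ret)
            return ret;

        /* ...and only publish it to mmap lookups under the lock */
        guard(mutex)(&ctx->mmap_lock);
        *region = tmp_region;
        return 0;
    }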
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0f1212bd6af7fb39b63514b34fae8948014221d1.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow the kernel to allocate memory for a region. That's the classical
way SQ/CQ are allocated. It's not yet useful to user space as there
is no way to mmap it, which is why it's explicitly disabled in
io_register_mem_region().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7b8c40e6542546bbf93f4842a9a42a7373b81e0d.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kernel-allocated compound pages will have just one reference for the
entire page array; add a flag telling io_free_region() about that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a7abfa7535e9728d5fcade29a1ea1605ec2c04ce.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for adding kernel-allocated regions, extract a new
helper that pins user pages.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a17d7c39c3de4266b66b75b2dcf768150e1fc618.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't need to vmap if memory is already physically contiguous. There
are two important cases it covers: PAGE_SIZE regions and huge pages.
Use io_check_coalesce_buffer() to get the number of contiguous folios.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5240af23064a824c29d14d2406f1ae764bf4505.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Regions are going to become more complex with allocation options and
optimisations, so I want to split initialisation into steps, and for
that it needs a sane failure path. Reuse io_free_region(): it's smart
enough to undo only what's needed and leaves the structure in a
consistent state.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b853b4ec407cc80d033d021bdd2c14e22378fc78.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move memory accounting before page pinning. It shouldn't even try to
pin pages if it's not allowed, and accounting is also relatively
inexpensive. It also gives a better code structure, as we do the
generic accounting first and can then branch off for the different
mapping types.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e242b8038411a222e8b269d35e021fa5015289f.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for kernel-allocated regions, add a flag telling whether
the region contains user-pinned pages or not.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0dc91564642654405bab080b7ec911cb4a43ec6e.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add internal flags for struct io_mapped_region. The first flag we need
is IO_REGION_F_VMAPPED, that indicates that the pointer has to be
unmapped on region destruction. For now all regions are vmap'ed, so it's
set unconditionally.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5a3d8046a038da97c0f8a8c8f1733fa3fc689d31.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_try_coalesce_buffer() is a useful helper collecting info about a
set of pages, and I want to reuse it for analysing ring/etc. mappings.
I don't need the entire thing and am only interested in whether the
buffer can be coalesced into a single page, but that's still better
than duplicating the parsing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/353b447953cd5d34c454a7d909bb6024c391d6e2.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->resize_lock is used for resizing rings, but it's a good idea to
reuse it in other cases as well. Rename it to mmap_lock, as it protects
against races with mmap.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/68f705306f3ac4d2fb999eb80ea1615015ce9f7f.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|