Bring in the fixes for __pollwait() and waitqueue_active() interactions.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that poll_wait() provides a full barrier we can remove smp_rmb() from
io_uring_poll().
In fact I don't think smp_rmb() was correct; it can't order LOADs
against STOREs.
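As a sketch, the resulting handler (abbreviated; CQ-side readiness
checks elided):

    static __poll_t io_uring_poll(struct file *file, poll_table *wait)
    {
        struct io_ring_ctx *ctx = file->private_data;
        __poll_t mask = 0;

        /*
         * poll_wait() now implies a full barrier, pairing with the
         * waitqueue_active() checks on the wakeup side, so no explicit
         * smp_rmb() is needed before inspecting the rings.
         */
        poll_wait(file, &ctx->poll_wq, wait);

        if (!io_sqring_full(ctx))
            mask |= EPOLLOUT | EPOLLWRNORM;
        /* ... CQ readiness checks elided ... */
        return mask;
    }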
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250107162730.GA18940@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Merge tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes.
Besides the one-liners in Btrfs, there's a fix to the io_uring and
encoded read integration (added in this development cycle). The update
to io_uring provides more space for the ongoing command, which is then
used in Btrfs to handle some cases.
- io_uring and encoded read:
  - provide stable storage for io_uring command data
  - make a copy of the encoded read ioctl call and reuse it in case
    the call would block and is called again
- properly initialize zlib context for hardware compression on s390
- fix max extent size calculation on filesystems with non-zoned
devices
- fix crash in scrub on crafted image due to invalid extent tree"
* tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zlib: fix avail_in bytes for s390 zlib HW compression path
btrfs: zoned: calculate max_extent_size properly on non-zoned setup
btrfs: avoid NULL pointer dereference if no valid extent tree
btrfs: don't read from userspace twice in btrfs_uring_encoded_read()
io_uring: add io_uring_cmd_get_async_data helper
io_uring/cmd: add per-op data to struct io_uring_cmd_data
io_uring/cmd: rename struct uring_cache to io_uring_cmd_data
io_eventfd_do_signal() is invoked from an RCU callback, but when
dropping the reference to the io_ev_fd, it calls io_eventfd_free()
directly if the refcount drops to zero. This isn't correct, as any
potential freeing of the io_ev_fd should be deferred another RCU grace
period.
Just call io_eventfd_put() rather than open-code the dec-and-test and
free, which will correctly defer it another RCU grace period.
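A sketch of the resulting callback (shapes taken from
io_uring/eventfd.c, abbreviated):

    static void io_eventfd_do_signal(struct rcu_head *rcu)
    {
        struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);

        eventfd_signal_mask(ev_fd->cq_ev_fd, EPOLL_URING_WAKE);

        /*
         * Previously: open-coded refcount_dec_and_test() plus a direct
         * io_eventfd_free(). io_eventfd_put() instead uses call_rcu(),
         * deferring the free by another grace period.
         */
        io_eventfd_put(ev_fd);
    }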
Fixes: 21a091b970cd ("io_uring: signal registered eventfd to process deferred task work")
Reported-by: Jann Horn <jannh@google.com>
Cc: stable@vger.kernel.org
Tested-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we kill a ring and then immediately exit the task, we'll get
cancellation run both by the exiting task and by a kthread in
io_ring_exit_work. For DEFER_TASKRUN we do want to limit it to only one
entity executing it; however, it's currently not an issue as it's
protected by uring_lock. Silence lockdep assertions for now; we'll
return to it later.
Reported-by: syzbot+1bcb75613069ad4957fc@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e5f68281acb0f081f65fde435833c68a3b7e02f.1736257837.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case an op handler for ->uring_cmd() needs stable storage for user
data, it can allocate io_uring_cmd_data->op_data and use it for the
duration of the request. When the request gets cleaned up, uring_cmd
will free it automatically.
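A hedged sketch of an op handler using this; struct foo_args and the
handler are hypothetical, only io_uring_cmd_get_async_data() and
->op_data come from this series:

    struct foo_args {
        __u64 addr;
        __u32 len;
    };

    static int foo_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
    {
        struct io_uring_cmd_data *data = io_uring_cmd_get_async_data(cmd);

        if (!data->op_data) {
            /* stable copy of the user arguments, survives -EAGAIN retries */
            data->op_data = kmalloc(sizeof(struct foo_args), GFP_KERNEL);
            if (!data->op_data)
                return -ENOMEM;
            if (copy_from_user(data->op_data,
                               u64_to_user_ptr(READ_ONCE(cmd->sqe->addr)),
                               sizeof(struct foo_args)))
                return -EFAULT;
        }
        /* issue using data->op_data; uring_cmd frees it at cleanup time */
        return 0;
    }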
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for making this more generically available for
->uring_cmd() usage that needs stable command data, rename it and move
it to io_uring/cmd.h instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Sterba <dsterba@suse.com>
After an update, only the first shot of a multishot timeout request
adheres to the new timeout value; all subsequent retries continue to
use the old value. Don't forget to update the timeout stored in struct
the old value. Don't forget to update the timeout stored in struct
io_timeout_data.
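The gist of the fix, sketched inside the update path (field names from
struct io_timeout_data; surrounding code elided):

    /* also refresh the stored value that multishot repeats re-arm
     * from, not just the hrtimer being restarted for the first shot */
    data->ts = *ts;
    hrtimer_start(&data->timer, timespec64_to_ktime(*ts), data->mode);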
Cc: stable@vger.kernel.org
Fixes: ea97f6c8558e8 ("io_uring: add support for multishot timeouts")
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e6516c3304eb654ec234cfa65c88a9579861e597.1736015288.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As we don't use iov_iter_advance() but our own logic in
io_import_fixed(), we can remove the logic that over-sets the iter's
count to len + offset and then adjusts it back down to len. This makes
the code cleaner.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://lore.kernel.org/r/20250103150412.12549-1-minhquangbui99@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For non-pollable files, buffer ring consumption will commit upfront.
This is fine, but io_ring_buffer_select() will return the address of the
buffer after having committed it. For incrementally consumed buffers,
this is incorrect as it will modify the buffer address.
Store the pre-committed value and return that. If that isn't done,
then the initial part of the buffer goes unused, and the application
will rightly assume the content arrived at the start of the userspace
buffer while the kernel will have placed it later in the buffer. Or it
can cause a spurious -EFAULT returned in the CQE, depending on the
buffer size. As bounds are suitably checked for doing the actual IO,
no adverse side effects are possible - it's just a data misplacement
within the existing buffer.
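An abbreviated sketch of the fix in io_ring_buffer_select() (variable
names illustrative):

    /* save the address before committing: for incrementally consumed
     * buffers, the commit advances buf->addr */
    void __user *ret = u64_to_user_ptr(buf->addr);

    if (*len == 0 || *len > buf->len)
        *len = buf->len;
    /* non-pollable file: commit upfront, then return the saved address */
    io_kbuf_commit(req, bl, *len, 1);
    return ret;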
Reported-by: Gwendal Fernet <gwendalfernet@gmail.com>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reports that ->msg_inq may get used uninitialized from the
following path:
BUG: KMSAN: uninit-value in io_recv_buf_select io_uring/net.c:1094 [inline]
BUG: KMSAN: uninit-value in io_recv+0x930/0x1f90 io_uring/net.c:1158
io_recv_buf_select io_uring/net.c:1094 [inline]
io_recv+0x930/0x1f90 io_uring/net.c:1158
io_issue_sqe+0x420/0x2130 io_uring/io_uring.c:1740
io_queue_sqe io_uring/io_uring.c:1950 [inline]
io_req_task_submit+0xfa/0x1d0 io_uring/io_uring.c:1374
io_handle_tw_list+0x55f/0x5c0 io_uring/io_uring.c:1057
tctx_task_work_run+0x109/0x3e0 io_uring/io_uring.c:1121
tctx_task_work+0x6d/0xc0 io_uring/io_uring.c:1139
task_work_run+0x268/0x310 kernel/task_work.c:239
io_run_task_work+0x43a/0x4a0 io_uring/io_uring.h:343
io_cqring_wait io_uring/io_uring.c:2527 [inline]
__do_sys_io_uring_enter io_uring/io_uring.c:3439 [inline]
__se_sys_io_uring_enter+0x204f/0x4ce0 io_uring/io_uring.c:3330
__x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3330
x64_sys_call+0xce5/0x3c30 arch/x86/include/generated/asm/syscalls_64.h:427
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
and it is correct, as it's never initialized upfront. Hence the first
submission can end up using it uninitialized, if the recv wasn't
successful and the networking stack didn't honor ->msg_get_inq being set
and filling in the output value of ->msg_inq as requested.
Set it to 0 upfront when it's allocated, just to silence this KMSAN
warning. There's no side effect to using it uninitialized; it'll just
potentially cause the next receive to use a recv value hint that's not
accurate.
Fixes: c6f32c7d9e09 ("io_uring/net: get rid of ->prep_async() for receive side")
Reported-by: syzbot+068ff190354d2f74892f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is not the hot path; it's a slow path. Yet the locking for it
sits in the hot path, and __cold does not prevent it from being
inlined. Move the locking into the function itself, and mark it
noinline as well to avoid it polluting the icache of the hot path.
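Illustrative shape of the change (function and lock names
hypothetical):

    /* __cold only hints at placement; it does not stop the compiler
     * from inlining a small function into its hot-path caller. */
    static noinline __cold void io_slow_path(struct io_ring_ctx *ctx)
    {
        /* the lock is now taken here, in the slow path itself */
        spin_lock(&ctx->completion_lock);
        /* ... slow-path handling ... */
        spin_unlock(&ctx->completion_lock);
    }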
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reports that a recent fix causes nesting issues between the
(now) raw timeout lock and the eventfd locking:
=============================
[ BUG: Invalid wait context ]
6.13.0-rc4-00080-g9828a4c0901f #29 Not tainted
-----------------------------
kworker/u32:0/68094 is trying to lock:
ffff000014d7a520 (&ctx->wqh#2){..-.}-{3:3}, at: eventfd_signal_mask+0x64/0x180
other info that might help us debug this:
context-{5:5}
6 locks held by kworker/u32:0/68094:
#0: ffff0000c1d98148 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work+0x4e8/0xfc0
#1: ffff80008d927c78 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x53c/0xfc0
#2: ffff0000c59bc3d8 (&ctx->completion_lock){+.+.}-{3:3}, at: io_kill_timeouts+0x40/0x180
#3: ffff0000c59bc358 (&ctx->timeout_lock){-.-.}-{2:2}, at: io_kill_timeouts+0x48/0x180
#4: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
#5: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
stack backtrace:
CPU: 7 UID: 0 PID: 68094 Comm: kworker/u32:0 Not tainted 6.13.0-rc4-00080-g9828a4c0901f #29
Hardware name: linux,dummy-virt (DT)
Workqueue: iou_exit io_ring_exit_work
Call trace:
show_stack+0x1c/0x30 (C)
__dump_stack+0x24/0x30
dump_stack_lvl+0x60/0x80
dump_stack+0x14/0x20
__lock_acquire+0x19f8/0x60c8
lock_acquire+0x1a4/0x540
_raw_spin_lock_irqsave+0x90/0xd0
eventfd_signal_mask+0x64/0x180
io_eventfd_signal+0x64/0x108
io_req_local_work_add+0x294/0x430
__io_req_task_work_add+0x1c0/0x270
io_kill_timeout+0x1f0/0x288
io_kill_timeouts+0xd4/0x180
io_uring_try_cancel_requests+0x2e8/0x388
io_ring_exit_work+0x150/0x550
process_one_work+0x5e8/0xfc0
worker_thread+0x7ec/0xc80
kthread+0x24c/0x300
ret_from_fork+0x10/0x20
because after the preempt-rt fix for the timeout lock nesting inside
the io-wq lock, we now have the eventfd spinlock nesting inside the
raw timeout spinlock.
Rather than play whack-a-mole with other nesting on the timeout lock,
split the deletion and killing of timeouts so queueing the task_work
for the timeout cancelations can get done outside of the timeout lock.
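A hedged sketch of the two-pass shape (types and helpers abbreviated,
task-matching logic elided; not the literal diff):

    struct io_timeout *timeout, *tmp;
    LIST_HEAD(list);

    /* pass 1: only detach timeouts while holding the raw lock */
    raw_spin_lock_irq(&ctx->timeout_lock);
    list_for_each_entry_safe(timeout, tmp, &ctx->timeout_list, list)
        list_move(&timeout->list, &list);
    raw_spin_unlock_irq(&ctx->timeout_lock);

    /* pass 2: queue cancelation task_work with the lock dropped, so
     * signaling an eventfd can take its own non-raw spinlock */
    list_for_each_entry_safe(timeout, tmp, &list, list) {
        list_del_init(&timeout->list);
        io_kill_timeout(cmd_to_io_kiocb(timeout), -ECANCELED);
    }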
Reported-by: syzbot+b1fc199a40b65d601b65@syzkaller.appspotmail.com
Fixes: 020b40f35624 ("io_uring: make ctx->timeout_lock a raw spinlock")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The io-wq path can downgrade a multishot request to oneshot mode,
however io_read_mshot() doesn't handle that and would still post
multiple CQEs. That's not allowed, because io_req_post_cqe() has
stricter context requirements.
The described issue can only happen with pollable files that don't
support FMODE_NOWAIT, which is an odd combination, so even if allowed
it should be fairly rare.
Cc: stable@vger.kernel.org
Reported-by: chase xd <sl1589472800@gmail.com>
Fixes: bee1d5becdf5b ("io_uring: disable io-wq execution of multishot NOWAIT requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c5c8c4a50a882fd581257b81bf52eee260ac29fd.1735407848.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit mistakenly moved the clearing of the in-progress byte
count into the section that's dependent on having a cached iovec or not,
but it should be cleared for any IO. If not, then extra bytes may be
added at IO completion time, causing potentially weird behavior like
over-reporting the amount of IO done.
Fixes: d7f11616edf5 ("io_uring/rw: Allocate async data through helper")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202412271132.a09c3500-lkp@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's a pointer, so don't use 0 for it. sparse throws a warning for
that, as the kernel test robot noticed.
Fixes: d7f11616edf5 ("io_uring/rw: Allocate async data through helper")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412180253.YML3qN4d-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit changed overwriting kiocb->ki_flags with
->f_iocb_flags to masking it in instead. This breaks retry situations,
where we don't necessarily want to retain previously set flags, like
IOCB_NOWAIT.
The use case needs IOCB_HAS_METADATA to be persistent, but the change
makes all flags persistent, which is an issue. Add a request flag to
track whether the request has metadata or not, as that is persistent
across issues.
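The idea, sketched (REQ_F_HAS_METADATA is the flag added here;
surrounding code abbreviated):

    /* at prep time, when PI metadata is attached to the request: */
    req->flags |= REQ_F_HAS_METADATA;

    /* at (re)issue time, ki_flags is rebuilt from ->f_iocb_flags, so
     * transient flags like IOCB_NOWAIT are not retained, while the
     * metadata bit is re-applied from the persistent request flag: */
    rw->kiocb.ki_flags = req->file->f_iocb_flags;
    if (req->flags & REQ_F_HAS_METADATA)
        rw->kiocb.ki_flags |= IOCB_HAS_METADATA;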
Fixes: 59a7d12a7fb5 ("io_uring: introduce attributes for read/write and PI support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kfree can handle slab objects nowadays. Drop the extra callback and just
use kfree.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-10-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are two remaining uses of the old async data allocator that do
not rely on the alloc cache. I don't want to make them use the new
allocator helper because that would require an if (cache) check, which
would result in dead code for the cached case (for callers passing a
cache, gcc can't prove the cache isn't NULL and will therefore preserve
the check). Since this is an inline function and just a few lines long,
keep a second helper to deal with cases where we don't have an async
data cache.
No functional change intended here. This is just moving the helper
around and making it inline.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-9-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This abstracts away the cache details.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-8-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This abstracts away the cache details.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-7-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This abstracts away the cache details and simplifies the code.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-6-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This abstracts away the cache details to simplify the code.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-5-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of open-coding the allocation, use the generic alloc_cache
helper.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-4-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This helper replaces io_alloc_async_data by using the folded allocation.
Do it in a header to allow the compiler to decide whether to inline.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-3-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BUG: KASAN: slab-use-after-free in __lock_acquire+0x370b/0x4a10 kernel/locking/lockdep.c:5089
Call Trace:
<TASK>
...
_raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
try_to_wake_up+0xb5/0x23c0 kernel/sched/core.c:4205
io_sq_thread_park+0xac/0xe0 io_uring/sqpoll.c:55
io_sq_thread_finish+0x6b/0x310 io_uring/sqpoll.c:96
io_sq_offload_create+0x162/0x11d0 io_uring/sqpoll.c:497
io_uring_create io_uring/io_uring.c:3724 [inline]
io_uring_setup+0x1728/0x3230 io_uring/io_uring.c:3806
...
Kun Hu reports that the SQPOLL creation error path has a UAF, which
happens if io_uring_alloc_task_context() fails and then io_sq_thread()
manages to run and complete before the rest of the error handling code,
which means io_sq_thread_finish() is looking at an already killed task.
Note that this is mostly theoretical, requiring fault injection on
the allocation side to trigger in practice.
Cc: stable@vger.kernel.org
Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0f2f1aa5729332612bd01fe0f2f385fd1f06ce7c.1735231717.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The allocation paths that use alloc_cache duplicate the same code
pattern, sometimes in a quite convoluted way. Fold the allocation into
the cache code itself, making it just an allocator function, and keeping
the cache policy invisible to callers. Another justification for doing
this, beyond code simplicity, is that it makes it trivial to test the
impact of disabling the cache and using slab directly, which I've used
for slab improvement experiments.
One relevant detail is that we provide a callback to optionally
initialize memory only when we actually reach slab. This allows us to
avoid blindly executing the allocation with GFP_ZERO and only clean
fields when they matter.
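A sketch of the folded allocator (field and helper names as in
io_uring/alloc_cache.h, slightly abbreviated):

    static inline void *io_cache_alloc(struct io_alloc_cache *cache, gfp_t gfp,
                                       void (*init_once)(void *obj))
    {
        if (unlikely(!cache->nr_cached)) {
            void *obj = kmalloc(cache->elem_size, gfp);

            /* only a fresh slab object needs initializing; cached
             * objects keep their cleaned-up state from release time */
            if (obj && init_once)
                init_once(obj);
            return obj;
        }
        return io_alloc_cache_get(cache);
    }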
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With *ENTER_EXT_ARG_REG, instead of passing a user pointer with
arguments for the waiting loop, the user can specify an offset into a
pre-mapped region of memory, in which case the range
[offset, offset + sizeof(io_uring_reg_wait)) will be interpreted as
the argument.
As we address a kernel array using a user-given index, it would be
subject to speculation-type exploits. Use array_index_nospec() to
prevent that. Make sure to pass not the full region size but to
truncate by the maximum offset allowed, considering the structure size.
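A hedged sketch of the pattern (helper name illustrative;
array_index_nospec() from <linux/nospec.h> clamps the index even under
speculation):

    static struct io_uring_reg_wait *io_reg_wait_ptr(void *base, size_t size,
                                                     size_t offset)
    {
        /* truncate by the max offset a whole struct still fits at */
        size_t max_off = size - sizeof(struct io_uring_reg_wait);

        if (offset > max_off)
            return ERR_PTR(-EFAULT);
        /* no speculative out-of-bounds access via the user index */
        offset = array_index_nospec(offset, max_off + 1);
        return base + offset;
    }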
Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments")
Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e3d9da7c43d619de7bcf41d1cd277ab2688c443.1733694126.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When io_check_coalesce_buffer() meets a single page buffer it bails
out and reports that the buffer can be coalesced. That's fine for
registered buffers as io_coalesce_buffer() wouldn't change anything,
but the region code now uses the function to decide whether to vmap
the buffer or not.
Report that a single page buffer is trivially coalescable and let
io_sqe_buffer_register() filter them.
Fixes: c4d0ac1c1567 ("io_uring/memmap: optimise single folio regions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cb83e053f318857068447d40c95becebcd8aeced.1733689833.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove unnecessary call to iov_iter_save_state() in io_prep_rw_setup()
as io_import_iovec() already does this. Then the result from
io_import_iovec() can be returned directly.
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Tested-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20241207004144.783631-1-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shifting reg.bgid << IORING_OFF_PBUF_SHIFT results in a promotion
from __u16 to a 32 bit signed integer, this is then sign extended
to a 64 bit unsigned long on 64 bit architectures. If reg.bgid is
greater than 0x7fff then this leads to a sign extended result where
all the upper 32 bits of mmap_offset are set to 1. Fix this by
casting reg.bgid to the same type as mmap_offset before performing
the shift.
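The promotion is easy to reproduce in userspace; a standalone demo
(IORING_OFF_PBUF_SHIFT is 16 in the uapi header):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t bgid = 0x8000;    /* any bgid > 0x7fff triggers it */

        /* bgid promotes to signed int before the shift; bit 31 ends
         * up set, so widening to 64 bits sign-extends the value */
        unsigned long bad  = bgid << 16;
        /* casting to the destination type first avoids the promotion */
        unsigned long good = (unsigned long)bgid << 16;

        printf("bad:  %#lx\n", bad);    /* 0xffffffff80000000 on LP64 */
        printf("good: %#lx\n", good);   /* 0x80000000 */
        return 0;
    }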
Fixes: ef62de3c4ad5 ("io_uring/kbuf: use region api for pbuf rings")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20241204153923.401674-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the ability to pass additional attributes along with read/write.
The application can prepare attribute-specific information and pass its
address using the SQE field:
    __u64 attr_ptr;
along with setting a mask indicating which attributes are being passed:
    __u64 attr_type_mask;
Overall, 64 attributes are allowed and currently one attribute,
'IORING_RW_ATTR_FLAG_PI', is supported.
With the PI attribute, userspace can pass the following information:
- flags: integrity check flags IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG}
- len: length of the PI/metadata buffer
- addr: address of the metadata buffer
- seed: seed value for reftag remapping
- app_tag: application-defined 16-bit value
Process this information to prepare a uio_meta descriptor and pass it
down using kiocb->private.
The PI attribute is supported only for direct IO.
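A hedged userspace sketch of attaching a PI attribute to a write (uapi
names as introduced by this series; fd, buffers, seed, and the liburing
prep call are assumed context):

    struct io_uring_attr_pi pi = {
        .flags   = IO_INTEGRITY_CHK_GUARD | IO_INTEGRITY_CHK_REFTAG,
        .addr    = (__u64)(uintptr_t)pi_buf,    /* metadata buffer */
        .len     = pi_buf_len,                  /* its length */
        .seed    = reftag_seed,                 /* reftag remapping */
        .app_tag = 0,                           /* app-defined 16 bits */
    };
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_write(sqe, fd, buf, len, offset);
    sqe->attr_type_mask = IORING_RW_ATTR_FLAG_PI;   /* which attrs follow */
    sqe->attr_ptr = (__u64)(uintptr_t)&pi;          /* where they live */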
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20241128112240.8867-7-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All mapped memory is now backed by regions, and we can unify and clean
up io_region_validate_mmap() and io_uring_mmap(). Extract a function
that looks up a region; the rest of the handling should be generic and
just needs the region.
There is one more piece of ring-type-specific code, i.e. the mmap size
truncation quirk for IORING_OFF_[S,C]Q_RING, which is left as is.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f5e1eda1562bfd34276de07465525ae5f10e1e84.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert internal parts of the provided buffer ring management to the
region API. It's the last non-region mapped ring we have, so it also
kills a bunch of now unused memmap.c helpers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6c40cf7beaa648558acd4d84bc0fb3279a35d74b.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct io_buffer_list refcounting was needed for RCU based sync with
mmap, now we can kill it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4a9cc54bf0077bb2bf2f3daf917549ddd41080da.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A preparation / cleanup patch simplifying the buf ring - mmap
synchronisation. Instead of relying on RCU, which is trickier, do it by
grabbing the mmap_lock whenever anyone tries to publish or remove a
registered buffer to / from ->io_bl_xa.
Modifications of the xarray should always be protected by both
->uring_lock and ->mmap_lock, while lookups should hold either of them.
While a struct io_buffer_list is in the xarray, the mmap related fields
like ->flags and ->buf_pages should stay stable.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/af13bde56ee1a26bcaefaa9aad37a9ea318a590e.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert internal parts of the CQ/SQ array management to the region API.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46fc3c801290d6b1ac16023d78f6b8e685c87fd6.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert internal parts of the SQ management to the region API.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1fb73ced6b835cb319ab0fe1dc0b2e982a9a5650.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A preparation patch: pass the context to io_register_free_rings().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c1865fd2b3d4db22d1a1aac7dd06ea22cb990834.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The patch implements mmap for the param region and enables the kernel
allocation mode. Internally it uses a fixed mmap offset, however the
user has to use the offset returned in
struct io_uring_region_desc::mmap_offset.
Note, mmap doesn't and can't take ->uring_lock and the region / ring
lookup is protected by ->mmap_lock, and it's directly peeking at
ctx->param_region. We can't protect io_create_region() with the
mmap_lock as it'd deadlock, which is why io_create_region_mmap_safe()
initialises it for us in a temporary variable and then publishes it
with the lock taken. It's intentionally decoupled from main region
helpers, and in the future we might want to have a list of active
regions, which then could be protected by the ->mmap_lock.
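A sketch of the publish-under-lock pattern (shapes from this series,
abbreviated):

    int io_create_region_mmap_safe(struct io_ring_ctx *ctx,
                                   struct io_mapped_region *region,
                                   struct io_uring_region_desc *reg,
                                   unsigned long mmap_offset)
    {
        struct io_mapped_region tmp_region = {};
        int ret;

        /* build the region without holding the mmap_lock... */
        ret = io_create_region(ctx, &tmp_region, reg, mmap_offset);
        if (ret)
            return ret;

        /* ...and only publish it to mmap lookups under the lock */
        guard(mutex)(&ctx->mmap_lock);
        *region = tmp_region;
        return 0;
    }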
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0f1212bd6af7fb39b63514b34fae8948014221d1.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow the kernel to allocate memory for a region. That's the classical
way SQ/CQ are allocated. It's not yet useful to user space as there
is no way to mmap it, which is why it's explicitly disabled in
io_register_mem_region().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7b8c40e6542546bbf93f4842a9a42a7373b81e0d.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kernel-allocated compound pages will have just one reference for the
entire page array; add a flag telling io_free_region() about that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a7abfa7535e9728d5fcade29a1ea1605ec2c04ce.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for adding kernel-allocated regions, extract a new
helper that pins user pages.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a17d7c39c3de4266b66b75b2dcf768150e1fc618.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't need to vmap if memory is already physically contiguous. There
are two important cases it covers: PAGE_SIZE regions and huge pages.
Use io_check_coalesce_buffer() to get the number of contiguous folios.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5240af23064a824c29d14d2406f1ae764bf4505.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Regions are going to become more complex with allocation options and
optimisations, so I want to split initialisation into steps, and for
that it needs a sane failure path. Reuse io_free_region(): it's smart
enough to undo only what's needed and leaves the structure in a
consistent state.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b853b4ec407cc80d033d021bdd2c14e22378fc78.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move memory accounting before page pinning. It shouldn't even try to
pin pages if it's not allowed, and accounting is also relatively
inexpensive. It also gives a better code structure, as we do the
generic accounting first and can then branch off for the different
mapping types.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e242b8038411a222e8b269d35e021fa5015289f.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for kernel-allocated regions, add a flag telling whether
the region contains user-pinned pages or not.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0dc91564642654405bab080b7ec911cb4a43ec6e.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add internal flags for struct io_mapped_region. The first flag we need
is IO_REGION_F_VMAPPED, that indicates that the pointer has to be
unmapped on region destruction. For now all regions are vmap'ed, so it's
set unconditionally.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5a3d8046a038da97c0f8a8c8f1733fa3fc689d31.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_try_coalesce_buffer() is a useful helper collecting info about a
set of pages, and I want to reuse it for analysing ring/etc. mappings.
I don't need the entire thing and am only interested in whether the
buffer can be coalesced into a single page, but that's still better
than duplicating the parsing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/353b447953cd5d34c454a7d909bb6024c391d6e2.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->resize_lock is used for resizing rings, but it's a good idea to
reuse it in other cases as well. Rename it to mmap_lock, as it protects
against races with mmap.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/68f705306f3ac4d2fb999eb80ea1615015ce9f7f.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|