summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)Author
2 daysMerge tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: - Optimization to avoid reference counts on non-cloned registered buffers. This is how these buffers were handled prior to having cloning support, and we can still use that approach as long as the buffers haven't been cloned to another ring. - Cleanup and improvement for uring_cmd, where btrfs was the only user of storing allocated data for the lifetime of the uring_cmd. Clean that up so we can get rid of the need to do that. - Avoid unnecessary memory copies in uring_cmd usage. This is particularly important as a lot of uring_cmd usage necessitates the use of 128b SQEs. - A few updates for recv multishot, where it's now possible to add fairness limits for limiting how much is transferred for each retry loop. Additionally, recv multishot now supports an overall cap as well, where once reached the multishot recv will terminate. The latter is useful for buffer management and juggling many recv streams at the same time. - Add support for returning the TX timestamps via a new socket command. This feature can work in either singleshot or multishot mode, where the latter triggers a completion whenever new timestamps are available. This is an alternative to using the existing error queue. - Add support for an io_uring "mock" file, which is the start of being able to do 100% targeted testing in terms of exercising io_uring request handling. The idea is to have a file type that can be anything the tester would like, and behave exactly how you want it to behave in terms of hitting the code paths you want. - Improve zcrx by using sgtables to de-duplicate and improve dma address handling. - Prep work for supporting larger pages for zcrx. - Various little improvements and fixes. * tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits) io_uring/zcrx: fix leaking pages on sg init fail io_uring/zcrx: don't leak pages on account failure io_uring/zcrx: fix null ifq on area destruction io_uring: fix breakage in EXPERT menu io_uring/cmd: remove struct io_uring_cmd_data btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag io_uring/zcrx: account area memory io_uring: export io_[un]account_mem io_uring/net: Support multishot receive len cap io_uring: deduplicate wakeup handling io_uring/net: cast min_not_zero() type io_uring/poll: cleanup apoll freeing io_uring/net: allow multishot receive per-invocation cap io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags io_uring/net: use passed in 'len' in io_recv_buf_select() io_uring/zcrx: prepare fallback for larger pages io_uring/zcrx: assert area type in io_zcrx_iov_page io_uring/zcrx: allocate sgtable for umem areas io_uring/zcrx: introduce io_populate_area_dma ...
3 daysMerge tag 'vfs-6.17-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc VFS updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add ext4 IOCB_DONTCACHE support This refactors the address_space_operations write_begin() and write_end() callbacks to take const struct kiocb * as their first argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate to the filesystem's buffered I/O path. Ext4 is updated to implement handling of the IOCB_DONTCACHE flag and advertises support via the FOP_DONTCACHE file operation flag. Additionally, the i915 driver's shmem write paths are updated to bypass the legacy write_begin/write_end interface in favor of directly calling write_iter() with a constructed synchronous kiocb. Another i915 change replaces a manual write loop with kernel_write() during GEM shmem object creation. Cleanups: - don't duplicate vfs_open() in kernel_file_open() - proc_fd_getattr(): don't bother with S_ISDIR() check - fs/ecryptfs: replace snprintf with sysfs_emit in show function - vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() - filelock: add new locks_wake_up_waiter() helper - fs: Remove three arguments from block_write_end() - VFS: change old_dir and new_dir in struct renamedata to dentrys - netfs: Remove unused declaration netfs_queue_write_request() Fixes: - eventpoll: Fix semi-unbounded recursion - eventpoll: fix sphinx documentation build warning - fs/read_write: Fix spelling typo - fs: annotate data race between poll_schedule_timeout() and pollwake() - fs/pipe: set FMODE_NOWAIT in create_pipe_files() - docs/vfs: update references to i_mutex to i_rwsem - fs/buffer: remove comment about hard sectorsize - fs/buffer: remove the min and max limit checks in __getblk_slow() - fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable - fs_context: fix parameter name in infofc() macro - fs: Prevent file descriptor table allocations exceeding INT_MAX" * tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits) netfs: Remove unused declaration netfs_queue_write_request() eventpoll: fix sphinx documentation build warning ext4: support uncached buffered I/O mm/pagemap: add write_begin_get_folio() helper function fs: change write_begin/write_end interface to take struct kiocb * drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter drm/i915: Use kernel_write() in shmem object create eventpoll: Fix semi-unbounded recursion vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable fs/buffer: remove the min and max limit checks in __getblk_slow() fs: Prevent file descriptor table allocations exceeding INT_MAX fs: Remove three arguments from block_write_end() fs/ecryptfs: replace snprintf with sysfs_emit in show function fs: annotate suspected data race between poll_schedule_timeout() and pollwake() docs/vfs: update references to i_mutex to i_rwsem fs/buffer: remove comment about hard sectorsize fs_context: fix parameter name in infofc() macro VFS: change old_dir and new_dir in struct renamedata to dentrys proc_fd_getattr(): don't bother with S_ISDIR() check ...
10 daysio_uring/zcrx: fix leaking pages on sg init failPavel Begunkov
If sg_alloc_table_from_pages() fails, io_import_umem() returns without cleaning up pinned pages first. Fix it. Fixes: b84621d96ee02 ("io_uring/zcrx: allocate sgtable for umem areas") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9fd94d1bc8c316611eccfec7579799182ff3fb0a.1753091564.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 daysio_uring/zcrx: don't leak pages on account failurePavel Begunkov
Someone needs to release pinned pages in io_import_umem() if accounting fails. Assign them to the area but return an error, the following io_zcrx_free_area() will clean them up. Fixes: 262ab205180d2 ("io_uring/zcrx: account area memory") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e19f283a912f200c0d427e376cb789fc3f3d69bc.1753091564.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 daysio_uring/zcrx: fix null ifq on area destructionPavel Begunkov
Dan reports that ifq can be null when infering arguments for io_unaccount_mem() from io_zcrx_free_area(). Fix it by always setting a correct ifq. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202507180628.gBxrOgqr-lkp@intel.com/ Fixes: 262ab205180d2 ("io_uring/zcrx: account area memory") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20670d163bb90dba2a81a4150f1125603cefb101.1753091564.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 daysMerge tag 'io_uring-6.16-20250718' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - dmabug offset fix for zcrx - Fix for the POLLERR connect work around handling * tag 'io_uring-6.16-20250718' of git://git.kernel.dk/linux: io_uring/poll: fix POLLERR handling io_uring/zcrx: disallow user selected dmabuf offset and size
13 daysio_uring/cmd: remove struct io_uring_cmd_dataCaleb Sander Mateos
There are no more users of struct io_uring_cmd_data and its op_data field. Remove it to shave 8 bytes from struct io_async_cmd and eliminate a store and load for every uring_cmd. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250708202212.2851548-5-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 daysio_uring/cmd: introduce IORING_URING_CMD_REISSUE flagCaleb Sander Mateos
Add a flag IORING_URING_CMD_REISSUE that ->uring_cmd() implementations can use to tell whether this is the first or subsequent issue of the uring_cmd. This will allow ->uring_cmd() implementations to store information in the io_uring_cmd's pdu across issues. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250708202212.2851548-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-16io_uring/zcrx: account area memoryPavel Begunkov
zcrx areas can be quite large and need to be accounted and checked against RLIMIT_MEMLOCK. In practise it shouldn't be a big issue as the inteface already requires cap_net_admin. Cc: stable@vger.kernel.org Fixes: cf96310c5f9a0 ("io_uring/zcrx: add io_zcrx_area") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4b53f0c575bd062f63d12bec6cac98037fc66aeb.1752699568.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-16io_uring: export io_[un]account_memPavel Begunkov
Export pinned memory accounting helpers, they'll be used by zcrx shortly. Cc: stable@vger.kernel.org Fixes: cf96310c5f9a0 ("io_uring/zcrx: add io_zcrx_area") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9a61e54bd89289b39570ae02fe620e12487439e4.1752699568.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-16io_uring/poll: fix POLLERR handlingPavel Begunkov
8c8492ca64e7 ("io_uring/net: don't retry connect operation on EPOLLERR") is a little dirty hack that 1) wrongfully assumes that POLLERR equals to a failed request, which breaks all POLLERR users, e.g. all error queue recv interfaces. 2) deviates the connection request behaviour from connect(2), and 3) racy and solved at a wrong level. Nothing can be done with 2) now, and 3) is beyond the scope of the patch. At least solve 1) by moving the hack out of generic poll handling into io_connect(). Cc: stable@vger.kernel.org Fixes: 8c8492ca64e79 ("io_uring/net: don't retry connect operation on EPOLLERR") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3dc89036388d602ebd84c28e5042e457bdfc952b.1752682444.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-16io_uring/net: Support multishot receive len capNorman Maurer
At the moment its very hard to do fine grained backpressure when using multishot as the kernel might produce a lot of completions before the user has a chance to cancel a previous submitted multishot recv. This change adds support to issue a multishot recv that is capped by a len, which means the kernel will only rearm until X amount of data is received. When the limit is reached the completion will signal to the user that a re-arm needs to happen manually by not setting the IORING_CQE_F_MORE flag. Signed-off-by: Norman Maurer <norman_maurer@apple.com> Link: https://lore.kernel.org/r/20250715140249.31186-1-norman_maurer@apple.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15io_uring: deduplicate wakeup handlingJens Axboe
Both io_poll_wq_wake() and io_cqring_wake() contain the exact same code, and most of the comment in the latter applies equally to both. Move the test and wakeup handling into a basic helper that they can both use, and move part of the comment that applies generically to this new helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-14io_uring/net: cast min_not_zero() typeJens Axboe
kernel test robot reports that xtensa complains about different signedness for a min_not_zero() comparison. Cast the int part to size_t to avoid this issue. Fixes: e227c8cdb47b ("io_uring/net: use passed in 'len' in io_recv_buf_select()") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507150504.zO5FsCPm-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-14io_uring/zcrx: disallow user selected dmabuf offset and sizePavel Begunkov
zcrx shouldn't be so frivolous about cutting a dmabuf sgtable and taking a subrange into it, the dmabuf layer might be not expecting that. It shouldn't be a problem for now, but since the zcrx dmabuf support is new and there shouldn't be any real users, let's play safe and reject user provided ranges into dmabufs. Also, it shouldn't be needed as userspace should size them appropriately. Fixes: a5c98e9424573 ("io_uring/zcrx: dmabuf backed zerocopy receive") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/be899f1afed32053eb2e2079d0da241514674aca.1752443579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-12io_uring/poll: cleanup apoll freeingJens Axboe
No point having REQ_F_POLLED in both IO_REQ_CLEAN_FLAGS and in IO_REQ_CLEAN_SLOW_FLAGS, and having both io_free_batch_list() and then io_clean_op() check for it and clean it. Move REQ_F_POLLED to IO_REQ_CLEAN_SLOW_FLAGS and drop it from IO_REQ_CLEAN_FLAGS, and have only io_free_batch_list() do the check and freeing. Link: https://lore.kernel.org/io-uring/20250712000344.1579663-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-11Merge tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Remove a pointless warning in the zcrx code - Fix for MSG_RING commands, where the allocated io_kiocb needs to be freed under RCU as well - Revert the work-around we had in place for the anon inodes pretending to be regular files. Since that got reworked upstream, the work-around is no longer needed * tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux: Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well" io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU io_uring/zcrx: fix pp destruction warnings
2025-07-10io_uring/net: allow multishot receive per-invocation capJens Axboe
If an application is handling multiple receive streams using recv multishot, then the amount of retries and buffer peeking for multishot and bundles can process too much per socket before moving on. This isn't directly controllable by the application. By default, io_uring will retry a recv MULTISHOT_MAX_RETRY (32) times, if the socket keeps having data to receive. And if using bundles, then each bundle peek will potentially map up to PEEK_MAX_IMPORT (256) iovecs of data. Once these limits are hit, then a requeue operation will be done, where the request will get retried after other pending requests have had a time to get executed. Add support for capping the per-invocation receive length, before a requeue condition is considered for each receive. This is done by setting sqe->mshot_len to the byte value. For example, if this is set to 1024, then each receive will be requeued by 1024 bytes received. Link: https://lore.kernel.org/io-uring/20250709203420.1321689-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-10io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flagsJens Axboe
There's plenty of space left, as sr->flags is a 16-bit type. The UAPI bits are the lower 8 bits, as that's all that sqe->ioprio can carry in the SQE anyway. Use a few of the upper 8 bits for internal uses, rather than have two separate flags entries. Link: https://lore.kernel.org/io-uring/20250709203420.1321689-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-10io_uring/net: use passed in 'len' in io_recv_buf_select()Jens Axboe
len is a pointer to the desired len, use that rather than grab it from sr->len again. No functional changes as of this patch, but it does prepare io_recv_buf_select() for getting passed in a value that differs from sr->len. Link: https://lore.kernel.org/io-uring/20250709203420.1321689-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08io_uring/zcrx: prepare fallback for larger pagesPavel Begunkov
io_zcrx_copy_chunk() processes one page at a time, which won't be sufficient when the net_iov size grows. Introduce a structure keeping the target niov page and other parameters, it's more convenient and can be reused later. And add a helper function that can efficient copy buffers of an arbitrary length. For 64bit archs the loop inside should be compiled out. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e84bc705a4e1edeb9aefff470d96558d8232388f.1751466461.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08io_uring/zcrx: assert area type in io_zcrx_iov_pagePavel Begunkov
Add a simple debug assertion to io_zcrx_iov_page() making it's not trying to return pages for a dmabuf area. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c3c30a926a18436a399a1768f3cc86c76cd17fa7.1751466461.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08io_uring/zcrx: allocate sgtable for umem areasPavel Begunkov
Currently, dma addresses for umem areas are stored directly in niovs. It's memory efficient but inconvenient. I need a better format 1) to share code with dmabuf areas, and 2) for disentangling page, folio and niov sizes. dmabuf already provides sg_table, create one for user memory as well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/f3c15081827c1bf5427d3a2e693bc526476b87ee.1751466461.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08io_uring/zcrx: introduce io_populate_area_dmaPavel Begunkov
Add a helper that initialises page-pool dma addresses from a sg table. It'll be reused in following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/a8972a77be9b5675abc585d6e2e6e30f9c7dbd85.1751466461.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08io_uring/zcrx: return error from io_zcrx_map_area_*Pavel Begunkov
io_zcrx_map_area_*() helpers return the number of processed niovs, which we use to unroll some of the mappings for user memory areas. It's unhandy, and dmabuf doesn't care about it. Return an error code instead and move failure partial unmapping into io_zcrx_map_area_umem(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/42668e82be3a84b07ee8fc76d1d6d5ac0f137fe5.1751466461.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08io_uring/zcrx: always pass page to io_zcrx_copy_chunkPavel Begunkov
io_zcrx_copy_chunk() currently takes either a page or virtual address. Unify the parameters, make it take pages and resolve the linear part into a page the same way general networking code does that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/b8f9f4bac027f5f44a9ccf85350912d1db41ceb8.1751466461.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well"Jens Axboe
This reverts commit 6f11adcc6f36ffd8f33dbdf5f5ce073368975bc3. The problematic commit was fixed in mainline, so the work-around in io_uring can be removed at this point. Anonymous inodes no longer pretend to be regular files after: 1e7ab6f67824 ("anon_inode: rework assertions") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCUJens Axboe
syzbot reports that defer/local task_work adding via msg_ring can hit a request that has been freed: CPU: 1 UID: 0 PID: 19356 Comm: iou-wrk-19354 Not tainted 6.16.0-rc4-syzkaller-00108-g17bbde2e1716 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0xd2/0x2b0 mm/kasan/report.c:521 kasan_report+0x118/0x150 mm/kasan/report.c:634 io_req_local_work_add io_uring/io_uring.c:1184 [inline] __io_req_task_work_add+0x589/0x950 io_uring/io_uring.c:1252 io_msg_remote_post io_uring/msg_ring.c:103 [inline] io_msg_data_remote io_uring/msg_ring.c:133 [inline] __io_msg_ring_data+0x820/0xaa0 io_uring/msg_ring.c:151 io_msg_ring_data io_uring/msg_ring.c:173 [inline] io_msg_ring+0x134/0xa00 io_uring/msg_ring.c:314 __io_issue_sqe+0x17e/0x4b0 io_uring/io_uring.c:1739 io_issue_sqe+0x165/0xfd0 io_uring/io_uring.c:1762 io_wq_submit_work+0x6e9/0xb90 io_uring/io_uring.c:1874 io_worker_handle_work+0x7cd/0x1180 io_uring/io-wq.c:642 io_wq_worker+0x42f/0xeb0 io_uring/io-wq.c:696 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> which is supposed to be safe with how requests are allocated. But msg ring requests alloc and free on their own, and hence must defer freeing to a sane time. Add an rcu_head and use kfree_rcu() in both spots where requests are freed. Only the one in io_msg_tw_complete() is strictly required as it has been visible on the other ring, but use it consistently in the other spot as well. This should not cause any other issues outside of KASAN rightfully complaining about it. Link: https://lore.kernel.org/io-uring/686cd2ea.a00a0220.338033.0007.GAE@google.com/ Reported-by: syzbot+54cbbfb4db9145d26fc2@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Fixes: 0617bb500bfa ("io_uring/msg_ring: improve handling of target CQE posting") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-07io_uring/rw: cast rw->flags assignment to rwf_tJens Axboe
kernel test robot reports that a recent change of the sqe->rw_flags field throws a sparse warning on 32-bit archs: >> io_uring/rw.c:291:19: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted __kernel_rwf_t [usertype] flags @@ got unsigned int @@ io_uring/rw.c:291:19: sparse: expected restricted __kernel_rwf_t [usertype] flags io_uring/rw.c:291:19: sparse: got unsigned int Force cast it to rwf_t to silence that new sparse warning. Fixes: cf73d9970ea4 ("io_uring: don't use int for ABI") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507032211.PwSNPNSP-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-07io_uring/zcrx: fix pp destruction warningsPavel Begunkov
With multiple page pools and in some other cases we can have allocated niovs on page pool destruction. Remove a misplaced warning checking that all niovs are returned to zcrx on io_pp_zc_destroy(). It was reported before but apparently got lost. Reported-by: Pedro Tammela <pctammela@mojatatu.com> Fixes: 34a3e60821ab9 ("io_uring/zcrx: implement zerocopy receive pp memory provider") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b9e6d919d2964bc48ddbf8eb52fc9f5d118e9bc1.1751878185.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-06Merge branch 'io_uring-6.16' into for-6.17/io_uringJens Axboe
Merge in 6.16 io_uring fixes, to avoid clashes with pending net and settings changes. * io_uring-6.16: io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well io_uring/kbuf: flag partial buffer mappings io_uring/net: mark iov as dynamically allocated even for single segments io_uring: fix resource leak in io_import_dmabuf() io_uring: don't assume uaddr alignment in io_vec_fill_bvec io_uring/rsrc: don't rely on user vaddr alignment io_uring/rsrc: fix folio unpinning io_uring: make fallocate be hashed work
2025-07-02io_uring/rsrc: skip atomic refcount for uncloned buffersCaleb Sander Mateos
io_buffer_unmap() performs an atomic decrement of the io_mapped_ubuf's reference count in case it has been cloned into another io_ring_ctx's registered buffer table. This is an expensive operation and unnecessary in the common case that the io_mapped_ubuf is only registered once. Load the reference count first and check whether it's 1. In that case, skip the atomic decrement and immediately free the io_mapped_ubuf. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250619143435.3474028-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02io_uring/mock: add trivial poll handlerPavel Begunkov
Add a flag that enables polling on the mock file. For now it's trivially says that there is always data available, it'll be extended in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f16de043ec4876d65fae294fc99ade57415fba0c.1750599274.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02io_uring/mock: support for async read/writePavel Begunkov
Let the user to specify a delay to read/write request. io_uring will start a timer, return -EIOCBQUEUED and complete the request asynchronously after the delay pass. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/38f9d2e143fda8522c90a724b74630e68f9bbd16.1750599274.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02io_uring/mock: allow to choose FMODE_NOWAITPavel Begunkov
Add an option to choose whether the file supports FMODE_NOWAIT, that changes the execution path io_uring request takes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e532565b05a05b23589d237c24ee1a3d90c2fd9.1750599274.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02io_uring/mock: add sync read/writePavel Begunkov
Add support for synchronous zero read/write for mock files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/571f3c9fe688e918256a06a722d3db6ced9ca3d5.1750599274.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02io_uring/mock: add cmd using vectored regbufsPavel Begunkov
There is a command api allowing to import vectored registered buffers, add a new mock command that uses the feature and simply copies the specified registered buffer into user space or vice versa. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/229a113fd7de6b27dbef9567f7c0bf4475c9017d.1750599274.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02io_uring/mock: add basic infra for test mock filesPavel Begunkov
io_uring commands provide an ioctl style interface for files to implement file specific operations. io_uring provides many features and advanced api to commands, and it's getting hard to test as it requires specific files/devices. Add basic infrastucture for creating special mock files that will be implementing the cmd api and using various io_uring features we want to test. It'll also be useful to test some more obscure read/write/polling edge cases in the future. Suggested-by: chase xd <sl1589472800@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/93f21b0af58c1367a2b22635d5a7d694ad0272fc.1750599274.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30Merge tag 'io_uring-6.16-20250630' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fix from Jens Axboe: "Now that anonymous inodes set S_IFREG, this breaks the io_uring read/write retries for short reads/writes. As things like timerfd and eventfd are anon inodes, applications that previously did: unsigned long event_data[2]; io_uring_prep_read(sqe, evfd, event_data, sizeof(event_data), 0); and just got a short read when 1 event was posted, will now wait for the full amount before posting a completion. This caused issues for the ghostty application, making it basically unusable due to excessive buffering" * tag 'io_uring-6.16-20250630' of git://git.kernel.dk/linux: io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well
2025-06-29io_uring: gate REQ_F_ISREG on !S_ANON_INODE as wellJens Axboe
io_uring marks a request as dealing with a regular file on S_ISREG. This drives things like retries on short reads or writes, which is generally not expected on a regular file (or bdev). Applications tend to not expect that, so io_uring tries hard to ensure it doesn't deliver short IO on regular files. However, a recent commit added S_IFREG to anonymous inodes. When io_uring is used to read from various things that are backed by anon inodes, like eventfd, timerfd, etc, then it'll now all of a sudden wait for more data when rather than deliver what was read or written in a single operation. This breaks applications that issue reads on anon inodes, if they ask for more data than a single read delivers. Add a check for !S_ANON_INODE as well before setting REQ_F_ISREG to prevent that. Cc: Christian Brauner <brauner@kernel.org> Cc: stable@vger.kernel.org Link: https://github.com/ghostty-org/ghostty/discussions/7720 Fixes: cfd86ef7e8e7 ("anon_inode: use a proper mode internally") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-27Merge tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Two tweaks for a recent fix: fixing a memory leak if multiple iovecs were initially mapped but only the first was used and hence turned into a UBUF rathan than an IOVEC iterator, and catching a case where a retry would be done even if the previous segment wasn't full - Small series fixing an issue making the vm unhappy if debugging is turned on, hitting a VM_BUG_ON_PAGE() - Fix a resource leak in io_import_dmabuf() in the error handling case, which is a regression in this merge window - Mark fallocate as needing to be write serialized, as is already done for truncate and buffered writes * tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linux: io_uring/kbuf: flag partial buffer mappings io_uring/net: mark iov as dynamically allocated even for single segments io_uring: fix resource leak in io_import_dmabuf() io_uring: don't assume uaddr alignment in io_vec_fill_bvec io_uring/rsrc: don't rely on user vaddr alignment io_uring/rsrc: fix folio unpinning io_uring: make fallocate be hashed work
2025-06-26io_uring/kbuf: flag partial buffer mappingsJens Axboe
A previous commit aborted mapping more for a non-incremental ring for bundle peeking, but depending on where in the process this peeking happened, it would not necessarily prevent a retry by the user. That can create gaps in the received/read data. Add struct buf_sel_arg->partial_map, which can pass this information back. The networking side can then map that to internal state and use it to gate retry as well. Since this necessitates a new flag, change io_sr_msg->retry to a retry_flags member, and store both the retry and partial map condition in there. Cc: stable@vger.kernel.org Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-25io_uring/net: mark iov as dynamically allocated even for single segmentsJens Axboe
A bigger array of vecs could've been allocated, but io_ring_buffers_peek() still decided to cap the mapped range depending on how much data was available. Hence don't rely on the segment count to know if the request should be marked as needing cleanup, always check upfront if the iov array is different than the fast_iov array. Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-25io_uring: fix resource leak in io_import_dmabuf()Penglei Jiang
Replace the return statement with setting ret = -EINVAL and jumping to the err label to ensure resources are released via io_release_dmabuf. Fixes: a5c98e942457 ("io_uring/zcrx: dmabuf backed zerocopy receive") Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/r/20250625102703.68336-1-superman.xpt@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24io_uring: don't assume uaddr alignment in io_vec_fill_bvecPavel Begunkov
There is no guaranteed alignment for user pointers. Don't use mask trickery and adjust the offset by bv_offset. Cc: stable@vger.kernel.org Reported-by: David Hildenbrand <david@redhat.com> Fixes: 9ef4cbbcb4ac3 ("io_uring: add infra for importing vectored reg buffers") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/19530391f5c361a026ac9b401ff8e123bde55d98.1750771718.git.asml.silence@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24io_uring/rsrc: don't rely on user vaddr alignmentPavel Begunkov
There is no guaranteed alignment for user pointers, however the calculation of an offset of the first page into a folio after coalescing uses some weird bit mask logic, get rid of it. Cc: stable@vger.kernel.org Reported-by: David Hildenbrand <david@redhat.com> Fixes: a8edbb424b139 ("io_uring/rsrc: enable multi-hugepage buffer coalescing") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/e387b4c78b33f231105a601d84eefd8301f57954.1750771718.git.asml.silence@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24io_uring/rsrc: fix folio unpinningPavel Begunkov
syzbot complains about an unmapping failure: [ 108.070381][ T14] kernel BUG at mm/gup.c:71! [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025 [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work [ 108.174205][ T14] Call trace: [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P) [ 108.178138][ T14] unpin_user_page+0x80/0x10c [ 108.180189][ T14] io_release_ubuf+0x84/0xf8 [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298 [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0 [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480 [ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8 [ 108.193207][ T14] process_one_work+0x7e8/0x155c [ 108.195431][ T14] worker_thread+0x958/0xed8 [ 108.197561][ T14] kthread+0x5fc/0x75c [ 108.199362][ T14] ret_from_fork+0x10/0x20 We can pin a tail page of a folio, but then io_uring will try to unpin the head page of the folio. While it should be fine in terms of keeping the page actually alive, mm folks say it's wrong and triggers a debug warning. Use unpin_user_folio() instead of unpin_user_page*. Cc: stable@vger.kernel.org Debugged-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/683f1551.050a0220.55ceb.0017.GAE@google.com Fixes: a8edbb424b139 ("io_uring/rsrc: enable multi-hugepage buffer coalescing") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/a28b0f87339ac2acf14a645dad1e95bbcbf18acd.1750771718.git.asml.silence@gmail.com/ [axboe: adapt to current tree, massage commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23io_uring/netcmd: add tx timestamping cmd supportPavel Begunkov
Add a new socket command which returns tx time stamps to the user. It provide an alternative to the existing error queue recvmsg interface. The command works in a polled multishot mode, which means io_uring will poll the socket and keep posting timestamps until the request is cancelled or fails in any other way (e.g. with no space in the CQ). It reuses the net infra and grabs timestamps from the socket's error queue. The command requires IORING_SETUP_CQE32. All non-final CQEs (marked with IORING_CQE_F_MORE) have cqe->res set to the tskey, and the upper 16 bits of cqe->flags keep tstype (i.e. offset by IORING_CQE_BUFFER_SHIFT). The timevalue is store in the upper part of the extended CQE. The final completion won't have IORING_CQE_F_MORE and will have cqe->res storing 0/error. Suggested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/92ee66e6b33b8de062a977843d825f58f21ecd37.1750065793.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23io_uring: add mshot helper for posting CQE32Pavel Begunkov
Add a helper for posting 32 byte CQEs in a multishot mode and add a cmd helper on top. As it specifically works with requests, the helper ignore the passed in cqe->user_data and sets it to the one stored in the request. The command helper is only valid with multishot requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c29d7720c16e1f981cfaa903df187138baa3946b.1750065793.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23io_uring/cmd: allow multishot polled commandsPavel Begunkov
Some commands like timestamping in the next patch can make use of multishot polling, i.e. REQ_F_APOLL_MULTISHOT. Add support for that, which is condensed in a single helper called io_cmd_poll_multishot(). The user who wants to continue with a request in a multishot mode must call the function, and only if it returns 0 the user is free to proceed. Apart from normal terminal errors, it can also end up with -EIOCBQUEUED, in which case the user must forward it to the core io_uring. It's forbidden to use task work while the request is executing in a multishot mode. The API is not foolproof, hence it's not exported to modules nor exposed in public headers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bcf97c31659662c72b69fc8fcdf2a88cfc16e430.1750065793.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>