path: root/io_uring/rw.c
Age         Commit message         Author
2025-05-26  Merge tag 'for-6.16/io_uring-20250523' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull io_uring updates from Jens Axboe:

 - Avoid indirect function calls in io-wq for executing and freeing work. The design of io-wq is such that it can be a generic mechanism, but as it's just used by io_uring now, may as well avoid these indirect calls

 - Clean up registered buffers for networking

 - Add support for IORING_OP_PIPE. Pretty straightforward, allows creating pipes with io_uring, particularly useful for having these be instantiated as direct descriptors

 - Clean up the coalescing support for registered buffers

 - Add support for multiple interface queues for zero-copy rx networking. As this feature was merged for 6.15 it supported just a single ifq per ring

 - Clean up the eventfd support

 - Add dma-buf support to zero-copy rx

 - Clean up and improve the request draining support

 - Clean up provided buffer support, most notably with an eye toward making the legacy support less intrusive

 - Minor fdinfo cleanups, dropping support for dumping what credentials are registered

 - Improve support for overflow CQE handling, getting rid of GFP_ATOMIC for allocating overflow entries where possible

 - Improve detection of cases where io-wq doesn't need to spawn a new worker unnecessarily

 - Various little cleanups

* tag 'for-6.16/io_uring-20250523' of git://git.kernel.dk/linux: (59 commits)
  io_uring/cmd: warn on reg buf imports by ineligible cmds
  io_uring/io-wq: only create a new worker if it can make progress
  io_uring/io-wq: ignore non-busy worker going to sleep
  io_uring/io-wq: move hash helpers to the top
  trace/io_uring: fix io_uring_local_work_run ctx documentation
  io_uring: finish IOU_OK -> IOU_COMPLETE transition
  io_uring: add new helpers for posting overflows
  io_uring: pass in struct io_big_cqe to io_alloc_ocqe()
  io_uring: make io_alloc_ocqe() take a struct io_cqe pointer
  io_uring: split alloc and add of overflow
  io_uring: open code io_req_cqe_overflow()
  io_uring/fdinfo: get rid of dumping credentials
  io_uring/fdinfo: only compile if CONFIG_PROC_FS is set
  io_uring/kbuf: unify legacy buf provision and removal
  io_uring/kbuf: refactor __io_remove_buffers
  io_uring/kbuf: don't compute size twice on prep
  io_uring/kbuf: drop extra vars in io_register_pbuf_ring
  io_uring/kbuf: use mem_is_zero()
  io_uring/kbuf: account ring io_buffer_list memory
  io_uring: drain based on allocates reqs
  ...
2025-05-21  io_uring: finish IOU_OK -> IOU_COMPLETE transition  (Jens Axboe)
IOU_COMPLETE is more descriptive, in that it explicitly says that the return value means "please post a completion for this request". This patch completes the transition from IOU_OK to IOU_COMPLETE, replacing existing IOU_OK users. This is a purely mechanical change. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06  io_uring: enable per-io write streams  (Keith Busch)
Allow userspace to pass a per-I/O write stream in the SQE:

    __u8 write_stream;

The __u8 type matches the size the filesystems and block layer support. Applications can query the supported values from the block device's max_write_streams sysfs attribute. Unsupported values are ignored by file operations that do not support write streams, or rejected with an error by those that do support them. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-7-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
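A rough userspace usage sketch, assuming liburing and direct access to the new SQE field (the stream value, fd and buffer are hypothetical; the field's exact placement in struct io_uring_sqe is per the description above):

    /* issue a write tagged with write stream 2; valid stream values should
     * first be read from the device's max_write_streams sysfs attribute */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, len, offset);
    sqe->write_stream = 2;
    io_uring_submit(&ring);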
2025-04-21  io_uring/kbuf: pass bgid to io_buffer_select()  (Pavel Begunkov)
The current situation with buffer group id juggling is not ideal. req->buf_index first stores the bgid, then it's overwritten by a buffer id, and then it can get restored back on recycling, etc. It's not easy to control, and it's not handled consistently across request types, with receive requests saving and restoring the bgid by hand. This is a prep patch that adds a buffer group id argument to io_buffer_select(). The caller will be responsible for stashing a copy somewhere and passing it into the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a210d6427cc3f4f42271a6853274cd5a50e56820.1743437358.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28  Merge tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull more io_uring updates from Jens Axboe:
 "Final separate updates for io_uring. This started out as a series of cleanups and improvements for registered buffers, but as the last series of the io_uring changes for 6.15, it also collected a few fixes for the other branches on top:

  - Add support for vectored fixed/registered buffers. Previously only single segments had been supported for commands; now vectored variants are supported as well. This series includes networking and file read/write support.

  - Small series unifying return codes across multi and single shot.

  - Small series cleaning up registered buffer importing.

  - Adding support for vectored registered buffers for uring_cmd.

  - Fix for io-wq handling of command reissue.

  - Various little fixes and tweaks"

* tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux: (25 commits)
  io_uring/net: fix io_req_post_cqe abuse by send bundle
  io_uring/net: use REQ_F_IMPORT_BUFFER for send_zc
  io_uring: move min_events sanitisation
  io_uring: rename "min" arg in io_iopoll_check()
  io_uring: open code __io_post_aux_cqe()
  io_uring: defer iowq cqe overflow via task_work
  io_uring: fix retry handling off iowq
  io_uring/net: only import send_zc buffer once
  io_uring/cmd: introduce io_uring_cmd_import_fixed_vec
  io_uring/cmd: add iovec cache for commands
  io_uring/cmd: don't expose entire cmd async data
  io_uring: rename the data cmd cache
  io_uring: rely on io_prep_reg_vec for iovec placement
  io_uring: introduce io_prep_reg_iovec()
  io_uring: unify STOP_MULTISHOT with IOU_OK
  io_uring: return -EAGAIN to continue multishot
  io_uring: cap cached iovec/bvec size
  io_uring/net: implement vectored reg bufs for zctx
  io_uring/net: convert to struct iou_vec
  io_uring/net: pull vec alloc out of msghdr import
  ...
2025-03-26  Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull io_uring updates from Jens Axboe:
 "This is the first of the io_uring pull requests for the 6.15 merge window, there will be others once the net tree has gone in. This contains:

  - Cleanup and unification of cancelation handling across various request types.

  - Improvement for bundles, supporting them both for incrementally consumed buffers, and for non-multishot requests.

  - Enable toggling of using iowait while waiting on io_uring events or not. Unfortunately this is still tied with CPU frequency boosting on short waits, as the scheduler side has not been very receptive to splitting the (useless) iowait stat from the cpufreq implied boost.

  - Add support for kbuf nodes, enabling zero-copy support for the ublk block driver.

  - Various cleanups for resource node handling.

  - Series greatly cleaning up the legacy provided (non-ring based) buffers. For years, we've been pushing the ring provided buffers as the way to go, and that is what people have been using. Reduce the complexity and code associated with legacy provided buffers.

  - Series cleaning up the compat handling.

  - Series improving and cleaning up the recvmsg/sendmsg iovec and msg handling.

  - Series of cleanups for io-wq.

  - Start adding a bunch of selftests. The liburing repository generally carries feature and regression tests for everything, but at least for ublk initially, we'll try and go the route of having it in selftests as well. We'll see how this goes, might decide to migrate more tests this way in the future.

  - Various little cleanups and fixes"

* tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits)
  selftests: ublk: add stripe target
  selftests: ublk: simplify loop io completion
  selftests: ublk: enable zero copy for null target
  selftests: ublk: prepare for supporting stripe target
  selftests: ublk: move common code into common.c
  selftests: ublk: increase max buffer size to 1MB
  selftests: ublk: add single sqe allocator helper
  selftests: ublk: add generic_01 for verifying sequential IO order
  selftests: ublk: fix starting ublk device
  io_uring: enable toggle of iowait usage when waiting on CQEs
  selftests: ublk: fix write cache implementation
  selftests: ublk: add variable for user to not show test result
  selftests: ublk: don't show `modprobe` failure
  selftests: ublk: add one dependency header
  io_uring/kbuf: enable bundles for incrementally consumed buffers
  Revert "io_uring/rsrc: simplify the bvec iter count calculation"
  selftests: ublk: improve test usability
  selftests: ublk: add stress test for covering IO vs. killing ublk server
  selftests: ublk: add one stress test for covering IO vs. removing device
  selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests
  ...
2025-03-10  io_uring: rely on io_prep_reg_vec for iovec placement  (Pavel Begunkov)
All vectored reg buffer users should use io_import_reg_vec() for iovec imports. Since iovec placement is the function's responsibility and callers shouldn't know much about it, drop the offset parameter from io_prep_reg_vec() and calculate it inside. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/08ed87ca4bbc06724373b6ce06f36b703fe60c4e.1741457480.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  io_uring: introduce io_prep_reg_iovec()  (Pavel Begunkov)
iovecs that are turned into registered buffers are imported in a special way with an offset, so that later we can do an in place translation. Add a helper function taking care of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7de2ecb9ed5efc3c5cf320232236966da5ad4ccc.1741457480.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  io_uring: unify STOP_MULTISHOT with IOU_OK  (Pavel Begunkov)
IOU_OK means that the request ownership is now handed back to core io_uring and it has to complete it using the result provided in req->cqe. The same is true for multishot and IOU_STOP_MULTISHOT. Rename it to IOU_COMPLETE to avoid confusion and use it for both modes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e6a5b2edb0eb9558acb1c8f1db38ac45fee95491.1741453534.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  io_uring: return -EAGAIN to continue multishot  (Pavel Begunkov)
Multishot errors can be mapped 1:1 to normal errors, but they are not identical. That leads to a peculiar situation where all multishot requests have to check in what context they're run and return different codes. Unify them, starting with the EAGAIN / IOU_ISSUE_SKIP_COMPLETE(EIOCBQUEUED) pair, which means that core io_uring still owns the request and it should be retried. In the multishot case it naturally just continues to poll; otherwise it might poll, use iowq or do any other kind of allowed blocking. Introduce IOU_RETRY, aliased to -EAGAIN, for that. Apart from the obvious upsides, multishot can now also check for misuse of IOU_ISSUE_SKIP_COMPLETE. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/da117b79ce72ecc3ab488c744e29fae9ba54e23b.1741453534.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07  io_uring: cap cached iovec/bvec size  (Pavel Begunkov)
Bvecs can be large, so put an arbitrary limit on the max vector size the cache can hold. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/823055fa6628daa24bbc9cd77c2da87e9a1e1e32.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07  io_uring/rw: defer reg buf vec import  (Pavel Begunkov)
Import registered buffers for vectored reads and writes later at issue time as we now do for other fixed ops. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e8491c976e4ab83a4e3dc428e9fe7555e59583b8.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07  io_uring/rw: implement vectored registered rw  (Pavel Begunkov)
Implement registered buffer vectored reads and writes with the new opcodes IORING_OP_WRITEV_FIXED and IORING_OP_READV_FIXED. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d7c89eb481e870f598edc91cc66ff4d1e4ae3788.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
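A minimal sketch of what issuing one of these requests might look like from userspace, using liburing's generic prep helper; the use of buf_index to select the registered buffer backing the iovecs is an assumption based on how other fixed-buffer opcodes work, not something stated in this commit:

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    /* vectored read into iovecs that point inside registered buffer 0 */
    io_uring_prep_rw(IORING_OP_READV_FIXED, sqe, fd, iov, nr_vecs, offset);
    sqe->buf_index = 0;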
2025-03-07  io_uring: introduce struct iou_vec  (Pavel Begunkov)
I need a convenient way to pass around and work with an iovec+size pair, so put them into a structure and make use of it in rw.c. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d39fadafc9e9047b0a292e5be6db3cf2f48bb1f7.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
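A sketch of roughly what such a structure looks like; the exact field names and the bvec union member are assumptions based on the description here and on the later bvec caching work:

    struct iou_vec {
        union {
            struct iovec   *iovec;
            struct bio_vec *bvec;
        };
        unsigned nr;    /* number of entries in the vector */
    };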
2025-03-05  io_uring/rw: ensure reissue path is correctly handled for IOPOLL  (Jens Axboe)
The IOPOLL path posts CQEs when the io_kiocb is marked as completed, so it cannot rely on the usual retry that non-IOPOLL requests do for read/write requests. If -EAGAIN is received and the request should be retried, go through the normal completion path and let the normal flush logic catch it and reissue it, like what is done for !IOPOLL reads or writes. Fixes: d803d123948f ("io_uring/rw: handle -EAGAIN retry at IO completion time") Reported-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/io-uring/2b43ccfa-644d-4a09-8f8f-39ad71810f41@oracle.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28  io_uring: add support for kernel registered bvecs  (Keith Busch)
Provide an interface for the kernel to leverage the existing pre-registered buffers that io_uring provides. User space can reference these later to achieve zero-copy IO. User space must register an empty fixed buffer table with io_uring in order for the kernel to make use of it. Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250227223916.143006-5-kbusch@meta.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
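From the userspace side, reserving such a table might look like the sketch below, assuming liburing's sparse-registration helper is the intended way to create an empty fixed buffer table for this use case:

    /* reserve 8 fixed-buffer slots with no buffers attached; a kernel user
     * (e.g. a block driver) can later install its own registered bvecs
     * into these slots for zero-copy IO */
    int ret = io_uring_register_buffers_sparse(&ring, 8);
    if (ret < 0)
        /* handle registration error */;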
2025-02-28  io_uring/rw: move fixed buffer import to issue path  (Keith Busch)
Registered buffers may depend on a linked command, which makes the prep path too early to import. Move to the issue path when the node is actually needed like all the other users of fixed buffers. Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250227223916.143006-3-kbusch@meta.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28  io_uring/rw: move buffer_select outside generic prep  (Keith Busch)
Cleans up the generic rw prep to not require the do_import flag. Use a different prep function for callers that might need buffer select. Based-on-a-patch-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250227223916.143006-2-kbusch@meta.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27  Merge branch 'io_uring-6.14' into for-6.15/io_uring  (Jens Axboe)
Merge mainline fixes into 6.15 branch, as upcoming patches depend on fixes that went into the 6.14 mainline branch.

* io_uring-6.14:
  io_uring/net: save msg_control for compat
  io_uring/rw: clean up mshot forced sync mode
  io_uring/rw: move ki_complete init into prep
  io_uring/rw: don't directly use ki_complete
  io_uring/rw: forbid multishot async reads
  io_uring/rsrc: remove unused constants
  io_uring: fix spelling error in uapi io_uring.h
  io_uring: prevent opcode speculation
  io-wq: backoff when retrying worker creation
2025-02-27  io_uring: combine buffer lookup and import  (Pavel Begunkov)
Registered buffers are currently imported in two steps: first we look up an rsrc node, and then use it to set up the iterator. The first part is usually done at the prep stage, and the import happens whenever it's needed. As we want to defer binding to a node so that it works with linked requests, combine both steps into a single helper. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250224213116.3509093-6-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27  io_uring/rw: open code io_prep_rw_setup()  (Pavel Begunkov)
Open code io_prep_rw_setup() into its only caller, it doesn't provide any meaningful abstraction anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/61ba72e2d46119db71f27ab908018e6a6cd6c064.1740425922.git.asml.silence@gmail.com [axboe: fold in 'ret' being unused fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-25  io_uring/rw: extract helper for iovec import  (Pavel Begunkov)
Split a helper out of __io_import_rw_buffer() that handles vectored buffers. I'll need it for registered vectored buffers, but it also looks cleaner, especially with parameters being properly named. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/075470cfb24be38709d946815f35ec846d966f41.1740425922.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-25  io_uring/rw: rename io_import_iovec()  (Pavel Begunkov)
io_import_iovec() is not limited to iovecs but also imports buffers for normal reads and selected buffers, rename it for clarity. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/91cea59340b61a8f52dc7b8e720274577a25188c.1740425922.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-25  io_uring/rw: allocate async data in io_prep_rw()  (Pavel Begunkov)
rw always allocates async_data, so instead of doing that deeper in prep calls inside of io_prep_rw_setup(), be a bit more explicit and do that early on in io_prep_rw(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5ead621051bc3374d1e8d96f816454906a6afd71.1740425922.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24  io_uring/rw: shrink io_iov_compat_buffer_select_prep  (Pavel Begunkov)
Compat performance is not important and simplicity is more appreciated. Let's not be smart about it and use simpler copy_from_user() instead of access + __get_user pair. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b334a3a5040efa424ded58e4d8a6ef2554324266.1740400452.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24  io_uring/rw: compile out compat param passing  (Pavel Begunkov)
Even when COMPAT is compiled out, we still have to pass ctx->compat to __import_iovec(). Replace the indirect read with a constant when the kernel doesn't support compat. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/2819df9c8533c36b46d7baccbb317a0ec89da6cd.1740400452.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
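A minimal sketch of the compile-out pattern being described; the helper name io_is_compat() is an assumption, not necessarily what this commit adds:

    static inline bool io_is_compat(struct io_ring_ctx *ctx)
    {
        /* with CONFIG_COMPAT disabled this folds to a constant false,
         * so the ctx->compat load disappears entirely */
        return IS_ENABLED(CONFIG_COMPAT) && ctx->compat;
    }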
2025-02-19  io_uring/rw: clean up mshot forced sync mode  (Pavel Begunkov)
Move the code forcing synchronous execution of multishot read requests out of the more generic __io_read(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4ad7b928c776d1ad59addb9fff64ef2d1fc474d5.1739919038.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19  io_uring/rw: move ki_complete init into prep  (Pavel Begunkov)
Initialise ki_complete during request prep stage, we'll depend on it not being reset during issue in the following patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/817624086bd5f0448b08c80623399919fda82f34.1739919038.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19  io_uring/rw: don't directly use ki_complete  (Pavel Begunkov)
We want to avoid checking ->ki_complete directly in the io_uring completion path. Fortunately we have only two callbacks, and the selection between them depends on a constant ring flag, i.e. IOPOLL, so use that to infer the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4eb4bdab8cbcf5bc87083f7047edc81e920ab83c.1739919038.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19  io_uring/rw: forbid multishot async reads  (Pavel Begunkov)
At the moment we can't sanely handle queuing an async request from a multishot context, so disable them. It shouldn't matter, as pollable files / sockets don't normally do async. Patching it in __io_read() is not the cleanest way, but it's simpler than other options, so let's fix it there and clean up on top. Cc: stable@vger.kernel.org Reported-by: chase xd <sl1589472800@gmail.com> Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7d51732c125159d17db4fe16f51ec41b936973f8.1739919038.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17  io_uring: introduce type alias for io_tw_state  (Caleb Sander Mateos)
In preparation for changing how io_tw_state is passed, introduce a type alias io_tw_token_t for struct io_tw_state *. This allows for changing the representation in one place, without having to update the many functions that just forward their struct io_tw_state * argument. Also add a comment to struct io_tw_state to explain its purpose. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250217022511.1150145-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
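The alias described above is essentially a single typedef; a sketch:

    /* an opaque token standing in for 'struct io_tw_state *', so the
     * underlying representation can later be changed in one place */
    typedef struct io_tw_state *io_tw_token_t;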
2025-01-28  io_uring/rw: simplify io_rw_recycle()  (Pavel Begunkov)
Instead of freeing iovecs in case of IO_URING_F_UNLOCKED in io_rw_recycle(), leave it be and rely on the core io_uring code to call io_readv_writev_cleanup() later. This way the iovec will get recycled and we can clean up io_rw_recycle() and kill io_rw_iovec_free(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/14f83b112eb40078bea18e15d77a4f99fc981a44.1738087204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28  io_uring: remove !KASAN guards from cache free  (Pavel Begunkov)
Test setups (with KASAN) will avoid !KASAN sections, and so it's not testing paths that would be exercised otherwise. That's bad as to be sure that your code works you now have to specifically test both KASAN and !KASAN configs. Remove !CONFIG_KASAN guards from io_netmsg_cache_free() and io_rw_cache_free(). The free functions should always be getting valid entries, and even though for KASAN iovecs should already be cleared, that's better than skipping the chunks completely. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/d6078a51c7137a243f9d00849bc3daa660873209.1738087204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-23  io_uring: get rid of alloc cache init_once handling  (Jens Axboe)
init_once is called when an object doesn't come from the cache, and hence needs initial clearing of certain members. While the whole struct could get cleared by memset() in that case, a few of the cache members are large enough that this may cause unnecessary overhead if the caches used aren't large enough to satisfy the workload. For those cases, some churn of kmalloc+kfree is to be expected. Ensure that the 3 users that need clearing put the members they need cleared at the start of the struct, and wrap the rest of the struct in a struct group so the offset is known. While at it, improve the interaction with KASAN such that when/if KASAN writes to members inside the struct that should be retained over caching, it won't trip over itself. For rw and net, the retaining of the iovec over caching is disabled if KASAN is enabled. A helper will free and clear those members in that case. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-20  Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull io_uring updates from Jens Axboe:
 "Not a lot in terms of features this time around, mostly just cleanups and code consolidation:

  - Support for PI meta data read/write via io_uring, with NVMe and SCSI covered

  - Cleanup the per-op structure caching, making it consistent across various command types

  - Consolidate the various user mapped features into a concept called regions, making the various users of that consistent

  - Various cleanups and fixes"

* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
  io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
  io_uring: reuse io_should_terminate_tw() for cmds
  io_uring: Factor out a function to parse restrictions
  io_uring/rsrc: require cloned buffers to share accounting contexts
  io_uring: simplify the SQPOLL thread check when cancelling requests
  io_uring: expose read/write attribute capability
  io_uring/rw: don't gate retry on completion context
  io_uring/rw: handle -EAGAIN retry at IO completion time
  io_uring/rw: use io_rw_recycle() from cleanup path
  io_uring/rsrc: simplify the bvec iter count calculation
  io_uring: ensure io_queue_deferred() is out-of-line
  io_uring/rw: always clear ->bytes_done on io_async_rw setup
  io_uring/rw: use NULL for rw->free_iovec assigment
  io_uring/rw: don't mask in f_iocb_flags
  io_uring/msg_ring: Drop custom destructor
  io_uring: Move old async data allocation helper to header
  io_uring/rw: Allocate async data through helper
  io_uring/net: Allocate msghdr async data through helper
  io_uring/uring_cmd: Allocate async data through generic helper
  io_uring/poll: Allocate apoll with generic alloc_cache helper
  ...
2025-01-10  io_uring/rw: don't gate retry on completion context  (Jens Axboe)
nvme multipath reports that they see spurious -EAGAIN bubbling back to userspace, which is caused by how they handle retries internally through a kworker. However, any data that needs preserving or importing for a read/write request has always been handled at prep time, so we can sanely skip this check. Reported-by: "Haeuptle, Michael" <michael.haeuptle@hpe.com> Link: https://lore.kernel.org/io-uring/DS7PR84MB31105C2C63CFA47BE8CBD6EE95102@DS7PR84MB3110.NAMPRD84.PROD.OUTLOOK.COM/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10  io_uring/rw: handle -EAGAIN retry at IO completion time  (Jens Axboe)
Rather than try and have io_read/io_write turn REQ_F_REISSUE into -EAGAIN, catch the REQ_F_REISSUE when the request is otherwise considered as done. This is saner as we know this isn't happening during an actual submission, and it removes the need to randomly check REQ_F_REISSUE after read/write submission. If REQ_F_REISSUE is set, __io_submit_flush_completions() will skip over this request in terms of posting a CQE, and the regular request cleaning will ensure that it gets reissued via io-wq. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10  io_uring/rw: use io_rw_recycle() from cleanup path  (Jens Axboe)
Cleanup should always have the uring lock held, it's safe to recycle from here. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-28  io_uring/rw: fix downgraded mshot read  (Pavel Begunkov)
The io-wq path can downgrade a multishot request to oneshot mode; however, io_read_mshot() doesn't handle that and would still post multiple CQEs. That's not allowed, because io_req_post_cqe() has stricter context requirements. The described scenario can only happen with pollable files that don't support FMODE_NOWAIT, which is an odd combination, so even if allowed it should be fairly rare. Cc: stable@vger.kernel.org Reported-by: chase xd <sl1589472800@gmail.com> Fixes: bee1d5becdf5b ("io_uring: disable io-wq execution of multishot NOWAIT requests") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c5c8c4a50a882fd581257b81bf52eee260ac29fd.1735407848.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27  io_uring/rw: always clear ->bytes_done on io_async_rw setup  (Jens Axboe)
A previous commit mistakenly moved the clearing of the in-progress byte count into the section that's dependent on having a cached iovec or not, but it should be cleared for any IO. If not, then extra bytes may be added at IO completion time, causing potentially weird behavior like over-reporting the amount of IO done. Fixes: d7f11616edf5 ("io_uring/rw: Allocate async data through helper") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202412271132.a09c3500-lkp@intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27  io_uring/rw: use NULL for rw->free_iovec assigment  (Jens Axboe)
It's a pointer, so don't use 0 for it; sparse throws a warning for that, as the kernel test robot noticed. Fixes: d7f11616edf5 ("io_uring/rw: Allocate async data through helper") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202412180253.YML3qN4d-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27  io_uring/rw: don't mask in f_iocb_flags  (Jens Axboe)
A previous commit changed overwriting kiocb->ki_flags with ->f_iocb_flags to masking it in instead. This breaks retry situations, where we don't necessarily want to retain previously set flags, like IOCB_NOWAIT. The use case needs IOCB_HAS_METADATA to be persistent, but the change makes all flags persistent, which is an issue. Add a request flag to track whether the request has metadata or not, as that is persistent across issues. Fixes: 59a7d12a7fb5 ("io_uring: introduce attributes for read/write and PI support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27  io_uring/rw: Allocate async data through helper  (Gabriel Krisman Bertazi)
This abstracts away the cache details. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-8-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23  io_uring: clean up io_prep_rw_setup()  (David Wei)
Remove unnecessary call to iov_iter_save_state() in io_prep_rw_setup() as io_import_iovec() already does this. Then the result from io_import_iovec() can be returned directly. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Tested-by: Li Zetao <lizetao1@huawei.com> Link: https://lore.kernel.org/r/20241207004144.783631-1-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23  io_uring: introduce attributes for read/write and PI support  (Anuj Gupta)
Add the ability to pass additional attributes along with read/write. Applications can prepare attribute-specific information and pass its address using the SQE field:

    __u64 attr_ptr;

along with setting a mask indicating which attributes are being passed:

    __u64 attr_type_mask;

Overall 64 attributes are allowed and currently one attribute, 'IORING_RW_ATTR_FLAG_PI', is supported.

With the PI attribute, userspace can pass the following information:
 - flags: integrity check flags IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG}
 - len: length of the PI/metadata buffer
 - addr: address of the metadata buffer
 - seed: seed value for reftag remapping
 - app_tag: application-defined 16b value

Process this information to prepare a uio_meta descriptor and pass it down using kiocb->private. The PI attribute is supported only for direct IO. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20241128112240.8867-7-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
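A hypothetical userspace sketch of attaching a PI attribute to a read, assuming a uapi struct io_uring_attr_pi whose fields match the list above (the struct name and exact layout are assumptions; attr_ptr, attr_type_mask, IORING_RW_ATTR_FLAG_PI and the IO_INTEGRITY_CHK_* flags are named in the description):

    struct io_uring_attr_pi pi = {
        .flags   = IO_INTEGRITY_CHK_GUARD,
        .addr    = (__u64)(uintptr_t)md_buf,   /* metadata buffer */
        .len     = md_len,
        .seed    = 0,
        .app_tag = 0,
    };
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, len, offset);      /* fd opened with O_DIRECT */
    sqe->attr_ptr       = (__u64)(uintptr_t)&pi;
    sqe->attr_type_mask = IORING_RW_ATTR_FLAG_PI;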
2024-11-19  Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull timer updates from Thomas Gleixner:
 "A rather large update for timekeeping and timers:

  - The final step to get rid of auto-rearming posix-timers

    posix-timers are currently auto-rearmed by the kernel when the signal of the timer is ignored so that the timer signal can be delivered once the corresponding signal is unignored.

    This requires to throttle the timer to prevent a DoS by small intervals and keeps the system pointlessly out of low power states for no value. This is a long standing non-trivial problem due to the lock order of posix-timer lock and the sighand lock along with life time issues as the timer and the sigqueue have different life time rules.

    Cure this by:

     - Embedding the sigqueue into the timer struct to have the same life time rules. Aside of that this also avoids the lookup of the timer in the signal delivery and rearm path as it's just an always valid container_of() now.

     - Queuing ignored timer signals onto a separate ignored list.

     - Moving queued timer signals onto the ignored list when the signal is switched to SIG_IGN before it could be delivered.

     - Walking the ignored list when SIG_IGN is lifted and requeue the signals to the actual signal lists. This allows the signal delivery code to rearm the timer.

    This also required to consolidate the signal delivery rules so they are consistent across all situations. With that all self test scenarios finally succeed.

  - Core infrastructure for VFS multigrain timestamping

    This is required to allow the kernel to use coarse grained time stamps by default and switch to fine grained time stamps when inode attributes are actively observed via getattr().

    These changes have been provided to the VFS tree as well, so that the VFS specific infrastructure could be built on top.

  - Cleanup and consolidation of the sleep() infrastructure

     - Move all sleep and timeout functions into one file

     - Rework udelay() and ndelay() into proper documented inline functions and replace the hardcoded magic numbers by proper defines.

     - Rework the fsleep() implementation to take the reality of the timer wheel granularity on different HZ values into account. Right now the boundaries are hard coded time ranges which fail to provide the requested accuracy on different HZ settings.

     - Update documentation for all sleep/timeout related functions and fix up stale documentation links all over the place

     - Fixup a few usage sites

  - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks

    A system can have multiple PTP clocks which are participating in separate and independent PTP clock domains. So far the kernel only considers the PTP clock which is based on CLOCK TAI relevant as that's the clock which drives the timekeeping adjustments via the various user space daemons through adjtimex(2).

    The non TAI based clock domains are accessible via the file descriptor based posix clocks, but their usability is very limited. They can't be accessed fast as they always go all the way out to the hardware and they cannot be utilized in the kernel itself.

    As Time Sensitive Networking (TSN) gains traction it is required to provide fast user and kernel space access to these clocks. The approach taken is to utilize the timekeeping and adjtimex(2) infrastructure to provide this access in a similar way how the kernel provides access to clock MONOTONIC, REALTIME etc.

    Instead of creating a duplicated infrastructure this rework converts timekeeping and adjtimex(2) into generic functionality which operates on pointers to data structures instead of using static variables. This allows to provide time accessors and adjtimex(2) functionality for the independent PTP clocks in a subsequent step.

  - Consolidate hrtimer initialization

    hrtimers are set up by initializing the data structure and then separately setting the callback function for historical reasons. That's an extra unnecessary step and makes Rust support less straight forward than it should be.

    Provide a new set of hrtimer_setup*() functions and convert the core code and a few usage sites of the less frequently used interfaces over. The bulk of the hrtimer_init() to hrtimer_setup() conversion is already prepared and scheduled for the next merge window.

  - Drivers:

     - Ensure that the global timekeeping clocksource is utilizing the cluster 0 timer on MIPS multi-cluster systems. Otherwise CPUs on different clusters use their cluster specific clocksource which is not guaranteed to be synchronized with other clusters.

     - Mostly boring cleanups, fixes, improvements and code movement"

* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
  posix-timers: Fix spurious warning on double enqueue versus do_exit()
  clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
  clocksource/drivers/gpx: Remove redundant casts
  clocksource/drivers/timer-ti-dm: Fix child node refcount handling
  dt-bindings: timer: actions,owl-timer: convert to YAML
  clocksource/drivers/ralink: Add Ralink System Tick Counter driver
  clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
  clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
  clocksource/drivers:sp804: Make user selectable
  clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
  hrtimers: Delete hrtimer_init_on_stack()
  alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
  io_uring: Switch to use hrtimer_setup_on_stack()
  sched/idle: Switch to use hrtimer_setup_on_stack()
  hrtimers: Delete hrtimer_init_sleeper_on_stack()
  wait: Switch to use hrtimer_setup_sleeper_on_stack()
  timers: Switch to use hrtimer_setup_sleeper_on_stack()
  net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
  futex: Switch to use hrtimer_setup_sleeper_on_stack()
  fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
  ...
2024-11-18  Merge tag 'for-6.13/io_uring-20241118' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull io_uring updates from Jens Axboe:

 - Cleanups of the eventfd handling code, making it fully private.

 - Support for sending a sync message to another ring, without having a ring available to send a normal async message.

 - Get rid of the separate unlocked hash table, unify everything around the single locked one.

 - Add support for ring resizing. It can be hard to appropriately size the CQ ring upfront, if the application doesn't know how busy it will be. This results in applications sizing rings for the most busy case, which can be wasteful. With ring resizing, they can start small and grow the ring, if needed.

 - Add support for fixed wait regions, rather than needing to copy the same wait data tons of times for each wait operation.

 - Rewrite the resource node handling, which before was serialized per ring. This caused issues with particularly fixed files, where one file waiting on IO could hold up putting and freeing of other unrelated files. Now each node is handled separately. New code is much simpler too, and was a net 250 line reduction in code.

 - Add support for just doing partial buffer clones, rather than always cloning the entire buffer table.

 - Series adding static NAPI support, where a specific NAPI instance is used rather than having a list of them available that need lookup.

 - Add support for mapped regions, and also convert the fixed wait support mentioned above to that concept. This avoids doing special mappings for various planned features, and folds the existing registered wait into that too.

 - Add support for hybrid IO polling, which is a variant of strict IOPOLL but with an initial sleep delay to avoid spinning too early and wasting resources on devices that aren't necessarily in the < 5 usec category wrt latencies.

 - Various cleanups and little fixes.

* tag 'for-6.13/io_uring-20241118' of git://git.kernel.dk/linux: (79 commits)
  io_uring/region: fix error codes after failed vmap
  io_uring: restore back registered wait arguments
  io_uring: add memory region registration
  io_uring: introduce concept of memory regions
  io_uring: temporarily disable registered waits
  io_uring: disable ENTER_EXT_ARG_REG for IOPOLL
  io_uring: fortify io_pin_pages with a warning
  switch io_msg_ring() to CLASS(fd)
  io_uring: fix invalid hybrid polling ctx leaks
  io_uring/uring_cmd: fix buffer index retrieval
  io_uring/rsrc: add & apply io_req_assign_buf_node()
  io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node'
  io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers
  io_uring: avoid normal tw intermediate fallback
  io_uring/napi: add static napi tracking strategy
  io_uring/napi: clean up __io_napi_do_busy_loop
  io_uring/napi: Use lock guards
  io_uring/napi: improve __io_napi_add
  io_uring/napi: fix io_napi_entry RCU accesses
  io_uring/napi: protect concurrent io_napi_entry timeout accesses
  ...
2024-11-18  Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
     - Use uring_cmd helper (Pavel)
     - Host Memory Buffer allocation enhancements (Christoph)
     - Target persistent reservation support (Guixin)
     - Persistent reservation tracing (Guixin)
     - NVMe 2.1 specification support (Keith)
     - Rotational Meta Support (Matias, Wang, Keith)
     - Volatile cache detection enhancement (Guixin)

 - MD updates via Song:
     - Maintainers update
     - raid5 sync IO fix
     - Enhance handling of faulty and blocked devices
     - raid5-ppl atomic improvement
     - md-bitmap fix

 - Support for manually defining embedded partition tables

 - Zone append fixes and cleanups

 - Stop sending the queued requests in the plug list to the driver ->queue_rqs() handler in reverse order.

 - Zoned write plug cleanups

 - Clean up disk stats tracking and add support for disk stats for passthrough IO

 - Add preparatory support for file system atomic writes

 - Add lockdep support for queue freezing. Already found a bunch of issues, and some fixes for that are in here. More will be coming.

 - Fix race between queue stopping/quiescing and IO queueing

 - ublk recovery improvements

 - Fix ublk mmap for 64k pages

 - Various fixes and cleanups

* tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits)
  MAINTAINERS: Update git tree for mdraid subsystem
  block: make struct rq_list available for !CONFIG_BLOCK
  block/genhd: use seq_put_decimal_ull for diskstats decimal values
  block: don't reorder requests in blk_mq_add_to_batch
  block: don't reorder requests in blk_add_rq_to_plug
  block: add a rq_list type
  block: remove rq_list_move
  virtio_blk: reverse request order in virtio_queue_rqs
  nvme-pci: reverse request order in nvme_queue_rqs
  btrfs: validate queue limits
  block: export blk_validate_limits
  nvmet: add tracing of reservation commands
  nvme: parse reservation commands's action and rtype to string
  nvmet: report ns's vwc not present
  md/raid5: Increase r5conf.cache_name size
  block: remove the ioprio field from struct request
  block: remove the write_hint field from struct request
  nvme: check ns's volatile write cache not present
  nvme: add rotational support
  nvme: use command set independent id ns if available
  ...
2024-11-13  block: add a rq_list type  (Christoph Hellwig)
Replace the semi-open coded request list helpers with a proper rq_list type that mirrors the bio_list and has head and tail pointers. Besides better type safety this actually allows to insert at the tail of the list, which will be useful soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
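A sketch of the type being described, with head and tail pointers mirroring bio_list (exact definition assumed):

    struct rq_list {
        struct request *head;
        struct request *tail;
    };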
2024-11-07  io_uring/rsrc: add & apply io_req_assign_buf_node()  (Ming Lei)
The following pattern appears more and more often:

    + io_req_assign_rsrc_node(&req->buf_node, node);
    + req->flags |= REQ_F_BUF_NODE;

so make it a helper, which is less fragile to use than the above code; for example, the BUF_NODE flag is even missed in the current io_uring_cmd_prep(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241107110149.890530-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
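A sketch of the helper, wrapping the quoted pattern directly (placement and exact signature assumed):

    static inline void io_req_assign_buf_node(struct io_kiocb *req,
                                              struct io_rsrc_node *node)
    {
        io_req_assign_rsrc_node(&req->buf_node, node);
        req->flags |= REQ_F_BUF_NODE;
    }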