summaryrefslogtreecommitdiff
path: root/include/uapi/linux/io_uring.h
AgeCommit message (Collapse)Author
2020-09-30io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waitsJens Axboe
When using SQPOLL, applications can run into the issue of running out of SQ ring entries because the thread hasn't consumed them yet. The only option for dealing with that is checking later, or busy checking for the condition. Provide IORING_ENTER_SQ_WAIT if applications want to wait on this condition. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30io_uring: allow disabling rings during the creationStefano Garzarella
This patch adds a new IORING_SETUP_R_DISABLED flag to start the rings disabled, allowing the user to register restrictions, buffers, files, before to start processing SQEs. When IORING_SETUP_R_DISABLED is set, SQE are not processed and SQPOLL kthread is not started. The restrictions registration are allowed only when the rings are disable to prevent concurrency issue while processing SQEs. The rings can be enabled using IORING_REGISTER_ENABLE_RINGS opcode with io_uring_register(2). Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30io_uring: add IOURING_REGISTER_RESTRICTIONS opcodeStefano Garzarella
The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode permanently installs a feature allowlist on an io_ring_ctx. The io_ring_ctx can then be passed to untrusted code with the knowledge that only operations present in the allowlist can be executed. The allowlist approach ensures that new features added to io_uring do not accidentally become available when an existing application is launched on a newer kernel version. Currently is it possible to restrict sqe opcodes, sqe flags, and register opcodes. IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards it is not possible to change restrictions anymore. This prevents untrusted code from removing restrictions. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30io_uring: use an enumeration for io_uring_register(2) opcodesStefano Garzarella
The enumeration allows us to keep track of the last io_uring_register(2) opcode available. Behaviour and opcodes names don't change. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24Merge branch 'io_uring-5.8' into for-5.9/io_uringJens Axboe
Merge in io_uring-5.8 fixes, as changes/cleanups to how we do locked mem accounting require a fixup, and only one of the spots are noticed by git as the other merges cleanly. The flags fix from io_uring-5.8 also causes a merge conflict, the leak fix for recvmsg, the double poll fix, and the link failure locking fix. * io_uring-5.8: io_uring: fix lockup in io_fail_links() io_uring: fix ->work corruption with poll_add io_uring: missed req_init_async() for IOSQE_ASYNC io_uring: always allow drain/link/hardlink/async sqe flags io_uring: ensure double poll additions work with both request types io_uring: fix recvmsg memory leak with buffer selection io_uring: fix not initialised work->flags io_uring: fix missing msg_name assignment io_uring: account user memory freed when exit has been queued io_uring: fix memleak in io_sqe_files_register() io_uring: fix memleak in __io_sqe_files_update() io_uring: export cq overflow status to userspace Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08io_uring: export cq overflow status to userspaceXiaoguang Wang
For those applications which are not willing to use io_uring_enter() to reap and handle cqes, they may completely rely on liburing's io_uring_peek_cqe(), but if cq ring has overflowed, currently because io_uring_peek_cqe() is not aware of this overflow, it won't enter kernel to flush cqes, below test program can reveal this bug: static void test_cq_overflow(struct io_uring *ring) { struct io_uring_cqe *cqe; struct io_uring_sqe *sqe; int issued = 0; int ret = 0; do { sqe = io_uring_get_sqe(ring); if (!sqe) { fprintf(stderr, "get sqe failed\n"); break;; } ret = io_uring_submit(ring); if (ret <= 0) { if (ret != -EBUSY) fprintf(stderr, "sqe submit failed: %d\n", ret); break; } issued++; } while (ret > 0); assert(ret == -EBUSY); printf("issued requests: %d\n", issued); while (issued) { ret = io_uring_peek_cqe(ring, &cqe); if (ret) { if (ret != -EAGAIN) { fprintf(stderr, "peek completion failed: %s\n", strerror(ret)); break; } printf("left requets: %d\n", issued); continue; } io_uring_cqe_seen(ring, cqe); issued--; printf("left requets: %d\n", issued); } } int main(int argc, char *argv[]) { int ret; struct io_uring ring; ret = io_uring_queue_init(16, &ring, 0); if (ret) { fprintf(stderr, "ring setup failed: %d\n", ret); return 1; } test_cq_overflow(&ring); return 0; } To fix this issue, export cq overflow status to userspace by adding new IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21io_uring: change the poll type to be 32-bitsJiufei Xue
poll events should be 32-bits to cover EPOLLEXCLUSIVE. Explicit word-swap the poll32_events for big endian to make sure the ABI is not changed. We call this feature IORING_FEAT_POLL_32BITS, applications who want to use EPOLLEXCLUSIVE should check the feature bit first. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17io_uring: add tee(2) supportPavel Begunkov
Add IORING_OP_TEE implementing tee(2) support. Almost identical to splice bits, but without offsets. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flagsStefano Garzarella
This new flag should be set/clear from the application to disable/enable eventfd notifications when a request is completed and queued to the CQ ring. Before this patch, notifications were always sent if an eventfd is registered, so IORING_CQ_EVENTFD_DISABLED is not set during the initialization. It will be up to the application to set the flag after initialization if no notifications are required at the beginning. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15io_uring: add 'cq_flags' field for the CQ ringStefano Garzarella
This patch adds the new 'cq_flags' field that should be written by the application and read by the kernel. This new field is available to the userspace application through 'cq_off.flags'. We are using 4-bytes previously reserved and set to zero. This means that if the application finds this field to zero, then the new functionality is not supported. In the next patch we will introduce the first flag available. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-21io_uring: make spdxcheck.py happyLukas Bulwahn
Commit bbbdeb4720a0 ("io_uring: dual license io_uring.h uapi header") uses a nested SPDX-License-Identifier to dual license the header. Since then, ./scripts/spdxcheck.py complains: include/uapi/linux/io_uring.h: 1:60 Missing parentheses: OR Add parentheses to make spdxcheck.py happy. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-11io_uring: dual license io_uring.h uapi headerJens Axboe
This just syncs the header it with the liburing version, so there's no confusion on the license of the header parts. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10io_uring: provide means of removing buffersJens Axboe
We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers is to trigger IO on them. The usual case of shrinking a buffer pool would be to just not replenish the buffers when IO completes, and instead just free it. But it may be nice to have a way to manually remove a number of buffers from a given group, and IORING_OP_REMOVE_BUFFERS provides that functionality. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10io_uring: support buffer selection for OP_READ and OP_RECVJens Axboe
If a server process has tons of pending socket connections, generally it uses epoll to wait for activity. When the socket is ready for reading (or writing), the task can select a buffer and issue a recv/send on the given fd. Now that we have fast (non-async thread) support, a task can have tons of pending reads or writes pending. But that means they need buffers to back that data, and if the number of connections is high enough, having them preallocated for all possible connections is unfeasible. With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to use for any request. The request then sets IOSQE_BUFFER_SELECT in the sqe, and a given group ID in sqe->buf_group. When the fd becomes ready, a free buffer from the specified group is selected. If none are available, the request is terminated with -ENOBUFS. If successful, the CQE on completion will contain the buffer ID chosen in the cqe->flags member, encoded as: (buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER; Once a buffer has been consumed by a request, it is no longer available and must be registered again with IORING_OP_PROVIDE_BUFFERS. Requests need to support this feature. For now, IORING_OP_READ and IORING_OP_RECV support it. This is checked on SQE submission, a CQE with res == -EOPNOTSUPP will be posted if attempted on unsupported requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10io_uring: add IORING_OP_PROVIDE_BUFFERSJens Axboe
IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to support passing in an addr/len that is associated with a buffer ID and buffer group ID. The group ID is used to index and lookup the buffers, while the buffer ID can be used to notify the application which buffer in the group was used. The addr passed in is the starting buffer address, and length is each buffer length. A number of buffers to add with can be specified, in which case addr is incremented by length for each addition, and each buffer increments the buffer ID specified. No validation is done of the buffer ID. If the application provides buffers within the same group with identical buffer IDs, then it'll have a hard time telling which buffer ID was used. The only restriction is that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the maximum ID that can be used. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02io_uring: use poll driven retry for files that support itJens Axboe
Currently io_uring tries any request in a non-blocking manner, if it can, and then retries from a worker thread if we get -EAGAIN. Now that we have a new and fancy poll based retry backend, use that to retry requests if the file supports it. This means that, for example, an IORING_OP_RECVMSG on a socket no longer requires an async thread to complete the IO. If we get -EAGAIN reading from the socket in a non-blocking manner, we arm a poll handler for notification on when the socket becomes readable. When it does, the pending read is executed directly by the task again, through the io_uring task work handlers. Not only is this faster and more efficient, it also means we're not generating potentially tons of async threads that just sit and block, waiting for the IO to complete. The feature is marked with IORING_FEAT_FAST_POLL, meaning that async pollable IO is fast, and that poll<link>other_op is fast as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02io_uring: add splice(2) supportPavel Begunkov
Add support for splice(2). - output file is specified as sqe->fd, so it's handled by generic code - hash_reg_file handled by generic code as well - len is 32bit, but should be fine - the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which is a splice flag (i.e. sqe->splice_flags). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-29io_uring: add support for epoll_ctl(2)Jens Axboe
This adds IORING_OP_EPOLL_CTL, which can perform the same work as the epoll_ctl(2) system call. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-28io_uring: support using a registered personality for commandsJens Axboe
For personalities previously registered via IORING_REGISTER_PERSONALITY, allow any command to select them. This is done through setting sqe->personality to the id returned from registration, and then flagging sqe->flags with IOSQE_PERSONALITY. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-28io_uring: allow registering credentialsJens Axboe
If an application wants to use a ring with different kinds of credentials, it can register them upfront. We don't lookup credentials, the credentials of the task calling IORING_REGISTER_PERSONALITY is used. An 'id' is returned for the application to use in subsequent personality support. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-28io_uring: add io-wq workqueue sharingPavel Begunkov
If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to be a valid io_uring fd io-wq of which will be shared with the newly created io_uring instance. If the flag is set but it can't share io-wq, it fails. This allows creation of "sibling" io_urings, where we prefer to keep the SQ/CQ private, but want to share the async backend to minimize the amount of overhead associated with having multiple rings that belong to the same backend. Reported-by: Jens Axboe <axboe@kernel.dk> Reported-by: Daurnimator <quae@daurnimator.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-28io_uring/io-wq: don't use static creds/mm assignmentsJens Axboe
We currently setup the io_wq with a static set of mm and creds. Even for a single-use io-wq per io_uring, this is suboptimal as we have may have multiple enters of the ring. For sharing the io-wq backend, it doesn't work at all. Switch to passing in the creds and mm when the work item is setup. This means that async work is no longer deferred to the io_uring mm and creds, it is done with the current mm and creds. Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know they can rely on the current personality (mm and creds) being the same for direct issue and async issue. Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: optimise sqe-to-req flags translationPavel Begunkov
For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there is a repetitive pattern of their translation: e.g. if (sqe->flags & SQE_FLAG*) req->flags |= REQ_F_FLAG* Use same numeric values/bits for them and copy instead of manual handling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add support for probing opcodesJens Axboe
The application currently has no way of knowing if a given opcode is supported or not without having to try and issue one and see if we get -EINVAL or not. And even this approach is fraught with peril, as maybe we're getting -EINVAL due to some fields being missing, or maybe it's just not that easy to issue that particular command without doing some other leg work in terms of setup first. This adds IORING_REGISTER_PROBE, which fills in a structure with info on what it supported or not. This will work even with sparse opcode fields, which may happen in the future or even today if someone backports specific features to older kernels. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add support for IORING_OP_OPENAT2Jens Axboe
Add support for the new openat2(2) system call. It's trivial to do, as we can have openat(2) just be wrapped around it. Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: enable option to only trigger eventfd for async completionsJens Axboe
If an application is using eventfd notifications with poll to know when new SQEs can be issued, it's expecting the following read/writes to complete inline. And with that, it knows that there are events available, and don't want spurious wakeups on the eventfd for those requests. This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like IORING_REGISTER_EVENTFD, except it only triggers notifications for events that happen from async completions (IRQ, or io-wq worker completions). Any completions inline from the submission itself will not trigger notifications. Suggested-by: Mark Papadakis <markuspapadakis@icloud.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add support for send(2) and recv(2)Jens Axboe
This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for recv(2) support. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add support for IORING_SETUP_CLAMPJens Axboe
Some applications like to start small in terms of ring size, and then ramp up as needed. This is a bit tricky to do currently, since we don't advertise the max ring size. This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring size exceed what we support, then clamp them at the max values instead of returning -EINVAL. Since we return the chosen ring sizes after setup, no further changes are needed on the application side. io_uring already changes the ring sizes if the application doesn't ask for power-of-two sizes, for example. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add IORING_OP_MADVISEJens Axboe
This adds support for doing madvise(2) through io_uring. We assume that any operation can block, and hence punt everything async. This could be improved, but hard to make bullet proof. The async punt ensures it's safe. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add IORING_OP_FADVISEJens Axboe
This adds support for doing fadvise through io_uring. We assume that WILLNEED doesn't block, but that DONTNEED may block. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: allow use of offset == -1 to mean file positionJens Axboe
This behaves like preadv2/pwritev2 with offset == -1, it'll use (and update) the current file position. This obviously comes with the caveat that if the application has multiple read/writes in flight, then the end result will not be as expected. This is similar to threads sharing a file descriptor and doing IO using the current file position. Since this feature isn't easily detectable by doing a read or write, add a feature flags, IORING_FEAT_RW_CUR_POS, to allow applications to detect presence of this feature. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add non-vectored read/write commandsJens Axboe
For uses cases that don't already naturally have an iovec, it's easier (or more convenient) to just use a buffer address + length. This is particular true if the use case is from languages that want to create a memory safe abstraction on top of io_uring, and where introducing the need for the iovec may impose an ownership issue. For those cases, they currently need an indirection buffer, which means allocating data just for this purpose. Add basic read/write that don't require the iovec. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add IOSQE_ASYNCJens Axboe
io_uring defaults to always doing inline submissions, if at all possible. But for larger copies, even if the data is fully cached, that can take a long time. Add an IOSQE_ASYNC flag that the application can set on the SQE - if set, it'll ensure that we always go async for those kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we get the concurrency we desire for this case. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add support for IORING_OP_STATXJens Axboe
This provides support for async statx(2) through io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: avoid ring quiesce for fixed file set unregister and updateJens Axboe
We currently fully quiesce the ring before an unregister or update of the fixed fileset. This is very expensive, and we can be a bit smarter about this. Add a percpu refcount for the file tables as a whole. Grab a percpu ref when we use a registered file, and put it on completion. This is cheap to do. Upon removal of a file from a set, switch the ref count to atomic mode. When we hit zero ref on the completion side, then we know we can drop the previously registered files. When the old files have been dropped, switch the ref back to percpu mode for normal operation. Since there's a period between doing the update and the kernel being done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the same action. The application knows the update has completed when it gets the CQE for it. Between doing the update and receiving this completion, the application must continue to use the unregistered fd if submitting IO on this particular file. This takes the runtime of test/file-register from liburing from 14s to about 0.7s. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add support for IORING_OP_CLOSEJens Axboe
This works just like close(2), unsurprisingly. We remove the file descriptor and post the completion inline, then offload the actual (potential) last file put to async context. Mark the async part of this work as uncancellable, as we really must guarantee that the latter part of the close is run. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add support for IORING_OP_OPENATJens Axboe
This works just like openat(2), except it can be performed async. For the normal case of a non-blocking path lookup this will complete inline. If we have to do IO to perform the open, it'll be done from async context. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: add support for fallocate()Jens Axboe
This exposes fallocate(2) through io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20io_uring: fix compat for IORING_REGISTER_FILES_UPDATEEugene Syromiatnikov
fds field of struct io_uring_files_update is problematic with regards to compat user space, as pointer size is different in 32-bit, 32-on-64-bit, and 64-bit user space. In order to avoid custom handling of compat in the syscall implementation, make fds __u64 and use u64_to_user_ptr in order to retrieve it. Also, align the field naturally and check that no garbage is passed there. Fixes: c3a31e605620c279 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-11io_uring: ensure we return -EINVAL on unknown opcodeJens Axboe
If we submit an unknown opcode and have fd == -1, io_op_needs_file() will return true as we default to needing a file. Then when we go and assign the file, we find the 'fd' invalid and return -EBADF. We really should be returning -EINVAL for that case, as we normally do for unsupported opcodes. Change io_op_needs_file() to have the following return values: 0 - does not need a file 1 - does need a file < 0 - error value and use this to pass back the right value for this invalid case. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io_uring: allow unbreakable linksJens Axboe
Some commands will invariably end in a failure in the sense that the completion result will be less than zero. One such example is timeouts that don't have a completion count set, they will always complete with -ETIME unless cancelled. For linked commands, we sever links and fail the rest of the chain if the result is less than zero. Since we have commands where we know that will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever regardless of the completion result. Note that the link will still sever if we fail submitting the parent request, hard links are only resilient in the presence of completion results for requests that did submit correctly. Cc: stable@vger.kernel.org # v5.4 Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03io_uring: mark us with IORING_FEAT_SUBMIT_STABLEJens Axboe
If this flag is set, applications can be certain that any data for async offload has been consumed when the kernel has consumed the SQE. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25io_uring: add support for IORING_OP_CONNECTJens Axboe
This allows an application to call connect() in an async fashion. Like other opcodes, we first try a non-blocking connect, then punt to async context if we have to. Note that we can still return -EINPROGRESS, and in that case the caller should use IORING_OP_POLL_ADD to do an async wait for completion of the connect request (just like for regular connect(2), except we can do it async here too). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-09io_uring: add support for backlogged CQ ringJens Axboe
Currently we drop completion events, if the CQ ring is full. That's fine for requests with bounded completion times, but it may make it harder or impossible to use io_uring with networked IO where request completion times are generally unbounded. Or with POLL, for example, which is also unbounded. After this patch, we never overflow the ring, we simply store requests in a backlog for later flushing. This flushing is done automatically by the kernel. To prevent the backlog from growing indefinitely, if the backlog is non-empty, we apply back pressure on IO submissions. Any attempt to submit new IO with a non-empty backlog will get an -EBUSY return from the kernel. This is a signal to the application that it has backlogged CQ events, and that it must reap those before being allowed to submit more IO. Note that if we do return -EBUSY, we will have filled whatever backlogged events into the CQ ring first, if there's room. This means the application can safely reap events WITHOUT entering the kernel and waiting for them, they are already available in the CQ ring. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07io_uring: add support for linked SQE timeoutsJens Axboe
While we have support for generic timeouts, we don't have a way to tie a timeout to a specific SQE. The generic timeouts simply trigger wakeups on the CQ ring. This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid as a link to a previous command. The timeout specific can be either relative or absolute, following the same rules as IORING_OP_TIMEOUT. If the timeout triggers before the dependent command completes, it will attempt to cancel that command. Likewise, if the dependent command completes before the timeout triggers, it will cancel the timeout. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-01io_uring: support for generic async request cancelJens Axboe
This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to cancel requests that have been punted to async context and are now in-flight. This works for regular read/write requests to files, as long as they haven't been started yet. For socket based IO (or things like accept4(2)), we can cancel work that is already running as well. To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: add support for IORING_OP_ACCEPTJens Axboe
This allows an application to call accept4() in an async fashion. Like other opcodes, we first try a non-blocking accept, then punt to async context if we have to. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: add support for canceling timeout requestsJens Axboe
We might have cases where the need for a specific timeout is gone, add support for canceling an existing timeout operation. This works like the POLL_REMOVE command, where the application passes in the user_data of the timeout it wishes to cancel in the sqe->addr field. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: add support for absolute timeoutsJens Axboe
This is a pretty trivial addition on top of the relative timeouts we have now, but it's handy for ensuring tighter timing for those that are building scheduling primitives on top of io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: allow application controlled CQ ring sizeJens Axboe
We currently size the CQ ring as twice the SQ ring, to allow some flexibility in not overflowing the CQ ring. This is done because the SQE life time is different than that of the IO request itself, the SQE is consumed as soon as the kernel has seen the entry. Certain application don't need a huge SQ ring size, since they just submit IO in batches. But they may have a lot of requests pending, and hence need a big CQ ring to hold them all. By allowing the application to control the CQ ring size multiplier, we can cater to those applications more efficiently. If an application wants to define its own CQ ring size, it must set IORING_SETUP_CQSIZE in the setup flags, and fill out io_uring_params->cq_entries. The value must be a power of two. Signed-off-by: Jens Axboe <axboe@kernel.dk>