path: root/fs/io_uring.c
2019-10-30  io_uring: ensure we clear io_kiocb->result before each issue  (Jens Axboe)
We use io_kiocb->result == -EAGAIN as a way to know if we need to re-submit a polled request, as -EAGAIN reporting happens out-of-line for IO submission failures. This field is cleared when we originally allocate the request, but it isn't reset when we retry the submission from async context. This can cause issues where we think something needs a re-issue, but we're really just reading stale data. Reset ->result whenever we re-prep a request for polled submission. Cc: stable@vger.kernel.org Fixes: 9e645e1105ca ("io_uring: add support for sqe links") Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
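In sketch form, the fix boils down to a reset on the retry path (the function shown here is illustrative, not the exact upstream helper):

    /* Hypothetical re-prep step for a polled request retried from async
     * context: without this, a stale -EAGAIN left in ->result from the
     * first attempt makes the iopoll side think it must be re-issued. */
    static void io_async_reprep(struct io_kiocb *req)
    {
            req->result = 0;        /* clear stale completion/error state */
    }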
2019-10-28  io_uring: don't touch ctx in setup after ring fd install  (Jens Axboe)
syzkaller reported an issue where it looks like a malicious app can trigger a use-after-free by reading the ctx->sq_array and ->rings values right after the ring fd has been installed in the process file table. Defer ring fd installation until after we're done reading those values. Fixes: 75b28affdd6a ("io_uring: allocate the two rings together") Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-27  io_uring: Fix leaked shadow_req  (Pavel Begunkov)
io_queue_link_head() owns shadow_req after taking it as an argument. If it isn't freed on the error path, the request leaks along with the ctx->refs it took. Reviewed-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25  io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD  (Jens Axboe)
We currently assume that submissions from the sqthread are successful, and if IO polling is enabled, we use that value for knowing how many completions to look for. But if we overflowed the CQ ring or some requests simply got errored and already completed, they won't be available for polling. For the case of IO polling and SQTHREAD usage, look at the pending poll list. If it ever hits empty, then we know that we don't have any more pollable requests inflight. For that case, simply reset the inflight count to zero. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25  io_uring: use cached copies of sq->dropped and cq->overflow  (Jens Axboe)
We currently use the ring values directly, but that can lead to issues if the application is malicious and changes these values on our behalf. Create in-kernel cached versions of them, and just overwrite the user-visible side when we update them. This is similar to how we treat the sq/cq ring tail/head updates. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
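A minimal sketch of the pattern, assuming illustrative field names on the kernel-side context: the authoritative counter lives in kernel memory, and the shared ring copy is write-only from the kernel's point of view:

    /* Sketch: bump the private counter, then mirror it to the shared ring.
     * A malicious app scribbling over rings->cq_overflow can no longer
     * influence kernel decisions, because we never read it back. */
    static void io_cqring_overflow(struct io_ring_ctx *ctx)
    {
            ctx->cached_cq_overflow++;
            WRITE_ONCE(ctx->rings->cq_overflow, ctx->cached_cq_overflow);
    }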
2019-10-25  io_uring: Fix race for sqes with userspace  (Pavel Begunkov)
io_ring_submit() finishes with:
1. io_commit_sqring(), which releases the sqes back to userspace
2. a call to io_queue_link_head(), which accesses the already-released head sqe
Reorder them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25  io_uring: Fix broken links with offloading  (Pavel Begunkov)
io_sq_thread() processes sqes 8 at a time without considering links. As a result, links get randomly subdivided. The easiest way to fix this is to call io_get_sqring() inside io_submit_sqes(), as io_ring_submit() does. Downsides:
1. This removes the optimisation of not grabbing mm_struct for fixed files
2. It submits all sqes in one go, without finer-grained scheduling interleaved with cq processing
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25  io_uring: Fix corrupted user_data  (Pavel Begunkov)
There is a bug where failed linked requests are returned not with the specified @user_data, but with garbage from the kernel stack. The reason is that io_fail_links() uses req->user_data, which is uninitialised when called from io_queue_sqe() on the failure path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23  io_uring: correct timeout req sequence when inserting a new entry  (zhangyi (F))
The sequence number of a timeout req (req->sequence) indicates the completion it is expected to wait for. Because each timeout req consumes a sequence number, the sequence numbers of the timeout reqs on the timeout list should never be the same. But currently we may get the same (and incorrect) number if we insert a new entry before the last one, for example when submitting these two timeout reqs on a fresh ring instance:
                       req->sequence
    req_1 (count = 2):     2
    req_2 (count = 1):     2
Then, if we submit a nop req, req_2 will still time out even though the nop req finished. Fix this by adjusting the sequence number of each reordered req when inserting a new entry. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23  io_uring: correct timeout req sequence when waiting for timeout  (zhangyi (F))
The sequence numbers of the reqs on the timeout_list before the timeout req should be adjusted in io_timeout_fn(), because the current timeout req consumes a slot in the cq_ring and the cq_tail pointer will be increased; otherwise other timeout reqs may complete early, without waiting for enough completions (wait_nr). Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23  io_uring: revert "io_uring: optimize submit_and_wait API"  (Jens Axboe)
There are cases where it isn't always safe to block for submission, even if the caller asked to wait for events as well. Revert the previous optimization of doing that. This reverts two commits: bf7ec93c644cb c576666863b78 Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-18  Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull block fixes from Jens Axboe:
 - NVMe pull request from Keith that addresses deadlocks, double resets, memory leaks, and other regressions
 - Fixup elv_support_iosched() for bio based devices (Damien)
 - Fixup for the ahci PCS quirk (Dan)
 - Socket O_NONBLOCK handling fix for io_uring (me)
 - Timeout sequence io_uring fixes (yangerkun)
 - MD warning fix for parameter default_layout (Song)
 - blkcg activation fixes (Tejun)
 - blk-rq-qos node deletion fix (Tejun)
* tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block:
  nvme-pci: Set the prp2 correctly when using more than 4k page
  io_uring: fix logic error in io_timeout
  io_uring: fix up O_NONBLOCK handling for sockets
  md/raid0: fix warning message for parameter default_layout
  libata/ahci: Fix PCS quirk application
  blk-rq-qos: fix first node deletion of rq_qos_del()
  blkcg: Fix multiple bugs in blkcg_activate_policy()
  io_uring: consider the overflow of sequence for timeout req
  nvme-tcp: fix possible leakage during error flow
  nvmet-loop: fix possible leakage during error flow
  block: Fix elv_support_iosched()
  nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
  nvme: Wait for reset state when required
  nvme: Prevent resets during paused controller state
  nvme: Restart request timers in resetting state
  nvme: Remove ADMIN_ONLY state
  nvme-pci: Free tagset if no IO queues
  nvme: retain split access workaround for capability reads
  nvme: fix possible deadlock when nvme_update_formats fails
2019-10-17  io_uring: fix logic error in io_timeout  (yangerkun)
If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not tmp_nxt. Fixes: 5da0fb1ab34c ("io_uring: consider the overflow of sequence for timeout req") Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-17  io_uring: fix up O_NONBLOCK handling for sockets  (Jens Axboe)
We've got two issues with the non-regular file handling for non-blocking IO:
1) We don't want to re-do a short read in full for a non-regular file, as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts, we need to punt to async context even if the file is opened as non-blocking. Otherwise the caller always gets -EAGAIN.
Add two new request flags to handle these cases. One is just a cache of the inode S_ISREG() status, the other tells io_uring that we always need to punt this request to async context, even if REQ_F_NOWAIT is set. Cc: stable@vger.kernel.org Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
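A sketch of the flag caching described above; the flag and function names here are assumptions for illustration, only the S_ISREG() caching and the always-punt behaviour come from the commit text:

    /* At prep time: remember whether the file is regular, and whether a
     * non-regular file lacks non-blocking support so it must always be
     * punted to async context, even when REQ_F_NOWAIT is set. */
    static void io_prep_rw_flags(struct io_kiocb *req, struct file *file)
    {
            if (S_ISREG(file_inode(file)->i_mode))
                    req->flags |= REQ_F_ISREG;
            else if (!(file->f_mode & FMODE_NOWAIT))
                    req->flags |= REQ_F_MUST_PUNT;
    }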
2019-10-15  io_uring: consider the overflow of sequence for timeout req  (yangerkun)
We now recalculate the sequence of a timeout with 'req->sequence = ctx->cached_sq_head + count - 1', and judge the right place to insert it into the timeout_list by comparing the number of requests we still expect to complete. But we have not considered overflow:
1. ctx->cached_sq_head + count - 1 may overflow, so a bigger count for the new timeout req can yield a smaller req->sequence.
2. The current cached_sq_head may have wrapped around compared with an earlier req, which also leads to a timeout req with a small req->sequence.
This overflow misorders the timeout_list, which in turn makes the timeouts complete in the wrong order. Fix it by reusing req->submit.sequence to store the count, and change the insertion-sort logic in io_timeout. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
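The wraparound itself is plain unsigned arithmetic; a small standalone illustration (values chosen for the example):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            /* cached_sq_head sits just below the u32 limit when the timeout
             * request is queued */
            uint32_t head  = UINT32_MAX - 1;
            uint32_t count = 4;                 /* completions to wait for */
            uint32_t seq   = head + count - 1;  /* wraps around to 1 */

            /* A naive "smaller sequence sorts earlier" rule now places this
             * timeout ahead of much older entries in the timeout_list. */
            printf("head=%u count=%u seq=%u\n", head, count, seq);
            return 0;
    }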
2019-10-13  Merge tag 'for-linus-20191012' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull io_uring fix from Jens Axboe:
 "Single small fix for a regression in the sequence logic for linked commands"
* tag 'for-linus-20191012' of git://git.kernel.dk/linux-block:
  io_uring: fix sequence logic for timeout requests
2019-10-11  Merge tag 'for-linus-20191010' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull block fixes from Jens Axboe:
 - Fix wbt performance regression introduced with the blk-rq-qos refactoring (Harshad)
 - Fix io_uring fileset removal inadvertently killing the workqueue (me)
 - Fix io_uring typo in linked command nonblock submission (Pavel)
 - Remove spurious io_uring wakeups on request free (Pavel)
 - Fix null_blk zoned command error return (Keith)
 - Don't use freezable workqueues for backing_dev, also means we can revert a previous libata hack (Mika)
 - Fix nbd sysfs mutex dropped too soon at removal time (Xiubo)
* tag 'for-linus-20191010' of git://git.kernel.dk/linux-block:
  nbd: fix possible sysfs duplicate warning
  null_blk: Fix zoned command return code
  io_uring: only flush workqueues on fileset removal
  io_uring: remove wait loop spurious wakeups
  blk-wbt: fix performance regression in wbt scale_up/scale_down
  Revert "libata, freezer: avoid block device removal while system is frozen"
  bdi: Do not use freezable workqueue
  io_uring: fix reversed nonblock flag for link submission
2019-10-10  io_uring: fix sequence logic for timeout requests  (Jens Axboe)
We have two ways a request can be deferred:
1) It's a regular request that depends on another one
2) It's a timeout that tracks completions
We have a shared helper to determine whether to defer, and that attempts to make the right decision based on the request. But we only have some of this information in the caller. Un-share the two timeout/defer helpers so the caller can use the right one. Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support") Reported-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-09  io_uring: only flush workqueues on fileset removal  (Jens Axboe)
We should not remove the workqueue, we just need to ensure that the workqueues are synced. The workqueues are torn down on ctx removal. Cc: stable@vger.kernel.org Fixes: 6b06314c47e1 ("io_uring: add file set registration") Reported-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07  io_uring: remove wait loop spurious wakeups  (Pavel Begunkov)
Any changes interesting to tasks waiting in io_cqring_wait() are committed with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs() also tries to do that, but for no reason; that means spurious wakeups on every io_free_req() and io_uring_enter(). Just use percpu_ref_put() instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-04  Merge tag 'for-linus-2019-10-03' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull block fixes from Jens Axboe:
 - Mandate timespec64 for the io_uring timeout ABI (Arnd)
 - Set of NVMe changes via Sagi:
     - controller removal race fix from Balbir
     - quirk additions from Gabriel and Jian-Hong
     - nvme-pci power state save fix from Mario
     - Add 64bit user commands (for 64bit registers) from Marta
     - nvme-rdma/nvme-tcp fixes from Max, Mark and Me
     - Minor cleanups and nits from James, Dan and John
 - Two s390 dasd fixes (Jan, Stefan)
 - Have loop change block size in DIO mode (Martijn)
 - paride pg header ifdef guard (Masahiro)
 - Two blk-mq queue scheduler tweaks, fixing an ordering issue on zoned devices and suboptimal performance on others (Ming)
* tag 'for-linus-2019-10-03' of git://git.kernel.dk/linux-block: (22 commits)
  block: sed-opal: fix sparse warning: convert __be64 data
  block: sed-opal: fix sparse warning: obsolete array init.
  block: pg: add header include guard
  Revert "s390/dasd: Add discard support for ESE volumes"
  s390/dasd: Fix error handling during online processing
  io_uring: use __kernel_timespec in timeout ABI
  loop: change queue block size to match when using DIO
  blk-mq: apply normal plugging for HDD
  blk-mq: honor IO scheduler for multiqueue devices
  nvme-rdma: fix possible use-after-free in connect timeout
  nvme: Move ctrl sqsize to generic space
  nvme: Add ctrl attributes for queue_count and sqsize
  nvme: allow 64-bit results in passthru commands
  nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
  nvmet-tcp: remove superflous check on request sgl
  Added QUIRKs for ADATA XPG SX8200 Pro 512GB
  nvme-rdma: Fix max_hw_sectors calculation
  nvme: fix an error code in nvme_init_subsystem()
  nvme-pci: Save PCI state before putting drive into deepest state
  nvme-tcp: fix wrong stop condition in io_work
  ...
2019-10-04  io_uring: fix reversed nonblock flag for link submission  (Pavel Begunkov)
io_queue_link_head() accepts @force_nonblock flag, but io_ring_submit() passes something opposite. Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-01  io_uring: use __kernel_timespec in timeout ABI  (Arnd Bergmann)
All system calls use struct __kernel_timespec instead of the old struct timespec, but this one was just added with the old-style ABI. Change it now to enforce the use of __kernel_timespec, avoiding ABI confusion and the need for compat handlers on 32-bit architectures. Any user space caller will have to use __kernel_timespec now, but this is unambiguous and works for any C library regardless of the time_t definition. A nicer way to specify the timeout would have been a less ambiguous 64-bit nanosecond value, but I suppose it's too late now to change that as this would impact both 32-bit and 64-bit users. Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
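From userspace the timeout payload now looks like this regardless of the C library's time_t; a minimal sketch that fills a raw IORING_OP_TIMEOUT sqe (liburing later grew helpers for the same thing):

    #include <linux/io_uring.h>
    #include <linux/time_types.h>   /* struct __kernel_timespec */
    #include <string.h>

    /* addr points at the 64-bit timespec, len is 1, and off carries the
     * optional "auto-cancel after N completions" count. */
    static void prep_timeout_sqe(struct io_uring_sqe *sqe,
                                 struct __kernel_timespec *ts, unsigned count)
    {
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode = IORING_OP_TIMEOUT;
            sqe->addr   = (unsigned long) ts;
            sqe->len    = 1;
            sqe->off    = count;
    }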
2019-09-27  Merge tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull more io_uring updates from Jens Axboe:
 "Just two things in here:
  - Improvement to the io_uring CQ ring wakeup for batched IO (me)
  - Fix wrong comparison in poll handling (yangerkun)
  I realize the first one is a little late in the game, but it felt pointless to hold it off until the next release. Went through various testing and reviews with Pavel and peterz"
* tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block:
  io_uring: make CQ ring wakeups be more efficient
  io_uring: compare cached_cq_tail with cq.head in io_uring_poll
2019-09-26  io_uring: make CQ ring wakeups be more efficient  (Jens Axboe)
For batched IO, it's not uncommon for waiters to ask for more than 1 IO to complete before being woken up. This is a problem with wait_event() since tasks will get woken for every IO that completes, re-check condition, then go back to sleep. For batch counts on the order of what you do for high IOPS, that can result in 10s of extra wakeups for the waiting task. Add a private wake function that checks for the wake up count criteria being met before calling autoremove_wake_function(). Pavel reports that one test case he has runs 40% faster with proper batching of wakeups. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
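A sketch of the mechanism (the struct layout below is illustrative, and io_cqring_events() stands in for "number of CQEs currently available"): the waiter's wait_queue_entry carries its batch target, and the wake callback only falls through to autoremove_wake_function() once that target is met:

    struct io_wait_queue {
            struct wait_queue_entry wq;
            struct io_ring_ctx *ctx;
            unsigned to_wait;               /* completions the waiter wants */
    };

    static int io_wake_function(struct wait_queue_entry *curr, unsigned mode,
                                int wake_flags, void *key)
    {
            struct io_wait_queue *iowq;

            iowq = container_of(curr, struct io_wait_queue, wq);
            /* not enough CQEs posted yet: skip this wakeup entirely */
            if (io_cqring_events(iowq->ctx) < iowq->to_wait)
                    return -1;
            return autoremove_wake_function(curr, mode, wake_flags, key);
    }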
2019-09-24  Merge tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull more io_uring updates from Jens Axboe:
 "A collection of later fixes and additions, that weren't quite ready for pushing out with the initial pull request. This contains:
  - Fix potential use-after-free of shadow requests (Jackie)
  - Fix potential OOM crash in request allocation (Jackie)
  - kmalloc+memcpy -> kmemdup cleanup (Jackie)
  - Fix poll crash regression (me)
  - Fix SQ thread not being nice and giving up CPU for !PREEMPT (me)
  - Add support for timeouts, making it easier to do epoll_wait() conversions, for instance (me)
  - Ensure io_uring works without f_ops->read_iter() and f_ops->write_iter() (me)"
* tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block:
  io_uring: correctly handle non ->{read,write}_iter() file_operations
  io_uring: IORING_OP_TIMEOUT support
  io_uring: use cond_resched() in sqthread
  io_uring: fix potential crash issue due to io_get_req failure
  io_uring: ensure poll commands clear ->sqe
  io_uring: fix use-after-free of shadow_req
  io_uring: use kmemdup instead of kmalloc and memcpy
2019-09-24  mm: introduce page_size()  (Matthew Wilcox (Oracle))
Patch series "Make working with compound pages easier", v2. These three patches add three helpers and convert the appropriate places to use them. This patch (of 3): It's unnecessarily hard to find out the size of a potentially huge page. Replace 'PAGE_SIZE << compound_order(page)' with page_size(page). Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24  io_uring: compare cached_cq_tail with cq.head in io_uring_poll  (yangerkun)
After 75b28af ("io_uring: allocate the two rings together"), we compare sq.head with cached_cq_tail to determine whether any completions are pending. Actually, we should be using cq.head. Fixes: 75b28affdd6a ("io_uring: allocate the two rings together") Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-23  io_uring: correctly handle non ->{read,write}_iter() file_operations  (Jens Axboe)
Currently we just -EINVAL a read or write to an fd that isn't backed by ->read_iter() or ->write_iter(). But we can handle them just fine, as long as we punt to async context first. Implement a simple loop function for doing ->read() or ->write() instead, and ensure we call it appropriately. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
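A rough sketch of that fallback loop, under the assumption that it runs from async context where blocking is acceptable (details differ from the upstream implementation):

    /* Emulate ->read_iter()/->write_iter() for files that only provide the
     * old ->read()/->write() hooks, one iovec segment at a time. */
    static ssize_t loop_rw_iter(int rw, struct file *file, loff_t *ppos,
                                struct iov_iter *iter)
    {
            ssize_t ret = 0;

            while (iov_iter_count(iter)) {
                    struct iovec iovec = iov_iter_iovec(iter);
                    ssize_t nr;

                    if (rw == READ)
                            nr = file->f_op->read(file, iovec.iov_base,
                                                  iovec.iov_len, ppos);
                    else
                            nr = file->f_op->write(file, iovec.iov_base,
                                                   iovec.iov_len, ppos);
                    if (nr < 0) {
                            if (!ret)
                                    ret = nr;
                            break;
                    }
                    ret += nr;
                    if (nr != iovec.iov_len)
                            break;          /* short transfer, stop here */
                    iov_iter_advance(iter, nr);
            }
            return ret;
    }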
2019-09-18  io_uring: IORING_OP_TIMEOUT support  (Jens Axboe)
There have been a few requests for functionality similar to io_getevents() and epoll_wait(), where the user can specify a timeout for waiting on events. I deliberately did not add support for this through the system call initially to avoid overloading the args, but I can see that the use cases for this are valid. This adds support for IORING_OP_TIMEOUT. If a user wants to get woken when waiting for events, simply submit one of these timeout commands with your wait call (or before). This ensures that the application sleeping on the CQ ring waiting for events will get woken. The timeout command is passed in as a pointer to a struct timespec. Timeouts are relative. The timeout command also includes a way to auto-cancel after N events have passed. Signed-off-by: Jens Axboe <axboe@kernel.dk>
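From an application's point of view this is just one extra sqe submitted alongside the wait; a small usage sketch, assuming a liburing version with timeout support:

    #include <errno.h>
    #include <liburing.h>

    /* Wake up after at most one second, or once 8 completions have been
     * posted, whichever happens first. */
    static int wait_with_timeout(struct io_uring *ring)
    {
            struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            struct io_uring_cqe *cqe;

            if (!sqe)
                    return -EBUSY;
            io_uring_prep_timeout(sqe, &ts, 8, 0);  /* relative, count = 8 */
            io_uring_submit(ring);
            return io_uring_wait_cqe(ring, &cqe);
    }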
2019-09-19  io_uring: use cond_resched() in sqthread  (Jens Axboe)
If preempt isn't enabled in the kernel, we can run into hang issues with sqthread submissions. Use cond_resched() to play nice instead of cpu_relax(), if we end up starting the loop and not having any events pending for submissions. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-18  io_uring: fix potential crash issue due to io_get_req failure  (Jackie Liu)
Sometimes io_get_req() will return NULL; we then need to do the correct error handling, otherwise it will cause a kernel NULL pointer dereference. Fixes: 4fe2c963154c ("io_uring: add support for link with drain") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-18  io_uring: ensure poll commands clear ->sqe  (Jens Axboe)
If we end up getting woken in poll (due to a signal), then we may need to punt the poll request to an async worker. When we do that, we look up the list to queue at, dereferencing req->submit.sqe; however that is only set for requests we initially decided to queue async. This fixes a crash with poll command usage and wakeups that need to punt to async context. Fixes: 54a91f3bb9b9 ("io_uring: limit parallelism of buffered writes") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-18  io_uring: fix use-after-free of shadow_req  (Jackie Liu)
There is a potential dangling pointer problem: we never clear shadow_req, so if there are multiple link lists in a series of sqes, shadow_req is not reallocated and the last one keeps being used. But its memory has already been released by then, leaving a dangling pointer. Clear it and make sure that every new link list allocates a new shadow_req. Fixes: 4fe2c963154c ("io_uring: add support for link with drain") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-18  io_uring: use kmemdup instead of kmalloc and memcpy  (Jackie Liu)
Just clean up the code, no functional changes. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
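The pattern, in general form (variable names are illustrative):

    /* before */
    sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL);
    if (!sqe_copy)
            return -EAGAIN;
    memcpy(sqe_copy, sqe, sizeof(*sqe_copy));

    /* after: one call with the same semantics */
    sqe_copy = kmemdup(sqe, sizeof(*sqe_copy), GFP_KERNEL);
    if (!sqe_copy)
            return -EAGAIN;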
2019-09-14  io_uring: increase IORING_MAX_ENTRIES to 32K  (Daniel Xu)
Some workloads can require far more than 4K outstanding entries. For example memcached can have ~300K sockets over ~40 cores. Bumping the max to 32K seems to work pretty well. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-12  io_uring: make sqpoll wakeup possible with getevents  (Jens Axboe)
The way the logic is setup in io_uring_enter() means that you can't wake up the SQ poller thread while at the same time waiting (or polling) for completions afterwards. There's no reason for that to be the case. Reported-by: Lewis Baker <lbaker@fb.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
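Concretely, the enter flags can now be combined in a single call; a userspace sketch using the raw syscall:

    #include <linux/io_uring.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* With IORING_SETUP_SQPOLL: wake the SQ poller thread if it went idle
     * and, in the same syscall, wait for at least min_complete completions. */
    static int kick_and_wait(int ring_fd, unsigned min_complete)
    {
            unsigned flags = IORING_ENTER_SQ_WAKEUP | IORING_ENTER_GETEVENTS;

            return syscall(__NR_io_uring_enter, ring_fd, 0, min_complete,
                           flags, NULL, 0);
    }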
2019-09-12  io_uring: extend async work merging  (Jens Axboe)
We currently merge async work items if we see a strict sequential hit. This helps avoid unnecessary workqueue switches when we don't need them. We can extend this merging to cover cases where it's not a strict sequential hit, but the IO still fits within the same page. If an application is doing multiple requests within the same page, we don't want separate workers waiting on the same page to complete IO. It's much faster to let the first worker bring in the page, then operate on that page from the same worker to complete the next request(s). Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-10  io_uring: limit parallelism of buffered writes  (Jens Axboe)
All the popular filesystems need to grab the inode lock for buffered writes. With io_uring punting buffered writes to async context, we observe a lot of contention with all workers hammering this mutex. For buffered writes, we generally don't need a lot of parallelism on the submission side, as the flushing will take care of that for us. Hence we don't need a deep queue on the write side, as long as we can safely punt from the original submission context. Add a workqueue with a limit of 2 that we can use for buffered writes. This greatly improves the performance and efficiency of higher queue depth buffered async writes with io_uring. Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
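The gist of the change, sketched with illustrative names: a second, narrow workqueue reserved for buffered writes, while everything else keeps using the wide default queue:

    /* At ring setup time: the regular async queue stays deep, the buffered
     * write queue is capped at 2 concurrent workers so dozens of them don't
     * just pile up on the same inode lock. */
    ctx->sqo_wq[0] = alloc_workqueue("io_ring-wq",
                                     WQ_UNBOUND | WQ_FREEZABLE,
                                     ctx->sq_entries);
    ctx->sqo_wq[1] = alloc_workqueue("io_ring-write-wq",
                                     WQ_UNBOUND | WQ_FREEZABLE, 2);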
2019-09-10  io_uring: add io_queue_async_work() helper  (Jens Axboe)
Add a helper for queueing a request for async execution, in preparation for optimizing it. No functional change in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-10  io_uring: optimize submit_and_wait API  (Jens Axboe)
For some applications that end up using a submit-and-wait type of approach for certain batches of IO, we can make that a bit more efficient by allowing the application to block for the last IO submission. This avoids an async punt when we don't need it, as the application will be blocking for the completion event(s) anyway. Typical use cases are using the liburing io_uring_submit_and_wait() API, or just using io_uring_enter() doing both submissions and completions. As a specific example, RocksDB doing MultiGet() is sped up quite a bit with this change. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-09  io_uring: add support for link with drain  (Jackie Liu)
To support link with drain, we need to do two things. Consider this series of sqes:

       0     1     2     3     4     5     6
    +-----+-----+-----+-----+-----+-----+-----+
    |  N  |  L  |  L  | L+D |  N  |  N  |  N  |
    +-----+-----+-----+-----+-----+-----+-----+

First, we need to ensure that the io before the link is completed. An easy way is to set the drain flag on the link list's head, so all subsequent io will be inserted into the defer_list:

        +-----+
    (0) |  N  |
        +-----+
           |
                      (2)         (3)         (4)
        +-----+     +-----+     +-----+     +-----+
    (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
        +-----+     +-----+     +-----+     +-----+
           |
        +-----+
    (5) |  N  |
        +-----+
           |
        +-----+
    (6) |  N  |
        +-----+

Second, ensure that the following IO will not be completed first. An easy way is to create a mirror of the drain io and insert it into the defer_list; as long as the drain io is not processed, the following io in the defer_list will not be actively processed either:

        +-----+
    (0) |  N  |
        +-----+
           |
                      (2)         (3)         (4)
        +-----+     +-----+     +-----+     +-----+
    (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
        +-----+     +-----+     +-----+     +-----+
           |
        +-----+
   ('3) |  D  |   <== This is a shadow of (3)
        +-----+
           |
        +-----+
    (5) |  N  |
        +-----+
           |
        +-----+
    (6) |  N  |
        +-----+

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-09  io_uring: fix wrong sequence setting logic  (Jackie Liu)
sqo_thread grabs the sqring in batches, which causes ctx->cached_sq_head to advance in batches. If one of these sqes has the DRAIN flag set, it will never get a chance to be processed, and in the end sqo_thread will not exit. Fixes: de0617e4671 ("io_uring: add support for marking commands as draining") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-06  io_uring: expose single mmap capability  (Jens Axboe)
After commit 75b28affdd6a we can get by with just a single mmap to map both the sq and cq ring. However, userspace doesn't know that. Add a features variable to io_uring_params, and notify userspace that the kernel has this ability. This can then be used in liburing (or in applications directly) to avoid the second mmap. Signed-off-by: Jens Axboe <axboe@kernel.dk>
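Userspace can consume the new capability bit roughly like this (a sketch; the two-mmap fallback path and error handling are omitted):

    #include <linux/io_uring.h>
    #include <sys/mman.h>

    /* If the kernel advertises IORING_FEAT_SINGLE_MMAP, the SQ and CQ rings
     * share one allocation and a single mmap() of IORING_OFF_SQ_RING covers
     * both; otherwise the two rings must be mapped separately. */
    static void *map_rings(int ring_fd, struct io_uring_params *p,
                           size_t sq_sz, size_t cq_sz)
    {
            size_t sz = sq_sz > cq_sz ? sq_sz : cq_sz;

            if (!(p->features & IORING_FEAT_SINGLE_MMAP))
                    return NULL;    /* caller falls back to two mmap() calls */

            return mmap(NULL, sz, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, ring_fd,
                        IORING_OFF_SQ_RING);
    }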
2019-08-27  io_uring: allocate the two rings together  (Hristo Venev)
Both the sq and the cq rings have sizes just over a power of two, and the sq ring is significantly smaller. By bundling them in a single allocation, we get the sq ring for free. This also means that IORING_OFF_SQ_RING and IORING_OFF_CQ_RING now mean the same thing. If we indicate this to userspace, we can save a mmap call. Signed-off-by: Hristo Venev <hristo@venev.name> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27  fs/io_uring.c: convert put_page() to put_user_page*()  (John Hubbard)
For pages that were retained via get_user_pages*(), release those pages via the new put_user_page*() routines, instead of via put_page() or release_pages(). This is part of a tree-wide conversion, as described in commit fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions"). Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-block@vger.kernel.org Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
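The conversion pattern looks roughly like this (put_user_page*() was the API introduced by fc1d8e7cca2d; it has since been renamed to unpin_user_page*()):

    /* before: pages obtained with get_user_pages() dropped one by one */
    for (i = 0; i < nr_pages; i++)
            put_page(pages[i]);

    /* after: use the dedicated release helper for gup-pinned pages */
    for (i = 0; i < nr_pages; i++)
            put_user_page(pages[i]);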
2019-08-22  io_uring: add need_resched() check in inner poll loop  (Jens Axboe)
The outer poll loop checks for whether we need to reschedule, and returns to userspace if we do. However, it's possible to get stuck in the inner loop as well, if the CPU we are running on needs to reschedule to finish the IO work. Add the need_resched() check in the inner loop as well. This fixes a potential hang if the kernel is configured with CONFIG_PREEMPT_VOLUNTARY=y. Reported-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
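In sketch form (the loop structure is simplified, and io_poll_once() is a hypothetical stand-in for one pass over the poll list):

    /* Inner polling loop: besides looking for completed requests, give the
     * CPU back when the scheduler wants it, otherwise the task doing the IO
     * work may never get to run on this CPU and the loop spins forever. */
    while (!list_empty(&ctx->poll_list) && !need_resched()) {
            if (io_poll_once(ctx, &nr_events) < 0)
                    break;
            if (nr_events >= min)
                    break;
    }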
2019-08-20  io_uring: don't enter poll loop if we have CQEs pending  (Jens Axboe)
We need to check if we have CQEs pending before starting a poll loop, as those could be the events we will be spinning for (and hence we'll find none). This can happen if a CQE triggers an error, or if it is found by eg an IRQ before we get a chance to find it through polling. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-20  io_uring: fix potential hang with polled IO  (Jens Axboe)
If a request issue ends up being punted to async context to avoid blocking, we can get into a situation where the original application enters the poll loop for that very request before it has been issued. This should not be an issue, except that the polling will hold the io_uring uring_ctx mutex for the duration of the poll. When the async worker has actually issued the request, it needs to acquire this mutex to add the request to the poll issued list. Since the application polling is already holding this mutex, the workqueue sleeps on the mutex forever, and the application thus never gets a chance to poll for the very request it was interested in. Fix this by ensuring that the polling drops the uring_ctx occasionally if it's not making any progress. Reported-by: Jeffrey M. Birnbaum <jmbnyc@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-15  io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list  (Jackie Liu)
This patch may fix two issues. First, when IOSQE_IO_DRAIN is set, the next IOs need to be inserted into the defer list to delay execution, but a link io will still be actively scheduled to run by calling io_queue_sqe. Second, when multiple LINK_IOs are inserted together with the defer_list, the LINK_IOs no longer keep their order:

    |-------------|
    |   LINK_IO   | ----> insert to defer_list  -----------
    |-------------|                                       |
    |   LINK_IO   | ----> insert to defer_list  ----------|
    |-------------|                                       |
    |   LINK_IO   | ----> insert to defer_list  ----------|
    |-------------|                                       |
    |  NORMAL_IO  | ----> insert to defer_list  ----------|
    |-------------|                                       |
                                                          |
                          queue_work at same time   <-----|

Fixes: 9e645e1105c ("io_uring: add support for sqe links") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>