summaryrefslogtreecommitdiff
path: root/fs/io_uring.c
AgeCommit message (Collapse)Author
2019-04-23io_uring: remove 'state' argument from io_{read,write} pathJens Axboe
Since commit 09bb839434b we don't use the state argument for any sort of on-stack caching in the io read and write path. Remove the stale and unused argument from them, and bubble it up to __io_submit_sqe() and down to io_prep_rw(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22io_uring: fix poll full SQ detectionStefan Bühler
io_uring_poll shouldn't signal EPOLLOUT | EPOLLWRNORM if the queue is full; the old check would always signal EPOLLOUT | EPOLLWRNORM (unless there were U32_MAX - 1 entries in the SQ queue). Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22io_uring: fix race condition when sq threads goes sleepingStefan Bühler
Reading the SQ tail needs to come after setting IORING_SQ_NEED_WAKEUP in flags; there is no cheap barrier for ordering a store before a load, a full memory barrier is required. Userspace needs a full memory barrier between updating SQ tail and checking for the IORING_SQ_NEED_WAKEUP too. Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22io_uring: fix race condition reading SQ entriesStefan Bühler
A read memory barrier is required between reading SQ tail and reading the actual data belonging to the SQ entry. Userspace needs a matching write barrier between writing SQ entries and updating SQ tail (using smp_store_release to update tail will do). Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22io_uring: fail io_uring_register(2) on a dying io_uring instanceJens Axboe
If we have multiple threads doing io_uring_register(2) on an io_uring fd, then we can potentially try and kill the percpu reference while someone else has already killed it. Prevent this race by failing io_uring_register(2) if the ref is marked dying. This is safe since we're inside the io_uring mutex. Fixes: b19062a56726 ("io_uring: fix possible deadlock between io_uring_{enter,register}") Reported-by: syzbot <syzbot+10d25e23199614b7721f@syzkaller.appspotmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-17io_uring: fix CQ overflow conditionJens Axboe
This is a leftover from when the rings initially were not free flowing, and hence a test for tail + 1 == head would indicate full. Since we now let them wrap instead of mask them with the size, we need to check if they drift more than the ring size from each other. This fixes a case where we'd overwrite CQ ring entries, if the user failed to reap completions. Both cases would ultimately result in lost completions as the application violated the depth it asked for. The only difference is that before this fix we'd return invalid entries for the overflowed completions, instead of properly flagging it in the cq_ring->overflow variable. Reported-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-15io_uring: fix possible deadlock between io_uring_{enter,register}Jens Axboe
If we have multiple threads, one doing io_uring_enter() while the other is doing io_uring_register(), we can run into a deadlock between the two. io_uring_register() must wait for existing users of the io_uring instance to exit. But it does so while holding the io_uring mutex. Callers of io_uring_enter() may need this mutex to make progress (and eventually exit). If we wait for users to exit in io_uring_register(), we can't do so with the io_uring mutex held without potentially risking a deadlock. Drop the io_uring mutex while waiting for existing callers to exit. This is safe and guaranteed to make forward progress, since we already killed the percpu ref before doing so. Hence later callers of io_uring_enter() will be rejected. Reported-by: syzbot+16dc03452dee970a0c3e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-13io_uring: drop io_file_put() 'file' argumentJens Axboe
Since the fget/fput handling was reworked in commit 09bb839434bd, we never call io_file_put() with state == NULL (and hence file != NULL) anymore. Remove that case. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-13io_uring: only test SQPOLL cpu after we've verified itJens Axboe
We currently call cpu_possible() even if we don't use the CPU. Move the test under the SQ_AFF branch, which is the only place where we'll use the value. Do the cpu_possible() test AFTER we've limited it to a max of NR_CPUS. This avoids triggering the following warning: WARNING: CPU: 1 PID: 7600 at include/linux/cpumask.h:121 cpu_max_bits_warn if CONFIG_DEBUG_PER_CPU_MAPS is enabled. While in there, also move the SQ thread idle period assignment inside SETUP_SQPOLL, as we don't use it otherwise either. Reported-by: syzbot+cd714a07c6de2bc34293@syzkaller.appspotmail.com Fixes: 6c271ce2f1d5 ("io_uring: add submission polling") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-13io_uring: park SQPOLL thread if it's percpuJens Axboe
kthread expects this, or we can throw a warning on exit: WARNING: CPU: 0 PID: 7822 at kernel/kthread.c:399 __kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 7822 Comm: syz-executor030 Not tainted 5.1.0-rc4-next-20190412 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 panic+0x2cb/0x72b kernel/panic.c:214 __warn.cold+0x20/0x46 kernel/panic.c:576 report_bug+0x263/0x2b0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:179 [inline] fixup_bug arch/x86/kernel/traps.c:174 [inline] do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973 RIP: 0010:__kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399 Code: 48 89 fb e8 f7 ab 24 00 4c 89 e6 48 89 df e8 ac e1 02 00 31 ff 49 89 c4 48 89 c6 e8 7f ad 24 00 4d 85 e4 75 15 e8 d5 ab 24 00 <0f> 0b e8 ce ab 24 00 5b 41 5c 41 5d 41 5e 5d c3 e8 c0 ab 24 00 4c RSP: 0018:ffff8880a89bfbb8 EFLAGS: 00010293 RAX: ffff88808ca7a280 RBX: ffff8880a98e4380 RCX: ffffffff814bdd11 RDX: 0000000000000000 RSI: ffffffff814bdd1b RDI: 0000000000000007 RBP: ffff8880a89bfbd8 R08: ffff88808ca7a280 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffffffff87691148 R14: ffff8880a98e43a0 R15: ffffffff81c91e10 __kthread_bind kernel/kthread.c:412 [inline] kthread_unpark+0x123/0x160 kernel/kthread.c:480 kthread_stop+0xfa/0x6c0 kernel/kthread.c:556 io_sq_thread_stop fs/io_uring.c:2057 [inline] io_sq_thread_stop fs/io_uring.c:2052 [inline] io_finish_async+0xab/0x180 fs/io_uring.c:2064 io_ring_ctx_free fs/io_uring.c:2534 [inline] io_ring_ctx_wait_and_kill+0x133/0x510 fs/io_uring.c:2591 io_uring_release+0x42/0x50 fs/io_uring.c:2599 __fput+0x2e5/0x8d0 fs/file_table.c:278 ____fput+0x16/0x20 fs/file_table.c:309 task_work_run+0x14a/0x1c0 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x90a/0x2fa0 kernel/exit.c:876 do_group_exit+0x135/0x370 kernel/exit.c:980 __do_sys_exit_group kernel/exit.c:991 [inline] __se_sys_exit_group kernel/exit.c:989 [inline] __x64_sys_exit_group+0x44/0x50 kernel/exit.c:989 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Reported-by: syzbot+6d4a92619eb0ad08602b@syzkaller.appspotmail.com Fixes: 6c271ce2f1d5 ("io_uring: add submission polling") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-08io_uring: restrict IORING_SETUP_SQPOLL to rootJens Axboe
This options spawns a kernel side thread that will poll for submissions (and completions, if IORING_SETUP_IOPOLL is set). As this allows a user to potentially use more cycles outside of the normal hierarchy, restrict the use of this feature to root. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-03io_uring: fix double free in case of fileset regitration failureJens Axboe
Will Deacon reported the following KASAN complaint: [ 149.890370] ================================================================== [ 149.891266] BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0xa8/0x140 [ 149.892218] [ 149.892411] CPU: 113 PID: 3974 Comm: io_uring_regist Tainted: G B 5.1.0-rc3-00012-g40b114779944 #3 [ 149.893623] Hardware name: linux,dummy-virt (DT) [ 149.894169] Call trace: [ 149.894539] dump_backtrace+0x0/0x228 [ 149.895172] show_stack+0x14/0x20 [ 149.895747] dump_stack+0xe8/0x124 [ 149.896335] print_address_description+0x60/0x258 [ 149.897148] kasan_report_invalid_free+0x78/0xb8 [ 149.897936] __kasan_slab_free+0x1fc/0x228 [ 149.898641] kasan_slab_free+0x10/0x18 [ 149.899283] kfree+0x70/0x1f8 [ 149.899798] io_sqe_files_unregister+0xa8/0x140 [ 149.900574] io_ring_ctx_wait_and_kill+0x190/0x3c0 [ 149.901402] io_uring_release+0x2c/0x48 [ 149.902068] __fput+0x18c/0x510 [ 149.902612] ____fput+0xc/0x18 [ 149.903146] task_work_run+0xf0/0x148 [ 149.903778] do_notify_resume+0x554/0x748 [ 149.904467] work_pending+0x8/0x10 [ 149.905060] [ 149.905331] Allocated by task 3974: [ 149.905934] __kasan_kmalloc.isra.0.part.1+0x48/0xf8 [ 149.906786] __kasan_kmalloc.isra.0+0xb8/0xd8 [ 149.907531] kasan_kmalloc+0xc/0x18 [ 149.908134] __kmalloc+0x168/0x248 [ 149.908724] __arm64_sys_io_uring_register+0x2b8/0x15a8 [ 149.909622] el0_svc_common+0x100/0x258 [ 149.910281] el0_svc_handler+0x48/0xc0 [ 149.910928] el0_svc+0x8/0xc [ 149.911425] [ 149.911696] Freed by task 3974: [ 149.912242] __kasan_slab_free+0x114/0x228 [ 149.912955] kasan_slab_free+0x10/0x18 [ 149.913602] kfree+0x70/0x1f8 [ 149.914118] __arm64_sys_io_uring_register+0xc2c/0x15a8 [ 149.915009] el0_svc_common+0x100/0x258 [ 149.915670] el0_svc_handler+0x48/0xc0 [ 149.916317] el0_svc+0x8/0xc [ 149.916817] [ 149.917101] The buggy address belongs to the object at ffff8004ce07ed00 [ 149.917101] which belongs to the cache kmalloc-128 of size 128 [ 149.919197] The buggy address is located 0 bytes inside of [ 149.919197] 128-byte region [ffff8004ce07ed00, ffff8004ce07ed80) [ 149.921142] The buggy address belongs to the page: [ 149.921953] page:ffff7e0013381f00 count:1 mapcount:0 mapping:ffff800503417c00 index:0x0 compound_mapcount: 0 [ 149.923595] flags: 0x1ffff00000010200(slab|head) [ 149.924388] raw: 1ffff00000010200 dead000000000100 dead000000000200 ffff800503417c00 [ 149.925706] raw: 0000000000000000 0000000080400040 00000001ffffffff 0000000000000000 [ 149.927011] page dumped because: kasan: bad access detected [ 149.927956] [ 149.928224] Memory state around the buggy address: [ 149.929054] ffff8004ce07ec00: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc [ 149.930274] ffff8004ce07ec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 149.931494] >ffff8004ce07ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 149.932712] ^ [ 149.933281] ffff8004ce07ed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 149.934508] ffff8004ce07ee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 149.935725] ================================================================== which is due to a failure in registrering a fileset. This frees the ctx->user_files pointer, but doesn't clear it. When the io_uring instance is later freed through the normal channels, we free this pointer again. At this point it's invalid. Ensure we clear the pointer when we free it for the error case. Reported-by: Will Deacon <will.deacon@arm.com> Tested-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-25io_uring: offload write to async worker in case of -EAGAINRoman Penyaev
In case of direct write -EAGAIN will be returned if page cache was previously populated. To avoid immediate completion of a request with -EAGAIN error write has to be offloaded to the async worker, like io_read() does. Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-25io_uring: fix big-endian compat signal mask handlingArnd Bergmann
On big-endian architectures, the signal masks are differnet between 32-bit and 64-bit tasks, so we have to use a different function for reading them from user space. io_cqring_wait() initially got this wrong, and always interprets this as a native structure. This is ok on x86 and most arm64, but not on s390, ppc64be, mips64be, sparc64 and parisc. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-23Merge tag 'io_uring-20190323' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes and improvements from Jens Axboe: "The first five in this series are heavily inspired by the work Al did on the aio side to fix the races there. The last two re-introduce a feature that was in io_uring before it got merged, but which I pulled since we didn't have a good way to have BVEC iters that already have a stable reference. These aren't necessarily related to block, it's just how io_uring pins fixed buffers" * tag 'io_uring-20190323' of git://git.kernel.dk/linux-block: block: add BIO_NO_PAGE_REF flag iov_iter: add ITER_BVEC_FLAG_NO_REF flag io_uring: mark me as the maintainer io_uring: retry bulk slab allocs as single allocs io_uring: fix poll races io_uring: fix fget/fput handling io_uring: add prepped flag io_uring: make io_read/write return an integer io_uring: use regular request ref counts
2019-03-18iov_iter: add ITER_BVEC_FLAG_NO_REF flagJens Axboe
For ITER_BVEC, if we're holding on to kernel pages, the caller doesn't need to grab a reference to the bvec pages, and drop that same reference on IO completion. This is essentially safe for any ITER_BVEC, but some use cases end up reusing pages and uncondtionally dropping a page reference on completion. And example of that is sendfile(2), that ends up being a splice_in + splice_out on the pipe pages. Add a flag that tells us it's fine to not grab a page reference to the bvec pages, since that caller knows not to drop a reference when it's done with the pages. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-18io_uring: retry bulk slab allocs as single allocsJens Axboe
I've seen cases where bulk alloc fails, since the bulk alloc API is all-or-nothing - either we get the number we ask for, or it returns 0 as number of entries. If we fail a batch bulk alloc, retry a "normal" kmem_cache_alloc() and just use that instead of failing with -EAGAIN. While in there, ensure we use GFP_KERNEL. That was an oversight in the original code, when we switched away from GFP_ATOMIC. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-15io_uring: fix poll racesJens Axboe
This is a straight port of Al's fix for the aio poll implementation, since the io_uring version is heavily based on that. The below description is almost straight from that patch, just modified to fit the io_uring situation. io_poll() has to cope with several unpleasant problems: * requests that might stay around indefinitely need to be made visible for io_cancel(2); that must not be done to a request already completed, though. * in cases when ->poll() has placed us on a waitqueue, wakeup might have happened (and request completed) before ->poll() returns. * worse, in some early wakeup cases request might end up re-added into the queue later - we can't treat "woken up and currently not in the queue" as "it's not going to stick around indefinitely" * ... moreover, ->poll() might have decided not to put it on any queues to start with, and that needs to be distinguished from the previous case * ->poll() might have tried to put us on more than one queue. Only the first will succeed for io poll, so we might end up missing wakeups. OTOH, we might very well notice that only after the wakeup hits and request gets completed (all before ->poll() gets around to the second poll_wait()). In that case it's too late to decide that we have an error. req->woken was an attempt to deal with that. Unfortunately, it was broken. What we need to keep track of is not that wakeup has happened - the thing might come back after that. It's that async reference is already gone and won't come back, so we can't (and needn't) put the request on the list of cancellables. The easiest case is "request hadn't been put on any waitqueues"; we can tell by seeing NULL apt.head, and in that case there won't be anything async. We should either complete the request ourselves (if vfs_poll() reports anything of interest) or return an error. In all other cases we get exclusion with wakeups by grabbing the queue lock. If request is currently on queue and we have something interesting from vfs_poll(), we can steal it and complete the request ourselves. If it's on queue and vfs_poll() has not reported anything interesting, we either put it on the cancellable list, or, if we know that it hadn't been put on all queues ->poll() wanted it on, we steal it and return an error. If it's _not_ on queue, it's either been already dealt with (in which case we do nothing), or there's io_poll_complete_work() about to be executed. In that case we either put it on the cancellable list, or, if we know it hadn't been put on all queues ->poll() wanted it on, simulate what cancel would've done. Fixes: 221c5eb23382 ("io_uring: add support for IORING_OP_POLL") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-15io_uring: fix fget/fput handlingJens Axboe
This isn't a straight port of commit 84c4e1f89fef for aio.c, since io_uring doesn't use files in exactly the same way. But it's pretty close. See the commit message for that commit. This essentially fixes a use-after-free with the poll command handling, but it takes cue from Linus's approach to just simplifying the file handling. We move the setup of the file into a higher level location, so the individual commands don't have to deal with it. And then we release the reference when we free the associated io_kiocb. Fixes: 221c5eb23382 ("io_uring: add support for IORING_OP_POLL") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-14io_uring: add prepped flagJens Axboe
We currently use the fact that if ->ki_filp is already set, then we've done the prep. In preparation for moving the file assignment earlier, use a separate flag to tell whether the request has been prepped for IO or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-14io_uring: make io_read/write return an integerJens Axboe
The callers all convert to an integer, and we only return 0/-ERROR anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-14io_uring: use regular request ref countsJens Axboe
Get rid of the special casing of "normal" requests not having any references to the io_kiocb. We initialize the ref count to 2, one for the submission side, and one or the completion side. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-12mm: refactor readahead defines in mm.hNikolay Borisov
All users of VM_MAX_READAHEAD actually convert it to kbytes and then to pages. Define the macro explicitly as (SZ_128K / PAGE_SIZE). This simplifies the expression in every filesystem. Also rename the macro to VM_READAHEAD_PAGES to properly convey its meaning. Finally remove unused VM_MIN_READAHEAD [akpm@linux-foundation.org: fix fs/io_uring.c, per Stephen] Link: http://lkml.kernel.org/r/20181221144053.24318-1-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: David Howells <dhowells@redhat.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: David Sterba <dsterba@suse.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06io_uring: allow workqueue item to handle multiple buffered requestsJens Axboe
Right now we punt any buffered request that ends up triggering an -EAGAIN to an async workqueue. This works fine in terms of providing async execution of them, but it also can create quite a lot of work queue items. For sequentially buffered IO, it's advantageous to serialize the issue of them. For reads, the first one will trigger a read-ahead, and subsequent request merely end up waiting on later pages to complete. For writes, devices usually respond better to streamed sequential writes. Add state to track the last buffered request we punted to a work queue, and if the next one is sequential to the previous, attempt to get the previous work item to handle it. We limit the number of sequential add-ons to the a multiple (8) of the max read-ahead size of the file. This should be a good number for both reads and wries, as it defines the max IO size the device can do directly. This drastically cuts down on the number of context switches we need to handle buffered sequential IO, and a basic test case of copying a big file with io_uring sees a 5x speedup. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-06io_uring: add support for IORING_OP_POLLJens Axboe
This is basically a direct port of bfe4037e722e, which implements a one-shot poll command through aio. Description below is based on that commit as well. However, instead of adding a POLL command and relying on io_cancel(2) to remove it, we mimic the epoll(2) interface of having a command to add a poll notification, IORING_OP_POLL_ADD, and one to remove it again, IORING_OP_POLL_REMOVE. To poll for a file descriptor the application should submit an sqe of type IORING_OP_POLL. It will poll the fd for the events specified in the poll_events field. Unlike poll or epoll without EPOLLONESHOT this interface always works in one shot mode, that is once the sqe is completed, it will have to be resubmitted. Reviewed-by: Hannes Reinecke <hare@suse.com> Based-on-code-from: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add io_kiocb ref countJens Axboe
We'll use this for the POLL implementation. Regular requests will NOT be using references, so initialize it to 0. Any real use of the io_kiocb ref will initialize it to at least 2. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add submission pollingJens Axboe
This enables an application to do IO, without ever entering the kernel. By using the SQ ring to fill in new sqes and watching for completions on the CQ ring, we can submit and reap IOs without doing a single system call. The kernel side thread will poll for new submissions, and in case of HIPRI/polled IO, it'll also poll for completions. By default, we allow 1 second of active spinning. This can by changed by passing in a different grace period at io_uring_register(2) time. If the thread exceeds this idle time without having any work to do, it will set: sq_ring->flags |= IORING_SQ_NEED_WAKEUP. The application will have to call io_uring_enter() to start things back up again. If IO is kept busy, that will never be needed. Basically an application that has this feature enabled will guard it's io_uring_enter(2) call with: read_barrier(); if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP) io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP); instead of calling it unconditionally. It's mandatory to use fixed files with this feature. Failure to do so will result in the application getting an -EBADF CQ entry when submitting IO. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add file set registrationJens Axboe
We normally have to fget/fput for each IO we do on a file. Even with the batching we do, the cost of the atomic inc/dec of the file usage count adds up. This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes for the io_uring_register(2) system call. The arguments passed in must be an array of __s32 holding file descriptors, and nr_args should hold the number of file descriptors the application wishes to pin for the duration of the io_uring instance (or until IORING_UNREGISTER_FILES is called). When used, the application must set IOSQE_FIXED_FILE in the sqe->flags member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd to the index in the array passed in to IORING_REGISTER_FILES. Files are automatically unregistered when the io_uring instance is torn down. An application need only unregister if it wishes to register a new set of fds. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add support for pre-mapped user IO buffersJens Axboe
If we have fixed user buffers, we can map them into the kernel when we setup the io_uring. That avoids the need to do get_user_pages() for each and every IO. To utilize this feature, the application must call io_uring_register() after having setup an io_uring instance, passing in IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to an iovec array, and the nr_args should contain how many iovecs the application wishes to map. If successful, these buffers are now mapped into the kernel, eligible for IO. To use these fixed buffers, the application must use the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len must point to somewhere inside the indexed buffer. The application may register buffers throughout the lifetime of the io_uring instance. It can call io_uring_register() with IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of buffers, and then register a new set. The application need not unregister buffers explicitly before shutting down the io_uring instance. It's perfectly valid to setup a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine. For now, buffers must not be file backed. If file backed buffers are passed in, the registration will fail with -1/EOPNOTSUPP. This restriction may be relaxed in the future. RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat arbitrary 1G per buffer size is also imposed. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: batch io_kiocb allocationJens Axboe
Similarly to how we use the state->ios_left to know how many references to get to a file, we can use it to allocate the io_kiocb's we need in bulk. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: use fget/fput_many() for file referencesJens Axboe
Add a separate io_submit_state structure, to cache some of the things we need for IO submission. One such example is file reference batching. io_submit_state. We get as many references as the number of sqes we are submitting, and drop unused ones if we end up switching files. The assumption here is that we're usually only dealing with one fd, and if there are multiple, hopefuly they are at least somewhat ordered. Could trivially be extended to cover multiple fds, if needed. On the completion side we do the same thing, except this is trivially done just locally in io_iopoll_reap(). Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: support for IO pollingJens Axboe
Add support for a polled io_uring instance. When a read or write is submitted to a polled io_uring, the application must poll for completions on the CQ ring through io_uring_enter(2). Polled IO may not generate IRQ completions, hence they need to be actively found by the application itself. To use polling, io_uring_setup() must be used with the IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match polled and non-polled IO on an io_uring. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add fsync supportChristoph Hellwig
Add a new fsync opcode, which either syncs a range if one is passed, or the whole file if the offset and length fields are both cleared to zero. A flag is provided to use fdatasync semantics, that is only force out metadata which is required to retrieve the file data, but not others like metadata. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28Add io_uring IO interfaceJens Axboe
The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered, and hence any io_uring_cqe entry can point back to an arbitrary submission. Two new system calls are added for this: io_uring_setup(entries, params) Sets up an io_uring instance for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes. io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. It's valid to set IORING_ENTER_GETEVENTS and 'min_complete' == 0 at the same time, this allows the kernel to return already completed events without waiting for them. This is useful only for polling, as for IRQ driven IO, the application can just check the CQ ring without entering the kernel. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>