summaryrefslogtreecommitdiff
path: root/fs/io_uring.c
AgeCommit message (Collapse)Author
2020-04-09io_uring: punt final io_ring_ctx wait-and-free to workqueueJens Axboe
We can't reliably wait in io_ring_ctx_wait_and_kill(), since the task_works list isn't ordered (in fact it's LIFO ordered). We could either fix this with a separate task_works list for io_uring work, or just punt the wait-and-free to async context. This ensures that task_work that comes in while we're shutting down is processed correctly. If we don't go async, we could have work past the fput() work for the ring that depends on work that won't be executed until after we're done with the wait-and-free. But as this operation is blocking, it'll never get a chance to run. This was reproduced with hundreds of thousands of sockets running memcached, haven't been able to reproduce this synthetically. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-09io_uring: fix fs cleanup on cqe overflowPavel Begunkov
If completion queue overflow occurs, __io_cqring_fill_event() will update req->cflags, which is in a union with req->work and happens to be aliased to req->work.fs. Following io_free_req() -> io_req_work_drop_env() may get a bunch of different problems (miscount fs->users, segfault, etc) on cleaning @fs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08io_uring: don't read user-shared sqe flags twicePavel Begunkov
Don't re-read userspace-shared sqe->flags, it can be exploited. sqe->flags are copied into req->flags in io_submit_sqe(), check them there instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08io_uring: remove req init from io_get_req()Pavel Begunkov
io_get_req() do two different things: io_kiocb allocation and initialisation. Move init part out of it and rename into io_alloc_req(). It's simpler this way and also have better data locality. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08io_uring: alloc req only after getting sqePavel Begunkov
As io_get_sqe() split into 2 stage get/consume, get an sqe before allocating io_kiocb, so no free_req*() for a failure case is needed, and inline back __io_req_do_free(), which has only 1 user. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08io_uring: simplify io_get_sqringPavel Begunkov
Make io_get_sqring() care only about sqes themselves, not initialising the io_kiocb. Also, split it into get + consume, that will be helpful in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08io_uring: do not always copy iovec in io_req_map_rw()Xiaoguang Wang
In io_read_prep() or io_write_prep(), io_req_map_rw() takes struct io_async_rw's fast_iov as argument to call io_import_iovec(), and if io_import_iovec() uses struct io_async_rw's fast_iov as valid iovec array, later indeed io_req_map_rw() does not need to do the memcpy operation, because they are same pointers. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08io_uring: ensure openat sets O_LARGEFILE if neededJens Axboe
OPENAT2 correctly sets O_LARGEFILE if it has to, but that escaped the OPENAT opcode. Dmitry reports that his test case that compares openat() and IORING_OP_OPENAT sees failures on large files: *** sync openat openat succeeded sync write at offset 0 write succeeded sync write at offset 4294967296 write succeeded *** sync openat openat succeeded io_uring write at offset 0 write succeeded io_uring write at offset 4294967296 write succeeded *** io_uring openat openat succeeded sync write at offset 0 write succeeded sync write at offset 4294967296 write failed: File too large *** io_uring openat openat succeeded io_uring write at offset 0 write succeeded io_uring write at offset 4294967296 write failed: File too large Ensure we set O_LARGEFILE, if force_o_largefile() is true. Cc: stable@vger.kernel.org # v5.6 Fixes: 15b71abe7b52 ("io_uring: add support for IORING_OP_OPENAT") Reported-by: Dmitry Kadashev <dkadashev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-07io_uring: initialize fixed_file_data lockXiaoguang Wang
syzbot reports below warning: INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 1 PID: 7099 Comm: syz-executor897 Not tainted 5.6.0-next-20200406-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 assign_lock_key kernel/locking/lockdep.c:913 [inline] register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225 __lock_acquire+0x104/0x4e00 kernel/locking/lockdep.c:4223 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4923 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159 io_sqe_files_register fs/io_uring.c:6599 [inline] __io_uring_register+0x1fe8/0x2f00 fs/io_uring.c:8001 __do_sys_io_uring_register fs/io_uring.c:8081 [inline] __se_sys_io_uring_register fs/io_uring.c:8063 [inline] __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:8063 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x440289 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffff1bbf558 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440289 RDX: 0000000020000280 RSI: 0000000000000002 RDI: 0000000000000003 RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000401b10 R13: 0000000000401ba0 R14: 0000000000000000 R15: 0000000000000000 Initialize struct fixed_file_data's lock to fix this issue. Reported-by: syzbot+e6eeca4a035da76b3065@syzkaller.appspotmail.com Fixes: 055895537302 ("io_uring: refactor file register/unregister/update handling") Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-07io_uring: remove redundant variable pointer nxt and io_wq_assign_next callColin Ian King
An earlier commit "io_uring: remove @nxt from handlers" removed the setting of pointer nxt and now it is always null, hence the non-null check and call to io_wq_assign_next is redundant and can be removed. Addresses-Coverity: ("'Constant' variable guard") Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-05io_uring: fix ctx refcounting in io_submit_sqes()Pavel Begunkov
If io_get_req() fails, it drops a ref. Then, awhile keeping @submitted unmodified, io_submit_sqes() breaks the loop and puts @nr - @submitted refs. For each submitted req a ref is dropped in io_put_req() and friends. So, for @nr taken refs there will be (@nr - @submitted + @submitted + 1) dropped. Remove ctx refcounting from io_get_req(), that at the same time makes it clearer. Fixes: 2b85edfc0c90 ("io_uring: batch getting pcpu references") Cc: stable@vger.kernel.org # v5.6 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03io_uring: process requests completed with -EAGAIN on poll listBijan Mottahedeh
A request that completes with an -EAGAIN result after it has been added to the poll list, will not be removed from that list in io_do_iopoll() because the f_op->iopoll() will not succeed for that request. Maintain a retryable local list similar to the done list, and explicity reissue requests with an -EAGAIN result. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03io_uring: remove bogus RLIMIT_NOFILE check in file registrationJens Axboe
We already checked this limit when the file was opened, and we keep it open in the file table. Hence when we added unit_inflight to the count we want to register, we're doubly accounting these files. This results in -EMFILE for file registration, if we're at half the limit. Cc: stable@vger.kernel.org # v5.1+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03io_uring: use io-wq manager as backup task if task is exitingJens Axboe
If the original task is (or has) exited, then the task work will not get queued properly. Allow for using the io-wq manager task to queue this work for execution, and ensure that the io-wq manager notices and runs this work if woken up (or exiting). Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03io_uring: grab task reference for poll requestsJens Axboe
We can have a task exit if it's not the owner of the ring. Be safe and grab an actual reference to it, to avoid a potential use-after-free. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03io_uring: retry poll if we got woken with non-matching maskJens Axboe
If we get woken and the poll doesn't match our mask, re-add the task to the poll waitqueue and try again instead of completing the request with a mask of 0. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-01io_uring: add missing finish_wait() in io_sq_thread()Hillf Danton
Add it to pair with prepare_to_wait() in an attempt to avoid anything weird in the field. Fixes: b41e98524e42 ("io_uring: add per-task callback handler") Reported-by: syzbot+0c3370f235b74b3cfd97@syzkaller.appspotmail.com Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-31io_uring: refactor file register/unregister/update handlingXiaoguang Wang
While diving into io_uring fileset register/unregister/update codes, we found one bug in the fileset update handling. io_uring fileset update use a percpu_ref variable to check whether we can put the previously registered file, only when the refcnt of the perfcpu_ref variable reaches zero, can we safely put these files. But this doesn't work so well. If applications always issue requests continually, this perfcpu_ref will never have an chance to reach zero, and it'll always be in atomic mode, also will defeat the gains introduced by fileset register/unresiger/update feature, which are used to reduce the atomic operation overhead of fput/fget. To fix this issue, while applications do IORING_REGISTER_FILES or IORING_REGISTER_FILES_UPDATE operations, we allocate a new percpu_ref and kill the old percpu_ref, new requests will use the new percpu_ref. Once all previous old requests complete, old percpu_refs will be dropped and registered files will be put safely. Link: https://lore.kernel.org/io-uring/5a8dac33-4ca2-4847-b091-f7dcd3ad0ff3@linux.alibaba.com/T/#t Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-30Merge tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Here are the io_uring changes for this merge window. Light on new features this time around (just splice + buffer selection), lots of cleanups, fixes, and improvements to existing support. In particular, this contains: - Cleanup fixed file update handling for stack fallback (Hillf) - Re-work of how pollable async IO is handled, we no longer require thread offload to handle that. Instead we rely using poll to drive this, with task_work execution. - In conjunction with the above, allow expendable buffer selection, so that poll+recv (for example) no longer has to be a split operation. - Make sure we honor RLIMIT_FSIZE for buffered writes - Add support for splice (Pavel) - Linked work inheritance fixes and optimizations (Pavel) - Async work fixes and cleanups (Pavel) - Improve io-wq locking (Pavel) - Hashed link write improvements (Pavel) - SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)" * tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits) io_uring: cleanup io_alloc_async_ctx() io_uring: fix missing 'return' in comment io-wq: handle hashed writes in chains io-uring: drop 'free_pfile' in struct io_file_put io-uring: drop completion when removing file io_uring: Fix ->data corruption on re-enqueue io-wq: close cancel gap for hashed linked work io_uring: make spdxcheck.py happy io_uring: honor original task RLIMIT_FSIZE io-wq: hash dependent work io-wq: split hashing and enqueueing io-wq: don't resched if there is no work io-wq: remove duplicated cancel code io_uring: fix truncated async read/readv and write/writev retry io_uring: dual license io_uring.h uapi header io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled io_uring: Fix unused function warnings io_uring: add end-of-bits marker and build time verify it io_uring: provide means of removing buffers io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG ...
2020-03-27io_uring: cleanup io_alloc_async_ctx()Xiaoguang Wang
Cleanup io_alloc_async_ctx() a bit, add a new __io_alloc_async_ctx(), so io_setup_async_rw() won't need to check whether async_ctx is true or false again. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24io_uring: fix missing 'return' in commentChucheng Luo
The missing 'return' work may make it hard for other developers to understand it. Signed-off-by: Chucheng Luo <luochucheng@vivo.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-23io-uring: drop 'free_pfile' in struct io_file_putHillf Danton
Sync removal of file is only used in case of a GFP_KERNEL kmalloc failure at the cost of io_file_put::done and work flush, while a glich like it can be handled at the call site without too much pain. That said, what is proposed is to drop sync removing of file, and the kink in neck as well. Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-23io-uring: drop completion when removing fileHillf Danton
A case of task hung was reported by syzbot, INFO: task syz-executor975:9880 blocked for more than 143 seconds. Not tainted 5.6.0-rc6-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor975 D27576 9880 9878 0x80004000 Call Trace: schedule+0xd0/0x2a0 kernel/sched/core.c:4154 schedule_timeout+0x6db/0xba0 kernel/time/timer.c:1871 do_wait_for_common kernel/sched/completion.c:83 [inline] __wait_for_common kernel/sched/completion.c:104 [inline] wait_for_common kernel/sched/completion.c:115 [inline] wait_for_completion+0x26a/0x3c0 kernel/sched/completion.c:136 io_queue_file_removal+0x1af/0x1e0 fs/io_uring.c:5826 __io_sqe_files_update.isra.0+0x3a1/0xb00 fs/io_uring.c:5867 io_sqe_files_update fs/io_uring.c:5918 [inline] __io_uring_register+0x377/0x2c00 fs/io_uring.c:7131 __do_sys_io_uring_register fs/io_uring.c:7202 [inline] __se_sys_io_uring_register fs/io_uring.c:7184 [inline] __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:7184 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe and bisect pointed to 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update"). It is down to the order that we wait for work done before flushing it while nobody is likely going to wake us up. We can drop that completion on stack as flushing work itself is a sync operation we need and no more is left behind it. To that end, io_file_put::done is re-used for indicating if it can be freed in the workqueue worker context. Reported-and-Inspired-by: syzbot <syzbot+538d1957ce178382a394@syzkaller.appspotmail.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Rename ->done to ->free_pfile Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22io_uring: Fix ->data corruption on re-enqueuePavel Begunkov
work->data and work->list are shared in union. io_wq_assign_next() sets ->data if a req having a linked_timeout, but then io-wq may want to use work->list, e.g. to do re-enqueue of a request, so corrupting ->data. ->data is not necessary, just remove it and extract linked_timeout through @link_list. Fixes: 60cf46ae6054 ("io-wq: hash dependent work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-20io_uring: honor original task RLIMIT_FSIZEJens Axboe
With the previous fixes for number of files open checking, I added some debug code to see if we had other spots where we're checking rlimit() against the async io-wq workers. The only one I found was file size checking, which we should also honor. During write and fallocate prep, store the max file size and override that for the current ask if we're in io-wq worker context. Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-20io_uring: make sure accept honor rlimit nofileJens Axboe
Just like commit 4022e7af86be, this fixes the fact that IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks current->signal->rlim[] for limits. Add an extra argument to __sys_accept4_file() that allows us to pass in the proper nofile limit, and grab it at request prep time. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-20io_uring: make sure openat/openat2 honor rlimit nofileJens Axboe
Dmitry reports that a test case shows that io_uring isn't honoring a modified rlimit nofile setting. get_unused_fd_flags() checks the task signal->rlimi[] for the limits. As this isn't easily inheritable, provide a __get_unused_fd_flags() that takes the value instead. Then we can grab it when the request is prepared (from the original task), and pass that in when we do the async part part of the open. Reported-by: Dmitry Kadashev <dkadashev@gmail.com> Tested-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-14io-wq: split hashing and enqueueingPavel Begunkov
It's a preparation patch removing io_wq_enqueue_hashed(), which now should be done by io_wq_hash_work() + io_wq_enqueue(). Also, set hash value for dependant works, and do it as late as possible, because req->file can be unavailable before. This hash will be ignored by io-wq. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-14io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}Pavel Begunkov
Processing links, io_submit_sqe() prepares requests, drops sqes, and passes them with sqe=NULL to io_queue_sqe(). There IOSQE_DRAIN and/or IOSQE_ASYNC requests will go through the same prep, which doesn't expect sqe=NULL and fail with NULL pointer deference. Always do full prepare including io_alloc_async_ctx() for linked requests, and then it can skip the second preparation. Cc: stable@vger.kernel.org # 5.5 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-11io_uring: fix truncated async read/readv and write/writev retryJens Axboe
Ensure we keep the truncated value, if we did truncate it. If not, we might read/write more than the registered buffer size. Also for retry, ensure that we return the truncated mapped value for the vectorized versions of the read/write commands. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-11io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabledXiaoguang Wang
When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need to do io completion events polling again, they can rely on io_sq_thread to do polling work, which can reduce cpu usage and uring_lock contention. I modify fio io_uring engine codes a bit to evaluate the performance: static int fio_ioring_getevents(struct thread_data *td, unsigned int min, continue; } - if (!o->sqpoll_thread) { + if (o->sqpoll_thread && o->hipri) { r = io_uring_enter(ld, 0, actual_min, IORING_ENTER_GETEVENTS); if (r < 0) { and use "fio -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread -rw=read -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=1 -time_based -runtime=120" original codes -------------------------------------------------------------------- iodepth | 4 | 8 | 16 | 32 | 64 bw | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s fio cpu usage | 100% | 100% | 100% | 100% | 100% -------------------------------------------------------------------- with patch -------------------------------------------------------------------- iodepth | 4 | 8 | 16 | 32 | 64 bw | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s fio cpu usage | 63.8% | 74.4%% | 81.1% | 83.7% | 82.4% -------------------------------------------------------------------- bw improve | 5.5% | 13.2% | 12.3% | 9.8% | 11.5% -------------------------------------------------------------------- From above test results, we can see that bw has above 5.5%~13% improvement, and fio process's cpu usage also drops much. Note this won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL are both enabled, in this case, io_sq_thread always has 100% cpu usage. I think this patch will be friendly to applications which will often use io_uring_wait_cqe() or similar from liburing. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10io_uring: Fix unused function warningsYueHaibing
If CONFIG_NET is not set, gcc warns: fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function] static int io_setup_async_msg(struct io_kiocb *req, ^~~~~~~~~~~~~~~~~~ There are many funcions wraped by CONFIG_NET, move them together to simplify code, also fix this warning. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Minor tweaks. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10io_uring: add end-of-bits marker and build time verify itJens Axboe
Not easy to tell if we're going over the size of bits we can shove in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON() check for it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10io_uring: provide means of removing buffersJens Axboe
We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers is to trigger IO on them. The usual case of shrinking a buffer pool would be to just not replenish the buffers when IO completes, and instead just free it. But it may be nice to have a way to manually remove a number of buffers from a given group, and IORING_OP_REMOVE_BUFFERS provides that functionality. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSGJens Axboe
Like IORING_OP_READV, this is limited to supporting just a single segment in the iovec passed in. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READVJens Axboe
This adds support for the vectored read. This is limited to supporting just 1 segment in the iov, and is provided just for convenience for applications that use IORING_OP_READV already. The iov helpers will be used for IORING_OP_RECVMSG as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10io_uring: support buffer selection for OP_READ and OP_RECVJens Axboe
If a server process has tons of pending socket connections, generally it uses epoll to wait for activity. When the socket is ready for reading (or writing), the task can select a buffer and issue a recv/send on the given fd. Now that we have fast (non-async thread) support, a task can have tons of pending reads or writes pending. But that means they need buffers to back that data, and if the number of connections is high enough, having them preallocated for all possible connections is unfeasible. With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to use for any request. The request then sets IOSQE_BUFFER_SELECT in the sqe, and a given group ID in sqe->buf_group. When the fd becomes ready, a free buffer from the specified group is selected. If none are available, the request is terminated with -ENOBUFS. If successful, the CQE on completion will contain the buffer ID chosen in the cqe->flags member, encoded as: (buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER; Once a buffer has been consumed by a request, it is no longer available and must be registered again with IORING_OP_PROVIDE_BUFFERS. Requests need to support this feature. For now, IORING_OP_READ and IORING_OP_RECV support it. This is checked on SQE submission, a CQE with res == -EOPNOTSUPP will be posted if attempted on unsupported requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10io_uring: add IORING_OP_PROVIDE_BUFFERSJens Axboe
IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to support passing in an addr/len that is associated with a buffer ID and buffer group ID. The group ID is used to index and lookup the buffers, while the buffer ID can be used to notify the application which buffer in the group was used. The addr passed in is the starting buffer address, and length is each buffer length. A number of buffers to add with can be specified, in which case addr is incremented by length for each addition, and each buffer increments the buffer ID specified. No validation is done of the buffer ID. If the application provides buffers within the same group with identical buffer IDs, then it'll have a hard time telling which buffer ID was used. The only restriction is that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the maximum ID that can be used. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-08io_uring: ensure RCU callback ordering with rcu_barrier()Jens Axboe
After more careful studying, Paul informs me that we cannot rely on ordering of RCU callbacks in the way that the the tagged commit did. The current construct looks like this: void C(struct rcu_head *rhp) { do_something(rhp); call_rcu(&p->rh, B); } call_rcu(&p->rh, A); call_rcu(&p->rh, C); and we're relying on ordering between A and B, which isn't guaranteed. Make this explicit instead, and have a work item issue the rcu_barrier() to ensure that A has run before we manually execute B. While thorough testing never showed this issue, it's dependent on the per-cpu load in terms of RCU callbacks. The updated method simplifies the code as well, and eliminates the need to maintain an rcu_head in the fileset data. Fixes: c1e2148f8ecb ("io_uring: free fixed_file_data after RCU grace period") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-07io_uring: fix lockup with timeoutsPavel Begunkov
There is a recipe to deadlock the kernel: submit a timeout sqe with a linked_timeout (e.g. test_single_link_timeout_ception() from liburing), and SIGKILL the process. Then, io_kill_timeouts() takes @ctx->completion_lock, but the timeout isn't flagged with REQ_F_COMP_LOCKED, and will try to double grab it during io_put_free() to cancel the linked timeout. Probably, the same can happen with another io_kill_timeout() call site, that is io_commit_cqring(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-06io_uring: free fixed_file_data after RCU grace periodJens Axboe
The percpu refcount protects this structure, and we can have an atomic switch in progress when exiting. This makes it unsafe to just free the struct normally, and can trigger the following KASAN warning: BUG: KASAN: use-after-free in percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 Read of size 1 at addr ffff888181a19a30 by task swapper/0/0 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc4+ #5747 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: <IRQ> dump_stack+0x76/0xa0 print_address_description.constprop.0+0x3b/0x60 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 __kasan_report.cold+0x1a/0x3d ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 rcu_core+0x370/0x830 ? percpu_ref_exit+0x50/0x50 ? rcu_note_context_switch+0x7b0/0x7b0 ? run_rebalance_domains+0x11d/0x140 __do_softirq+0x10a/0x3e9 irq_exit+0xd5/0xe0 smp_apic_timer_interrupt+0x86/0x200 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:default_idle+0x26/0x1f0 Fix this by punting the final exit and free of the struct to RCU, then we know that it's safe to do so. Jann suggested the approach of using a double rcu callback to achieve this. It's important that we do a nested call_rcu() callback, as otherwise the free could be ordered before the atomic switch, even if the latter was already queued. Reported-by: syzbot+e017e49c39ab484ac87a@syzkaller.appspotmail.com Suggested-by: Jann Horn <jannh@google.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-04io_uring: buffer registration infrastructureJens Axboe
This just prepares the ring for having lists of buffers associated with it, that the application can provide for SQEs to consume instead of providing their own. The buffers are organized by group ID. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-04io_uring/io-wq: forward submission ref to asyncPavel Begunkov
First it changes io-wq interfaces. It replaces {get,put}_work() with free_work(), which guaranteed to be called exactly once. It also enforces free_work() callback to be non-NULL. io_uring follows the changes and instead of putting a submission reference in io_put_req_async_completion(), it will be done in io_free_work(). As removes io_get_work() with corresponding refcount_inc(), the ref balance is maintained. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-03io_uring: get next work with submission ref dropPavel Begunkov
If after dropping the submission reference req->refs == 1, the request is done, because this one is for io_put_work() and will be dropped synchronously shortly after. In this case it's safe to steal a next work from the request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-03io_uring: remove @nxt from handlersPavel Begunkov
There will be no use for @nxt in the handlers, and it's doesn't work anyway, so purge it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-03io_uring: make submission ref putting consistentPavel Begunkov
The rule is simple, any async handler gets a submission ref and should put it at the end. Make them all follow it, and so more consistent. This is a preparation patch, and as io_wq_assign_next() currently won't ever work, this doesn't care to use io_put_req_find_next() instead of io_put_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> refcount_inc_not_zero() -> refcount_inc() fix. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02io_uring: clean up io_closePavel Begunkov
Don't abuse labels for plain and straightworward code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02io_uring: Ensure mask is initialized in io_arm_poll_handlerNathan Chancellor
Clang warns: fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] if (def->pollin) ^~~~~~~~~~~ fs/io_uring.c:4182:2: note: uninitialized use occurs here mask |= POLLERR | POLLPRI; ^~~~ fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always true if (def->pollin) ^~~~~~~~~~~~~~~~ fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence this warning __poll_t mask, ret; ^ = 0 1 warning generated. io_op_defs has many definitions where pollin is not set so mask indeed might be uninitialized. Initialize it to zero and change the next assignment to |=, in case further masks are added in the future to avoid missing changing the assignment then. Fixes: d7718a9d25a6 ("io_uring: use poll driven retry for files that support it") Link: https://github.com/ClangBuiltLinux/linux/issues/916 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02io_uring: remove io_prep_next_work()Pavel Begunkov
io-wq cares about IO_WQ_WORK_UNBOUND flag only while enqueueing, so it's useless setting it for a next req of a link. Thus, removed it from io_prep_linked_timeout(), and inline the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02io_uring: remove extra nxt check after puntPavel Begunkov
After __io_queue_sqe() ended up in io_queue_async_work(), it's already known that there is no @nxt req, so skip the check and return from the function. Also, @nxt initialisation now can be done just before io_put_req_find_next(), as there is no jumping until it's checked. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>