summaryrefslogtreecommitdiff
path: root/fs/io_uring.c
AgeCommit message (Collapse)Author
2021-10-19io_uring: dedup CQE flushing non-empty checksPavel Begunkov
We don't do io_submit_flush_completions() when there is no requests enqueued, and every single caller checks for it. Hide that check into the function not forgetting about inlining. That will make it much easier for changing the empty check condition in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d7ff8cef5da1b38e8ea648f5aad9a315ddfc7b57.1631115443.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19io_uring: inline linked part of io_req_find_nextPavel Begunkov
Inline part of __io_req_find_next() that returns a request but doesn't need io_disarm_next(). It's just two places, but makes links a bit faster. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4126d13f23d0e91b39b3558e16bd86cafa7fcef2.1631115443.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19io_uring: inline io_dismantle_reqPavel Begunkov
io_dismantle_req() is hot, and not _too_ huge. Inline it, there are 3 call sites, which hopefully will turn into 2 in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bdd2dc30716cac270c2403e99bccd6286e4ae201.1631115443.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19io_uring: kill off ios_leftPavel Begunkov
->ios_left is only used to decide whether to plug or not, kill it to avoid this extra accounting, just use the initial submission number. There is no much difference in regards of enabling plugging, where this one does it in a few more cases, but all major ones should be covered well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f13993bcf5b477f9a7d52881fc49f9457ea9870a.1631115443.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19io_uring: dump sqe contents if issue failsJens Axboe
I recently had to look at a production problem where a request ended up getting the dreaded -EINVAL error on submit. The most used and hence useless of error codes, as it just tells you that something was wrong with your request, but not more than that. Let's dump the full sqe contents if we run into an issue failure, that'll allow easier diagnosing of a wide variety of issues. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18io_uring: utilize the io batching infrastructure for more efficient polled IOJens Axboe
Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack supports it, we can handle high rates of polled IO more efficiently. This raises the single core efficiency on my system from ~6.1M IOPS to ~6.6M IOPS running a random read workload at depth 128 on two gen2 Optane drives. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: add a struct io_comp_batch argument to fops->iopoll()Jens Axboe
struct io_comp_batch contains a list head and a completion handler, which will allow completions to more effciently completed batches of IO. For now, no functional changes in this patch, we just define the io_comp_batch structure and add the argument to the file_operations iopoll handler. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18io_uring: don't sleep when polling for I/OChristoph Hellwig
There is no point in sleeping for the expected I/O completion timeout in the io_uring async polling model as we never poll for a specific I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: replace the spin argument to blk_iopoll with a flags argumentChristoph Hellwig
Switch the boolean spin argument to blk_poll to passing a set of flags instead. This will allow to control polling behavior in a more fine grained way. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de [axboe: adapt to changed io_uring iopoll] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18io_uring: fix a layering violation in io_iopoll_req_issuedChristoph Hellwig
syscall-level code can't just poke into the details of the poll cookie, which is private information of the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012111226.760968-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-14io_uring: fix wrong condition to grab uring lockHao Xu
Grab uring lock when we are in io-worker rather than in the original or system-wq context since we already hold it in these two situation. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Fixes: b66ceaf324b3 ("io_uring: move iopoll reissue into regular IO path") Link: https://lore.kernel.org/r/20211014140400.50235-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-01io_uring: kill fasyncPavel Begunkov
We have never supported fasync properly, it would only fire when there is something polling io_uring making it useless. The original support came in through the initial io_uring merge for 5.1. Since it's broken and nobody has reported it, get rid of the fasync bits. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2f7ca3d344d406d34fa6713824198915c41cea86.1633080236.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-27mm/filemap: Add __folio_lock_async()Matthew Wilcox (Oracle)
There aren't any actual callers of lock_page_async(), so remove it. Convert filemap_update_page() to call __folio_lock_async(). __folio_lock_async() is 21 bytes smaller than __lock_page_async(), but the real savings come from using a folio in filemap_update_page(), shrinking it from 515 bytes to 404 bytes, saving 110 bytes. The text shrinks by 132 bytes in total. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Layton <jlayton@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com>
2021-09-24io_uring: make OP_CLOSE consistent with direct openPavel Begunkov
From recently open/accept are now able to manipulate fixed file table, but it's inconsistent that close can't. Close the gap, keep API same as with open/accept, i.e. via sqe->file_slot. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24io_uring: kill extra checks in io_write()Pavel Begunkov
We don't retry short writes and so we would never get to async setup in io_write() in that case. Thus ret2 > 0 is always false and iov_iter_advance() is never used. Apparently, the same is found by Coverity, which complains on the code. Fixes: cd65869512ab ("io_uring: use iov_iter state save/restore helpers") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5b33e61034748ef1022766efc0fb8854cfcf749c.1632500058.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24io_uring: don't punt files update to io-wq unconditionallyJens Axboe
There's no reason to punt it unconditionally, we just need to ensure that the submit lock grabbing is conditional. Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24io_uring: put provided buffer meta data under memcg accountingJens Axboe
For each provided buffer, we allocate a struct io_buffer to hold the data associated with it. As a large number of buffers can be provided, account that data with memcg. Fixes: ddf0322db79c ("io_uring: add IORING_OP_PROVIDE_BUFFERS") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24io_uring: allow conditional reschedule for intensive iteratorsJens Axboe
If we have a lot of threads and rings, the tctx list can get quite big. This is especially true if we keep creating new threads and rings. Likewise for the provided buffers list. Be nice and insert a conditional reschedule point while iterating the nodes for deletion. Link: https://lore.kernel.org/io-uring/00000000000064b6b405ccb41113@google.com/ Reported-by: syzbot+111d2a03f51f5ae73775@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24io_uring: fix potential req refcount underflowHao Xu
For multishot mode, there may be cases like: iowq original context io_poll_add _arm_poll() mask = vfs_poll() is not 0 if mask (2) io_poll_complete() compl_unlock (interruption happens tw queued to original context) io_poll_task_func() compl_lock (3) done = io_poll_complete() is true compl_unlock put req ref (1) if (poll->flags & EPOLLONESHOT) put req ref EPOLLONESHOT flag in (1) may be from (2) or (3), so there are multiple combinations that can cause ref underfow. Let's address it by: - check the return value in (2) as done - change (1) to if (done) in this way, we only do ref put in (1) if 'oneshot flag' is from (2) - do poll.done check in io_poll_task_func(), so that we won't put ref for the second time. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210922101238.7177-4-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24io_uring: fix missing set of EPOLLONESHOT for CQ ring overflowHao Xu
We should set EPOLLONESHOT if cqring_fill_event() returns false since io_poll_add() decides to put req or not by it. Fixes: 5082620fb2ca ("io_uring: terminate multishot poll for CQ ring overflow") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210922101238.7177-3-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24io_uring: fix race between poll completion and cancel_hash insertionHao Xu
If poll arming and poll completion runs in parallel, there maybe races. For instance, run io_poll_add in iowq and io_poll_task_func in original context, then: iowq original context io_poll_add vfs_poll (interruption happens tw queued to original context) io_poll_task_func generate cqe del from cancel_hash[] if !poll.done insert to cancel_hash[] The entry left in cancel_hash[], similar case for fast poll. Fix it by set poll.done = true when del from cancel_hash[]. Fixes: 5082620fb2ca ("io_uring: terminate multishot poll for CQ ring overflow") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210922101238.7177-2-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-19lsm,io_uring: add LSM hooks to io_uringPaul Moore
A full expalantion of io_uring is beyond the scope of this commit description, but in summary it is an asynchronous I/O mechanism which allows for I/O requests and the resulting data to be queued in memory mapped "rings" which are shared between the kernel and userspace. Optionally, io_uring offers the ability for applications to spawn kernel threads to dequeue I/O requests from the ring and submit the requests in the kernel, helping to minimize the syscall overhead. Rings are accessed in userspace by memory mapping a file descriptor provided by the io_uring_setup(2), and can be shared between applications as one might do with any open file descriptor. Finally, process credentials can be registered with a given ring and any process with access to that ring can submit I/O requests using any of the registered credentials. While the io_uring functionality is widely recognized as offering a vastly improved, and high performing asynchronous I/O mechanism, its ability to allow processes to submit I/O requests with credentials other than its own presents a challenge to LSMs. When a process creates a new io_uring ring the ring's credentials are inhertied from the calling process; if this ring is shared with another process operating with different credentials there is the potential to bypass the LSMs security policy. Similarly, registering credentials with a given ring allows any process with access to that ring to submit I/O requests with those credentials. In an effort to allow LSMs to apply security policy to io_uring I/O operations, this patch adds two new LSM hooks. These hooks, in conjunction with the LSM anonymous inode support previously submitted, allow an LSM to apply access control policy to the sharing of io_uring rings as well as any io_uring credential changes requested by a process. The new LSM hooks are described below: * int security_uring_override_creds(cred) Controls if the current task, executing an io_uring operation, is allowed to override it's credentials with @cred. In cases where the current task is a user application, the current credentials will be those of the user application. In cases where the current task is a kernel thread servicing io_uring requests the current credentials will be those of the io_uring ring (inherited from the process that created the ring). * int security_uring_sqpoll(void) Controls if the current task is allowed to create an io_uring polling thread (IORING_SETUP_SQPOLL). Without a SQPOLL thread in the kernel processes must submit I/O requests via io_uring_enter(2) which allows us to compare any requested credential changes against the application making the request. With a SQPOLL thread, we can no longer compare requested credential changes against the application making the request, the comparison is made against the ring's credentials. Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-09-19io_uring: convert io_uring to the secure anon inode interfacePaul Moore
Converting io_uring's anonymous inode to the secure anon inode API enables LSMs to enforce policy on the io_uring anonymous inodes if they chose to do so. This is an important first step towards providing the necessary mechanisms so that LSMs can apply security policy to io_uring operations. Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-09-19audit,io_uring,io-wq: add some basic audit support to io_uringPaul Moore
This patch adds basic auditing to io_uring operations, regardless of their context. This is accomplished by allocating audit_context structures for the io-wq worker and io_uring SQPOLL kernel threads as well as explicitly auditing the io_uring operations in io_issue_sqe(). Individual io_uring operations can bypass auditing through the "audit_skip" field in the struct io_op_def definition for the operation; although great care must be taken so that security relevant io_uring operations do not bypass auditing; please contact the audit mailing list (see the MAINTAINERS file) with any questions. The io_uring operations are audited using a new AUDIT_URINGOP record, an example is shown below: type=UNKNOWN[1336] msg=audit(1631800225.981:37289): uring_op=19 success=yes exit=0 items=0 ppid=15454 pid=15681 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null) Thanks to Richard Guy Briggs for review and feedback. Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-09-17Merge tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring iov_iter retry fixes from Jens Axboe: "This adds a helper to save/restore iov_iter state, and modifies io_uring to use it. After that is done, we can now kill the iter->truncated addition that we added for this release. The io_uring change is being overly cautious with the save/restore/advance, but better safe than sorry and we can always improve that and reduce the overhead if it proves to be of concern. The only case to be worried about in this regard is huge IO, where iteration can take a while to iterate segments. I spent some time writing test cases, and expanded the coverage quite a bit from the last posting of this. liburing carries this regression test case now: https://git.kernel.dk/cgit/liburing/tree/test/file-verify.c which exercises all of this. It now also supports provided buffers, and explicitly tests for end-of-file/device truncation as well. On top of that, Pavel sanitized the IOPOLL retry path to follow the exact same pattern as normal IO" * tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-block: io_uring: move iopoll reissue into regular IO path Revert "iov_iter: track truncated size" io_uring: use iov_iter state save/restore helpers iov_iter: add helper to save iov_iter state
2021-09-15io_uring: move iopoll reissue into regular IO pathPavel Begunkov
230d50d448acb ("io_uring: move reissue into regular IO path") made non-IOPOLL I/O to not retry from ki_complete handler. Follow it steps and do the same for IOPOLL. Same problems, same implementation, same -EAGAIN assumptions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f80dfee2d5fa7678f0052a8ab3cfca9496a112ca.1631699928.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-15io_uring: use iov_iter state save/restore helpersJens Axboe
Get rid of the need to do re-expand and revert on an iterator when we encounter a short IO, or failure that warrants a retry. Use the new state save/restore helpers instead. We keep the iov_iter_state persistent across retries, if we need to restart the read or write operation. If there's a pending retry, the operation will always exit with the state correctly saved. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14io_uring: allow retry for O_NONBLOCK if async is supportedJens Axboe
A common complaint is that using O_NONBLOCK files with io_uring can be a bit of a pain. Be a bit nicer and allow normal retry IFF the file does support async behavior. This makes it possible to use io_uring more reliably with O_NONBLOCK files, for use cases where it either isn't possible or feasible to modify the file flags. Cc: stable@vger.kernel.org Reported-and-tested-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14io_uring: auto-removal for direct open/acceptPavel Begunkov
It might be inconvenient that direct open/accept deviates from the update semantics and fails if the slot is taken instead of removing a file sitting there. Implement this auto-removal. Note that removal might need to allocate and so may fail. However, if an empty slot is specified, it's guaraneed to not fail on the fd installation side for valid userspace programs. It's needed for users who can't tolerate such failures, e.g. accept where the other end never retries. Suggested-by: Franz-B. Tuneke <franz-bernhard.tuneke@tu-dortmund.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c896f14ea46b0eaa6c09d93149e665c2c37979b4.1631632300.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14io_uring: fix missing sigmask restore in io_cqring_wait()Xiaoguang Wang
Move get_timespec() section in io_cqring_wait() before the sigmask saving, otherwise we'll fail to restore sigmask once get_timespec() returns error. Fixes: c73ebb685fb6 ("io_uring: add timeout support for io_uring_enter()") Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Link: https://lore.kernel.org/r/20210914143852.9663-1-xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-13io_uring: pin SQPOLL data before unlocking ring lockJens Axboe
We need to re-check sqd->thread after we've dropped the lock. Pin the sqd before doing the lockdep lock dance, and check if the thread is alive after that. It's either NULL or alive, as the SQPOLL thread cannot exit without holding the same sqd->lock. Reported-and-tested-by: syzbot+337de45f13a4fd54d708@syzkaller.appspotmail.com Fixes: fa84693b3c89 ("io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-12io_uring: ensure symmetry in handling iter types in loop_rw_iter()Jens Axboe
When setting up the next segment, we check what type the iter is and handle it accordingly. However, when incrementing and processed amount we do not, and both iter advance and addr/len are adjusted, regardless of type. Split the increment side just like we do on the setup side. Fixes: 4017eb91a9e7 ("io_uring: make loop_rw_iter() use original user supplied pointers") Cc: stable@vger.kernel.org Reported-by: Valentina Palmiotti <vpalmiotti@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-11Merge tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Fix an off-by-one in a BUILD_BUG_ON() check. Not a real issue right now as we have plenty of flags left, but could become one. (Hao) - Fix lockdep issue introduced in this merge window (me) - Fix a few issues with the worker creation (me, Pavel, Qiang) - Fix regression with wq_has_sleeper() for IOPOLL (Pavel) - Timeout link error propagation fix (Pavel) * tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block: io_uring: fix off-by-one in BUILD_BUG_ON check of __REQ_F_LAST_BIT io_uring: fail links of cancelled timeouts io-wq: fix memory leak in create_io_worker() io-wq: fix silly logic error in io_task_work_match() io_uring: drop ctx->uring_lock before acquiring sqd->lock io_uring: fix missing mb() before waitqueue_active io-wq: fix cancellation on create-worker failure
2021-09-10io_uring: fix off-by-one in BUILD_BUG_ON check of __REQ_F_LAST_BITHao Xu
Build check of __REQ_F_LAST_BIT should be larger than, not equal or larger than. It's perfectly valid to have __REQ_F_LAST_BIT be 32, as that means that the last valid bit is 31 which does fit in the type. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210907032243.114190-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-09Merge branch 'work.iov_iter' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter fixes from Al Viro: "Fixes for io-uring handling of iov_iter reexpands" * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: io_uring: reexpand under-reexpanded iters iov_iter: track truncated size
2021-09-09io_uring: fail links of cancelled timeoutsPavel Begunkov
When we cancel a timeout we should mark it with REQ_F_FAIL, so linked requests are cancelled as well, but not queued for further execution. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fff625b44eeced3a5cae79f60e6acf3fbdf8f990.1631192135.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-08io_uring: drop ctx->uring_lock before acquiring sqd->lockJens Axboe
The SQPOLL thread dictates the lock order, and we hold the ctx->uring_lock for all the registration opcodes. We also hold a ref to the ctx, and we do drop the lock for other reasons to quiesce, so it's fine to drop the ctx lock temporarily to grab the sqd->lock. This fixes the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.5/25433 is trying to acquire lock: ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: io_register_iowq_max_workers fs/io_uring.c:10551 [inline] ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __io_uring_register fs/io_uring.c:10757 [inline] ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792 but task is already holding lock: ffff8880885b40a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x2e1/0x2e70 fs/io_uring.c:10791 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&ctx->uring_lock){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:596 [inline] __mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729 __io_sq_thread fs/io_uring.c:7291 [inline] io_sq_thread+0x65a/0x1370 fs/io_uring.c:7368 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 -> #0 (&sqd->lock){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3051 [inline] check_prevs_add kernel/locking/lockdep.c:3174 [inline] validate_chain kernel/locking/lockdep.c:3789 [inline] __lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5015 lock_acquire kernel/locking/lockdep.c:5625 [inline] lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590 __mutex_lock_common kernel/locking/mutex.c:596 [inline] __mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729 io_register_iowq_max_workers fs/io_uring.c:10551 [inline] __io_uring_register fs/io_uring.c:10757 [inline] __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ctx->uring_lock); lock(&sqd->lock); lock(&ctx->uring_lock); lock(&sqd->lock); *** DEADLOCK *** Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Reported-by: syzbot+97fa56483f69d677969f@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-08io_uring: fix missing mb() before waitqueue_activePavel Begunkov
In case of !SQPOLL, io_cqring_ev_posted_iopoll() doesn't provide a memory barrier required by waitqueue_active(&ctx->poll_wait). There is a wq_has_sleeper(), which does smb_mb() inside, but it's called only for SQPOLL. Fixes: 5fd4617840596 ("io_uring: be smarter about waking multiple CQ ring waiters") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2982e53bcea2274006ed435ee2a77197107d8a29.1631130542.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-06Merge tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "As sometimes happens, two reports came in around the merge window open that led to some fixes. Hence this one is a bit bigger than usual followup fixes, but most of it will be going towards stable, outside of the fixes that are addressing regressions from this merge window. In detail: - postgres is a heavy user of signals between tasks, and if we're unlucky this can interfere with io-wq worker creation. Make sure we're resilient against unrelated signal handling. This set of changes also includes hardening against allocation failures, which could previously had led to stalls. - Some use cases that end up having a mix of bounded and unbounded work would have starvation issues related to that. Split the pending work lists to handle that better. - Completion trace int -> unsigned -> long fix - Fix issue with REGISTER_IOWQ_MAX_WORKERS and SQPOLL - Fix regression with hash wait lock in this merge window - Fix retry issued on block devices (Ming) - Fix regression with links in this merge window (Pavel) - Fix race with multi-shot poll and completions (Xiaoguang) - Ensure regular file IO doesn't inadvertently skip completion batching (Pavel) - Ensure submissions are flushed after running task_work (Pavel)" * tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block: io_uring: io_uring_complete() trace should take an integer io_uring: fix possible poll event lost in multi shot mode io_uring: prolong tctx_task_work() with flushing io_uring: don't disable kiocb_done() CQE batching io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL io-wq: make worker creation resilient against signals io-wq: get rid of FIXED worker flag io-wq: only exit on fatal signals io-wq: split bounded and unbounded work into separate lists io-wq: fix queue stalling race io_uring: don't submit half-prepared drain request io_uring: fix queueing half-created requests io-wq: ensure that hash wait lock is IRQ disabling io_uring: retry in case of short read on block device io_uring: IORING_OP_WRITE needs hash_reg_file set io-wq: fix race between adding work and activating a free worker
2021-09-03io_uring: reexpand under-reexpanded itersPavel Begunkov
[ 74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900 [ 74.212778] Read of size 8 at addr ffff888025dc78b8 by task syz-executor.0/828 [ 74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted 5.14.0-rc3-next-20210730 #1 [ 74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 74.219033] Call Trace: [ 74.219683] dump_stack_lvl+0x8b/0xb3 [ 74.220706] print_address_description.constprop.0+0x1f/0x140 [ 74.224226] kasan_report.cold+0x7f/0x11b [ 74.226085] iov_iter_revert+0x809/0x900 [ 74.227960] io_write+0x57d/0xe40 [ 74.232647] io_issue_sqe+0x4da/0x6a80 [ 74.242578] __io_queue_sqe+0x1ac/0xe60 [ 74.245358] io_submit_sqes+0x3f6e/0x76a0 [ 74.248207] __do_sys_io_uring_enter+0x90c/0x1a20 [ 74.257167] do_syscall_64+0x3b/0x90 [ 74.257984] entry_SYSCALL_64_after_hwframe+0x44/0xae old_size = iov_iter_count(); ... iov_iter_revert(old_size - iov_iter_count()); If iov_iter_revert() is done base on the initial size as above, and the iter is truncated and not reexpanded in the middle, it miscalculates borders causing problems. This trace is due to no one reexpanding after generic_write_checks(). Now iters store how many bytes has been truncated, so reexpand them to the initial state right before reverting. Cc: stable@vger.kernel.org Reported-by: Palash Oswal <oswalpalash@gmail.com> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Reported-and-tested-by: syzbot+9671693590ef5aad8953@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-03io_uring: fix possible poll event lost in multi shot modeXiaoguang Wang
IIUC, IORING_POLL_ADD_MULTI is similar to epoll's edge-triggered mode, that means once one pure poll request returns one event(cqe), we'll need to read or write continually until EAGAIN is returned, then I think there is a possible poll event lost race in multi shot mode: t1 poll request add | | t2 | | t3 event happens | | t4 task work add | | t5 | task work run | t6 | commit one cqe | t7 | | user app handles cqe t8 | new event happen | t9 | add back to waitqueue | t10 | After t6 but before t9, if new event happens, there'll be no wakeup operation, and if user app has picked up this cqe in t7, read or write until EAGAIN is returned. In t8, new event happens and will be lost, though this race window maybe small. To fix this possible race, add poll request back to waitqueue before committing cqe. Fixes: 88e41cf928a6 ("io_uring: add multishot mode for IORING_OP_POLL_ADD") Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Link: https://lore.kernel.org/r/20210903142436.5767-1-xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03io_uring: prolong tctx_task_work() with flushingPavel Begunkov
io_submit_flush_completions() may enqueue linked requests for task_work execution, so don't leave tctx_task_work() right after the tw list is exhausted, but try to flush and then retry. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0755d4c2c36301447c63bdd4146c10477cea4249.1630539342.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03io_uring: don't disable kiocb_done() CQE batchingPavel Begunkov
Not passing issue_flags from kiocb_done() into __io_complete_rw() means that completion batching for this case is disabled, e.g. for most of buffered reads. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b2689462835c3ee28a5999ef4f9a581e24be04a2.1630539342.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLLJens Axboe
SQPOLL has a different thread doing submissions, we need to check for that and use the right task context when updating the worker values. Just hold the sqd->lock across the operation, this ensures that the thread cannot go away while we poke at ->io_uring. Link: https://github.com/axboe/liburing/issues/420 Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Reported-by: Johannes Lundberg <johalun0@gmail.com> Tested-by: Johannes Lundberg <johalun0@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31io_uring: don't submit half-prepared drain requestPavel Begunkov
[ 3784.910888] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 3784.910904] RIP: 0010:__io_file_supports_nowait+0x5/0xc0 [ 3784.910926] Call Trace: [ 3784.910928] ? io_read+0x17c/0x480 [ 3784.910945] io_issue_sqe+0xcb/0x1840 [ 3784.910953] __io_queue_sqe+0x44/0x300 [ 3784.910959] io_req_task_submit+0x27/0x70 [ 3784.910962] tctx_task_work+0xeb/0x1d0 [ 3784.910966] task_work_run+0x61/0xa0 [ 3784.910968] io_run_task_work_sig+0x53/0xa0 [ 3784.910975] __x64_sys_io_uring_enter+0x22/0x30 [ 3784.910977] do_syscall_64+0x3d/0x90 [ 3784.910981] entry_SYSCALL_64_after_hwframe+0x44/0xae io_drain_req() goes before checks for REQ_F_FAIL, which protect us from submitting under-prepared request (e.g. failed in io_init_req(). Fail such drained requests as well. Fixes: a8295b982c46d ("io_uring: fix failed linkchain code logic") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e411eb9924d47a131b1e200b26b675df0c2b7627.1630415423.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31io_uring: fix queueing half-created requestsPavel Begunkov
[ 27.259845] general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN PTI [ 27.261043] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f] [ 27.263730] RIP: 0010:sock_from_file+0x20/0x90 [ 27.272444] Call Trace: [ 27.272736] io_sendmsg+0x98/0x600 [ 27.279216] io_issue_sqe+0x498/0x68d0 [ 27.281142] __io_queue_sqe+0xab/0xb50 [ 27.285830] io_req_task_submit+0xbf/0x1b0 [ 27.286306] tctx_task_work+0x178/0xad0 [ 27.288211] task_work_run+0xe2/0x190 [ 27.288571] exit_to_user_mode_prepare+0x1a1/0x1b0 [ 27.289041] syscall_exit_to_user_mode+0x19/0x50 [ 27.289521] do_syscall_64+0x48/0x90 [ 27.289871] entry_SYSCALL_64_after_hwframe+0x44/0xae io_req_complete_failed() -> io_req_complete_post() -> io_req_task_queue() still would try to enqueue hard linked request, which can be half prepared (e.g. failed init), so we can't allow that to happen. Fixes: a8295b982c46d ("io_uring: fix failed linkchain code logic") Reported-by: syzbot+f9704d1878e290eddf73@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/70b513848c1000f88bd75965504649c6bb1415c0.1630415423.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31io_uring: retry in case of short read on block deviceMing Lei
In case of buffered reading from block device, when short read happens, we should retry to read more, otherwise the IO will be completed partially, for example, the following fio expects to read 2MB, but it can only read 1M or less bytes: fio --name=onessd --filename=/dev/nvme0n1 --filesize=2M \ --rw=randread --bs=2M --direct=0 --overwrite=0 --numjobs=1 \ --iodepth=1 --time_based=0 --runtime=2 --ioengine=io_uring \ --registerfiles --fixedbufs --gtod_reduce=1 --group_reporting Fix the issue by allowing short read retry for block device, which sets FMODE_BUF_RASYNC really. Fixes: 9a173346bd9e ("io_uring: fix short read retries for non-reg files") Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210821150751.1290434-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31io_uring: IORING_OP_WRITE needs hash_reg_file setJens Axboe
During some testing, it became evident that using IORING_OP_WRITE doesn't hash buffered writes like the other writes commands do. That's simply an oversight, and can cause performance regressions when doing buffered writes with this command. Correct that and add the flag, so that buffered writes are correctly hashed when using the non-iovec based write command. Cc: stable@vger.kernel.org Fixes: 3a6820f2bb8a ("io_uring: add non-vectored read/write commands") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-30Merge tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe: "This adds io_uring support for mkdirat, symlinkat, and linkat" * tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block: io_uring: add support for IORING_OP_LINKAT io_uring: add support for IORING_OP_SYMLINKAT io_uring: add support for IORING_OP_MKDIRAT namei: update do_*() helpers to return ints namei: make do_linkat() take struct filename namei: add getname_uflags() namei: make do_symlinkat() take struct filename namei: make do_mknodat() take struct filename namei: make do_mkdirat() take struct filename namei: change filename_parentat() calling conventions namei: ignore ERR/NULL names in putname()
2021-08-30Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull support for struct bio recycling from Jens Axboe: "This adds bio recycling support for polled IO, allowing quick reuse of a bio for high IOPS scenarios via a percpu bio_set list. It's good for almost a 10% improvement in performance, bumping our per-core IO limit from ~3.2M IOPS to ~3.5M IOPS" * tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block: bio: improve kerneldoc documentation for bio_alloc_kiocb() block: provide bio_clear_hipri() helper block: use the percpu bio cache in __blkdev_direct_IO io_uring: enable use of bio alloc cache block: clear BIO_PERCPU_CACHE flag if polling isn't supported bio: add allocation cache abstraction fs: add kiocb alloc cache flag bio: optimize initialization of a bio