summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-02-01xfs: reduce exclusive locking on unaligned dioDave Chinner
Attempt shared locking for unaligned DIO, but only if the the underlying extent is already allocated and in written state. On failure, retry with the existing exclusive locking. Test case is fio randrw of 512 byte IOs using AIO and an iodepth of 32 IOs. Vanilla: READ: bw=4560KiB/s (4670kB/s), 4560KiB/s-4560KiB/s (4670kB/s-4670kB/s), io=134MiB (140MB), run=30001-30001msec WRITE: bw=4567KiB/s (4676kB/s), 4567KiB/s-4567KiB/s (4676kB/s-4676kB/s), io=134MiB (140MB), run=30001-30001msec Patched: READ: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1127MiB (1182MB), run=30002-30002msec WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1128MiB (1183MB), run=30002-30002msec That's an improvement from ~18k IOPS to a ~150k IOPS, which is about the IOPS limit of the VM block device setup I'm testing on. 4kB block IO comparison: READ: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8868MiB (9299MB), run=30002-30002msec WRITE: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8878MiB (9309MB), run=30002-30002msec Which is ~150k IOPS, same as what the test gets for sub-block AIO+DIO writes with this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> [hch: rebased, split unaligned from nowait] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01xfs: split the unaligned DIO write code outDave Chinner
The unaligned DIO write path is more convolted than the normal path, and we are about to make it more complex. Keep the block aligned fast path dio write code trim and simple by splitting out the unaligned DIO code from it. Signed-off-by: Dave Chinner <dchinner@redhat.com> [hch: rebased, fixed a few minor nits] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01xfs: improve the reflink_bounce_dio_write tracepointChristoph Hellwig
Use a more suitable event class. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01xfs: simplify the read/write tracepointsChristoph Hellwig
Pass the iocb and iov_iter to the tracepoints and leave decoding of actual arguments to the code only run when tracing is enabled. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01xfs: remove the buffered I/O fallback assertChristoph Hellwig
The iomap code has been designed from the start not to do magic fallback, so remove the assert in preparation for further code cleanups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01xfs: cleanup the read/write helper namingChristoph Hellwig
Drop a few pointless aio_ prefixes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-awareChristoph Hellwig
Ensure we don't block on the iolock, or waiting for I/O in xfs_file_aio_write_checks if the caller asked to avoid that. Fixes: 29a5d29ec181 ("xfs: nowait aio support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01xfs: factor out a xfs_ilock_iocb helperChristoph Hellwig
Add a helper to factor out the nowait locking logical for the read/write helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01xfs: Fix 'set but not used' warning in xfs_bmap_compute_alignments()Chandan Babu R
With both CONFIG_XFS_DEBUG and CONFIG_XFS_WARN disabled, the only reference to local variable "error" in xfs_bmap_compute_alignments() gets eliminated during pre-processing stage of the compilation process. This causes the compiler to generate a "set but not used" warning. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-01xfs: fix unused log variable in xfs_log_cover()Brian Foster
The log variable is only used in kernels with asserts enabled. Remove it and open code the dereference to avoid unused variable warnings. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01io_uring: check kthread parked flag before sqthread goes to sleepHao Xu
Abaci reported this issue: #[ 605.170872] INFO: task kworker/u4:1:53 blocked for more than 143 seconds. [ 605.172123] Not tainted 5.10.0+ #1 [ 605.172811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.173915] task:kworker/u4:1 state:D stack: 0 pid: 53 ppid: 2 flags:0x00004000 [ 605.175130] Workqueue: events_unbound io_ring_exit_work [ 605.175931] Call Trace: [ 605.176334] __schedule+0xe0e/0x25a0 [ 605.176971] ? firmware_map_remove+0x1a1/0x1a1 [ 605.177631] ? write_comp_data+0x2a/0x80 [ 605.178272] schedule+0xd0/0x270 [ 605.178811] schedule_timeout+0x6b6/0x940 [ 605.179415] ? mark_lock.part.0+0xca/0x1420 [ 605.180062] ? usleep_range+0x170/0x170 [ 605.180684] ? wait_for_completion+0x16d/0x280 [ 605.181392] ? mark_held_locks+0x9e/0xe0 [ 605.182079] ? rwlock_bug.part.0+0x90/0x90 [ 605.182853] ? lockdep_hardirqs_on_prepare+0x286/0x400 [ 605.183817] wait_for_completion+0x175/0x280 [ 605.184713] ? wait_for_completion_interruptible+0x340/0x340 [ 605.185611] ? _raw_spin_unlock_irq+0x24/0x30 [ 605.186307] ? migrate_swap_stop+0x9c0/0x9c0 [ 605.187046] kthread_park+0x127/0x1c0 [ 605.187738] io_sq_thread_stop+0xd5/0x530 [ 605.188459] io_ring_exit_work+0xb1/0x970 [ 605.189207] process_one_work+0x92c/0x1510 [ 605.189947] ? pwq_dec_nr_in_flight+0x360/0x360 [ 605.190682] ? rwlock_bug.part.0+0x90/0x90 [ 605.191430] ? write_comp_data+0x2a/0x80 [ 605.192207] worker_thread+0x9b/0xe20 [ 605.192900] ? process_one_work+0x1510/0x1510 [ 605.193599] kthread+0x353/0x460 [ 605.194154] ? _raw_spin_unlock_irq+0x24/0x30 [ 605.194910] ? kthread_create_on_node+0x100/0x100 [ 605.195821] ret_from_fork+0x1f/0x30 [ 605.196605] [ 605.196605] Showing all locks held in the system: [ 605.197598] 1 lock held by khungtaskd/25: [ 605.198301] #0: ffffffff8b5f76a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x30 [ 605.199914] 3 locks held by kworker/u4:1/53: [ 605.200609] #0: ffff888100109938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x82a/0x1510 [ 605.202108] #1: ffff888100e47dc0 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x85e/0x1510 [ 605.203681] #2: ffff888116931870 (&sqd->lock){+.+.}-{3:3}, at: io_sq_thread_park.part.0+0x19/0x50 [ 605.205183] 3 locks held by systemd-journal/161: [ 605.206037] 1 lock held by syslog-ng/254: [ 605.206674] 2 locks held by agetty/311: [ 605.207292] #0: ffff888101097098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80 [ 605.208715] #1: ffffc900000332e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0 [ 605.210131] 2 locks held by bash/677: [ 605.210723] #0: ffff88810419a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80 [ 605.212105] #1: ffffc900000512e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0 [ 605.213777] [ 605.214151] ============================================= I believe this is caused by the follow race: (ctx_list is empty now) => io_put_sq_data | ==> kthread_park(sqd->thread); | ====> set KTHREAD_SHOULD_PARK | ====> wake_up_process(k) | sq thread is running | | | needs_sched is true since no ctx, | so TASK_INTERRUPTIBLE set and schedule | out then never wake up again | ====> wait_for_completion | (stuck here) So check if sqthread gets park flag right before schedule(). since ctx_list is always empty when this problem happens, here I put kthread_should_park() before setting the wakeup flag(ctx_list is empty so this for loop is fast), where is close enough to schedule(). The problem doesn't show again in my repro testing after this fix. Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: Add skip option for __io_sqe_files_updatenoah
This patch adds support for skipping a file descriptor when using IORING_REGISTER_FILES_UPDATE. __io_sqe_files_update will skip fds set to IORING_REGISTER_FILES_SKIP. IORING_REGISTER_FILES_SKIP is inturn added as a #define in io_uring.h Signed-off-by: noah <goldstein.w.n@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: cleanup files_update loopingPavel Begunkov
Replace a while with a simple for loop, that looks way more natural, and enables us to use "continue" as indexes are no more updated by hand in the end of the loop. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: consolidate putting reqs taskPavel Begunkov
We grab a task for each request and while putting it it also have to do extra work like inflight accounting and waking up that task. This sequence is duplicated several time, it's good time to add a helper. More to that, the helper generates better code due to better locality and so not failing alias analysis. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: ensure only sqo_task has file notesPavel Begunkov
For SQPOLL io_uring we want to have only one file note held by sqo_task. Add a warning to make sure it holds. It's deep in io_uring_add_task_file() out of hot path, so shouldn't hurt. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: simplify io_remove_personalities()Yejune Deng
The function io_remove_personalities() is very similar to io_unregister_personality(),so implement io_remove_personalities() calling io_unregister_personality(). Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCELJens Axboe
It's no longer used as IORING_OP_CLOSE got rid for the need of flagging it as uncancelable, kill it of. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: get rid of intermediate IORING_OP_CLOSE stageJens Axboe
We currently split the close into two, in case we have a ->flush op that we can't safely handle from non-blocking context. This requires us to flag the op as uncancelable if we do need to punt it async, and that means special handling for just this op type. Use __close_fd_get_file() and grab the files lock so we can get the file and check if we need to go async in one atomic operation. That gets rid of the need for splitting this into two steps, and hence the need for IO_WQ_WORK_NO_CANCEL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01fs: provide locked helper variant of close_fd_get_file()Jens Axboe
Assumes current->files->file_lock is already held on invocation. Helps the caller check the file before removing the fd, if it needs to. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: save atomic dec for inline executed reqsPavel Begunkov
When a request is completed with comp_state, its completion reference put is deferred to io_submit_flush_completions(), but the submission is put not far from there, so do it together to save one atomic dec per request. That targets requests that complete inline, e.g. buffered rw, send/recv. Proper benchmarking haven't been conducted but for nops(batch=32) it was around 7901 vs 8117 KIOPS (~2.7%), or ~4% per perf profiling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: don't flush CQEs deep down the stackPavel Begunkov
io_submit_flush_completions() is called down the stack in the _state version of io_req_complete(), that's ok because is only called by io_uring opcode handler functions directly. Move it up to __io_queue_sqe() as preparation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: help inlining of io_req_complete()Pavel Begunkov
__io_req_complete() inlining is a bit weird, some compilers don't optimise out the non-NULL branch of it even when called as io_req_complete(). Help it a bit by extracting state and stateless helpers out of __io_req_complete(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: add a helper timeout mode calculationPavel Begunkov
Deduplicates translation of timeout flags into hrtimer_mode. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: deduplicate failing task_work_addPavel Begunkov
When io_req_task_work_add() fails, the request will be cancelled by enqueueing via task_works of io-wq. Extract a function for that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: remove __io_state_file_putPavel Begunkov
The check in io_state_file_put() is optimised pretty well when called from __io_file_get(). Don't pollute the code with all these variants. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: simplify io_alloc_req()Pavel Begunkov
Get rid of a label in io_alloc_req(), it's cleaner to do return directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: further deduplicate #CQ events calcPavel Begunkov
Apparently, there is one more place hand coded calculation of number of CQ events in the ring. Use __io_cqring_events() helper in io_get_cqring() as well. Naturally, assembly stays identical. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: inline __io_commit_cqring()Pavel Begunkov
Inline it in its only user, that's cleaner Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: inline io_async_submit()Pavel Begunkov
The name is confusing and it's used only in one place. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: cleanup personalities under uring_lockPavel Begunkov
personality_idr is usually synchronised by uring_lock, the exception would be removing personalities in io_ring_ctx_wait_and_kill(), which is legit as refs are killed by that point but still would be more resilient to do it under the lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: refactor io_resubmit_prep()Pavel Begunkov
It's awkward to pass return a value into a function for it to return it back. Check it at the caller site and clean up io_resubmit_prep() a bit. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: optimise io_rw_reissue()Pavel Begunkov
The hot path is IO completing on the first try. Reshuffle io_rw_reissue() so it's checked first. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: make percpu_ref_release names consistentBijan Mottahedeh
Make the percpu ref release function names consistent between rsrc data and nodes. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: create common fixed_rsrc_data allocation routinesBijan Mottahedeh
Create common alloc/free fixed_rsrc_data routines for both files and buffers. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [remove buffer part] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: create common fixed_rsrc_ref_node handling routinesBijan Mottahedeh
Create common routines to be used for both files/buffers registration. [remove io_sqe_rsrc_set_node substitution] Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [merge, quiesce only for files] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: split ref_node alloc and initPavel Begunkov
A simple prep patch allowing to set refnode callbacks after it was allocated. This needed to 1) keep ourself off of hi-level functions where it's not pretty and they are not necessary 2) amortise ref_node allocation in the future, e.g. for updates. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: split alloc_fixed_file_ref_nodeBijan Mottahedeh
Split alloc_fixed_file_ref_node into resource generic/specific parts, to be leveraged for fixed buffers. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: add rsrc_ref locking routinesBijan Mottahedeh
Encapsulate resource reference locking into separate routines. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: separate ref_list from fixed_rsrc_dataBijan Mottahedeh
Uplevel ref_list and make it common to all resources. This is to allow one common ref_list to be used for both files, and buffers in upcoming patches. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: generalize io_queue_rsrc_removalBijan Mottahedeh
Generalize io_queue_rsrc_removal to handle both files and buffers. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [remove io_mapped_ubuf from rsrc tables/etc. for now] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: rename file related variables to rsrcBijan Mottahedeh
This is a prep rename patch for subsequent patches to generalize file registration. [io_uring_rsrc_update:: rename fds -> data] Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [leave io_uring_files_update as struct] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: modularize io_sqe_buffers_registerBijan Mottahedeh
Move allocation of buffer management structures, and validation of buffers into separate routines. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: modularize io_sqe_buffer_registerBijan Mottahedeh
Split io_sqe_buffer_register into two routines: - io_sqe_buffer_register() registers a single buffer - io_sqe_buffers_register iterates over all user specified buffers Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01io_uring: enable LOOKUP_CACHED path resolution for filename lookupsJens Axboe
Instead of being pessimistic and assume that path lookup will block, use LOOKUP_CACHED to attempt just a cached lookup. This ensures that the fast path is always done inline, and we only punt to async context if IO is needed to satisfy the lookup. For forced nonblock open attempts, mark the file O_NONBLOCK over the actual ->open() call as well. We can safely clear this again before doing fd_install(), so it'll never be user visible that we fiddled with it. This greatly improves the performance of file open where the dentry is already cached: ached 5.10-git 5.10-git+LOOKUP_CACHED Speedup --------------------------------------------------------------- 33% 1,014,975 900,474 1.1x 89% 545,466 292,937 1.9x 100% 435,636 151,475 2.9x The more cache hot we are, the faster the inline LOOKUP_CACHED optimization helps. This is unsurprising and expected, as a thread offload becomes a more dominant part of the total overhead. If we look at io_uring tracing, doing an IORING_OP_OPENAT on a file that isn't in the dentry cache will yield: 275.550481: io_uring_create: ring 00000000ddda6278, fd 3 sq size 8, cq size 16, flags 0 275.550491: io_uring_submit_sqe: ring 00000000ddda6278, op 18, data 0x0, non block 1, sq_thread 0 275.550498: io_uring_queue_async_work: ring 00000000ddda6278, request 00000000c0267d17, flags 69760, normal queue, work 000000003d683991 275.550502: io_uring_cqring_wait: ring 00000000ddda6278, min_events 1 275.550556: io_uring_complete: ring 00000000ddda6278, user_data 0x0, result 4 which shows a failed nonblock lookup, then punt to worker, and then we complete with fd == 4. This takes 65 usec in total. Re-running the same test case again: 281.253956: io_uring_create: ring 0000000008207252, fd 3 sq size 8, cq size 16, flags 0 281.253967: io_uring_submit_sqe: ring 0000000008207252, op 18, data 0x0, non block 1, sq_thread 0 281.253973: io_uring_complete: ring 0000000008207252, user_data 0x0, result 4 shows the same request completing inline, also returning fd == 4. This takes 6 usec. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01Merge branch 'work.namei' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-5.12/io_uring Merge RESOLVE_CACHED bits from Al, as the io_uring changes will build on top of that. * 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED fs: add support for LOOKUP_CACHED saner calling conventions for unlazy_child() fs: make unlazy_walk() error handling consistent fs/namei.c: Remove unlikely of status being -ECHILD in lookup_fast() do_tmpfile(): don't mess with finish_open()
2021-01-31Merge tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: - SUNRPC: Handle 0 length opaque XDR object data properly - Fix a layout segment leak in pnfs_layout_process() - pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn - pNFS/NFSv4: Improve rejection of out-of-order layouts - pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process() * tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Handle 0 length opaque XDR object data properly SUNRPC: Move simple_get_bytes and simple_get_netobj into private header pNFS/NFSv4: Improve rejection of out-of-order layouts pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process() pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
2021-01-30Merge tag '5.11-rc5-smb3' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Four cifs patches found in additional testing of the conversion to the new mount API: three small option processing ones, and one fixing domain based DFS referrals" * tag '5.11-rc5-smb3' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix dfs domain referrals cifs: returning mount parm processing errors correctly cifs: fix mounts to subdirectories of target cifs: ignore auto and noauto options if given
2021-01-30nfsd: skip some unnecessary stats in the v4 caseJ. Bruce Fields
In the typical case of v4 and an i_version-supporting filesystem, we can skip a stat which is only required to fake up a change attribute from ctime. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-30nfs: use change attribute for NFS re-exportsJ. Bruce Fields
When exporting NFS, we may as well use the real change attribute returned by the original server instead of faking up a change attribute from the ctime. Note we can't do that by setting I_VERSION--that would also turn on the logic in iversion.h which treats the lower bit specially, and that doesn't make sense for NFS. So instead we define a new export operation for filesystems like NFS that want to manage the change attribute themselves. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-29rxrpc: Fix deadlock around release of dst cached on udp tunnelDavid Howells
AF_RXRPC sockets use UDP ports in encap mode. This causes socket and dst from an incoming packet to get stolen and attached to the UDP socket from whence it is leaked when that socket is closed. When a network namespace is removed, the wait for dst records to be cleaned up happens before the cleanup of the rxrpc and UDP socket, meaning that the wait never finishes. Fix this by moving the rxrpc (and, by dependence, the afs) private per-network namespace registrations to the device group rather than subsys group. This allows cached rxrpc local endpoints to be cleared and their UDP sockets closed before we try waiting for the dst records. The symptom is that lines looking like the following: unregister_netdevice: waiting for lo to become free get emitted at regular intervals after running something like the referenced syzbot test. Thanks to Vadim for tracking this down and work out the fix. Reported-by: syzbot+df400f2f24a1677cd7e0@syzkaller.appspotmail.com Reported-by: Vadim Fedorenko <vfedorenko@novek.ru> Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/161196443016.3868642.5577440140646403533.stgit@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>