summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)Author
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-08Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds
Pull vfs rw updates from Christian Brauner: "This contains updates from Amir for read-write backing file helpers for stacking filesystems such as overlayfs: - Fanotify is currently in the process of introducing pre content events. Roughly, a new permission event will be added indicating that it is safe to write to the file being accessed. These events are used by hierarchical storage managers to e.g., fill the content of files on first access. During that work we noticed that our current permission checking is inconsistent in rw_verify_area() and remap_verify_area(). Especially in the splice code permission checking is done multiple times. For example, one time for the whole range and then again for partial ranges inside the iterator. In addition, we mostly do permission checking before we call file_start_write() except for a few places where we call it after. For pre-content events we need such permission checking to be done before file_start_write(). So this is a nice reason to clean this all up. After this series, all permission checking is done before file_start_write(). As part of this cleanup we also massaged the splice code a bit. We got rid of a few helpers because we are alredy drowning in special read-write helpers. We also cleaned up the return types for splice helpers. - Introduce generic read-write helpers for backing files. This lifts some overlayfs code to common code so it can be used by the FUSE passthrough work coming in over the next cycles. Make Amir and Miklos the maintainers for this new subsystem of the vfs" * tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: fix __sb_write_started() kerneldoc formatting fs: factor out backing_file_mmap() helper fs: factor out backing_file_splice_{read,write}() helpers fs: factor out backing_file_{read,write}_iter() helpers fs: prepare for stackable filesystems backing file helpers fsnotify: optionally pass access range in file permission hooks fsnotify: assert that file_start_write() is not held in permission hooks fsnotify: split fsnotify_perm() into two hooks fs: use splice_copy_file_range() inline helper splice: return type ssize_t from all helpers fs: use do_splice_direct() for nfsd/ksmbd server-side-copy fs: move file_start_write() into direct_splice_actor() fs: fork splice_file_range() from do_splice_direct() fs: create {sb,file}_write_not_started() helpers fs: create file_write_started() helper fs: create __sb_write_started() helper fs: move kiocb_start_write() into vfs_iocb_iter_write() fs: move permission hook out of do_iter_read() fs: move permission hook out of do_iter_write() fs: move file_start_write() into vfs_iter_write() ...
2024-01-08Merge tag 'vfs-6.8.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...
2023-12-29io_uring: use mempool KASAN hookAndrey Konovalov
Use the proper kasan_mempool_unpoison_object hook for unpoisoning cached objects. A future change might also update io_uring to check the return value of kasan_mempool_poison_object to prevent double-free and invalid-free bugs. This proves to be non-trivial with the current way io_uring caches objects, so this is left out-of-scope of this series. Link: https://lkml.kernel.org/r/eca18d6cbf676ed784f1a1f209c386808a8087c5.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: rename kasan_slab_free_mempool to kasan_mempool_poison_objectAndrey Konovalov
Patch series "kasan: save mempool stack traces". This series updates KASAN to save alloc and free stack traces for secondary-level allocators that cache and reuse allocations internally instead of giving them back to the underlying allocator (e.g. mempool). As a part of this change, introduce and document a set of KASAN hooks: bool kasan_mempool_poison_pages(struct page *page, unsigned int order); void kasan_mempool_unpoison_pages(struct page *page, unsigned int order); bool kasan_mempool_poison_object(void *ptr); void kasan_mempool_unpoison_object(void *ptr, size_t size); and use them in the mempool code. Besides mempool, skbuff and io_uring also cache allocations and already use KASAN hooks to poison those. Their code is updated to use the new mempool hooks. The new hooks save alloc and free stack traces (for normal kmalloc and slab objects; stack traces for large kmalloc objects and page_alloc are not supported by KASAN yet), improve the readability of the users' code, and also allow the users to prevent double-free and invalid-free bugs; see the patches for the details. This patch (of 21): Rename kasan_slab_free_mempool to kasan_mempool_poison_object. kasan_slab_free_mempool is a slightly confusing name: it is unclear whether this function poisons the object when it is freed into mempool or does something when the object is freed from mempool to the underlying allocator. The new name also aligns with other mempool-related KASAN hooks added in the following patches in this series. Link: https://lkml.kernel.org/r/cover.1703024586.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/c5618685abb7cdbf9fb4897f565e7759f601da84.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-21Merge branch 'vfs.file'Christian Brauner
Bring in the changes to the file infrastructure for this cycle. Mostly cleanups and some performance tweaks. * file: remove __receive_fd() * file: stop exposing receive_fd_user() * fs: replace f_rcuhead with f_task_work * file: remove pointless wrapper * file: s/close_fd_get_file()/file_close_fd()/g * Improve __fget_files_rcu() code generation (and thus __fget_light()) * file: massage cleanup of files that failed to open Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-14io_uring/cmd: fix breakage in SOCKET_URING_OP_SIOC* implementationAl Viro
In 8e9fad0e70b7 "io_uring: Add io_uring command support for sockets" you've got an include of asm-generic/ioctls.h done in io_uring/uring_cmd.c. That had been done for the sake of this chunk - + ret = prot->ioctl(sk, SIOCINQ, &arg); + if (ret) + return ret; + return arg; + case SOCKET_URING_OP_SIOCOUTQ: + ret = prot->ioctl(sk, SIOCOUTQ, &arg); SIOC{IN,OUT}Q are defined to symbols (FIONREAD and TIOCOUTQ) that come from ioctls.h, all right, but the values vary by the architecture. FIONREAD is 0x467F on mips 0x4004667F on alpha, powerpc and sparc 0x8004667F on sh and xtensa 0x541B everywhere else TIOCOUTQ is 0x7472 on mips 0x40047473 on alpha, powerpc and sparc 0x80047473 on sh and xtensa 0x5411 everywhere else ->ioctl() expects the same values it would've gotten from userland; all places where we compare with SIOC{IN,OUT}Q are using asm/ioctls.h, so they pick the correct values. io_uring_cmd_sock(), OTOH, ends up passing the default ones. Fixes: 8e9fad0e70b7 ("io_uring: Add io_uring command support for sockets") Cc: <stable@vger.kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20231214213408.GT1674809@ZenIV Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-13io_uring/poll: don't enable lazy wake for POLLEXCLUSIVEJens Axboe
There are a few quirks around using lazy wake for poll unconditionally, and one of them is related the EPOLLEXCLUSIVE. Those may trigger exclusive wakeups, which wake a limited number of entries in the wait queue. If that wake number is less than the number of entries someone is waiting for (and that someone is also using DEFER_TASKRUN), then we can get stuck waiting for more entries while we should be processing the ones we already got. If we're doing exclusive poll waits, flag the request as not being compatible with lazy wakeups. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 6ce4a93dbb5b ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12splice: return type ssize_t from all helpersAmir Goldstein
Not sure why some splice helpers return long, maybe historic reasons. Change them all to return ssize_t to conform to the splice methods and to the rest of the helpers. Suggested-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20231208-horchen-helium-d3ec1535ede5@brauner/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231212094440.250945-2-amir73il@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-12file: remove pointless wrapperChristian Brauner
Only io_uring uses __close_fd_get_file(). All it does is hide current->files but io_uring accesses files_struct directly right now anyway so it's a bit pointless. Just rename pick_file() to file_close_fd_locked() and let io_uring use it. Add a lockdep assert in there that we expect the caller to hold file_lock while we're at it. Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-2-e73ca6f4ea83@kernel.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-07io_uring/af_unix: disable sending io_uring over socketsPavel Begunkov
File reference cycles have caused lots of problems for io_uring in the past, and it still doesn't work exactly right and races with unix_stream_read_generic(). The safest fix would be to completely disallow sending io_uring files via sockets via SCM_RIGHT, so there are no possible cycles invloving registered files and thus rendering SCM accounting on the io_uring side unnecessary. Cc: <stable@vger.kernel.org> Fixes: 0091bfc81741b ("io_uring/af_unix: defer registered files gc to io_uring release") Reported-and-suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c716c88321939156909cfa1bd8b0faaf1c804103.1701868795.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-05io_uring/kbuf: check for buffer list readiness after NULL checkJens Axboe
Move the buffer list 'is_ready' check below the validity check for the buffer list for a given group. Fixes: 5cf4f52e6d8a ("io_uring: free io_buffer_list entries via RCU") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-05io_uring/kbuf: Fix an NULL vs IS_ERR() bug in io_alloc_pbuf_ring()Dan Carpenter
The io_mem_alloc() function returns error pointers, not NULL. Update the check accordingly. Fixes: b10b73c102a2 ("io_uring/kbuf: recycle freed mapped buffer ring entries") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/5ed268d3-a997-4f64-bd71-47faa92101ab@moroto.mountain Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-03io_uring: fix mutex_unlock with unreferenced ctxPavel Begunkov
Callers of mutex_unlock() have to make sure that the mutex stays alive for the whole duration of the function call. For io_uring that means that the following pattern is not valid unless we ensure that the context outlives the mutex_unlock() call. mutex_lock(&ctx->uring_lock); req_put(req); // typically via io_req_task_submit() mutex_unlock(&ctx->uring_lock); Most contexts are fine: io-wq pins requests, syscalls hold the file, task works are taking ctx references and so on. However, the task work fallback path doesn't follow the rule. Cc: <stable@vger.kernel.org> Fixes: 04fc6c802d ("io_uring: save ctx put/get for task_work submit") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/CAG48ez3xSoYb+45f1RLtktROJrpiDQ1otNvdR+YLQf7m+Krj5Q@mail.gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring: use fget/fput consistentlyJens Axboe
Normally within a syscall it's fine to use fdget/fdput for grabbing a file from the file table, and it's fine within io_uring as well. We do that via io_uring_enter(2), io_uring_register(2), and then also for cancel which is invoked from the latter. io_uring cannot close its own file descriptors as that is explicitly rejected, and for the cancel side of things, the file itself is just used as a lookup cookie. However, it is more prudent to ensure that full references are always grabbed. For anything threaded, either explicitly in the application itself or through use of the io-wq worker threads, this is what happens anyway. Generalize it and use fget/fput throughout. Also see the below link for more details. Link: https://lore.kernel.org/io-uring/CAG48ez1htVSO3TqmrF8QcX2WFuYTRM-VZ_N10i-VZgbtg=NNqw@mail.gmail.com/ Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring: free io_buffer_list entries via RCUJens Axboe
mmap_lock nests under uring_lock out of necessity, as we may be doing user copies with uring_lock held. However, for mmap of provided buffer rings, we attempt to grab uring_lock with mmap_lock already held from do_mmap(). This makes lockdep, rightfully, complain: WARNING: possible circular locking dependency detected 6.7.0-rc1-00009-gff3337ebaf94-dirty #4438 Not tainted ------------------------------------------------------ buf-ring.t/442 is trying to acquire lock: ffff00020e1480a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_uring_validate_mmap_request.isra.0+0x4c/0x140 but task is already holding lock: ffff0000dc226190 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x124/0x264 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_lock){++++}-{3:3}: __might_fault+0x90/0xbc io_register_pbuf_ring+0x94/0x488 __arm64_sys_io_uring_register+0x8dc/0x1318 invoke_syscall+0x5c/0x17c el0_svc_common.constprop.0+0x108/0x130 do_el0_svc+0x2c/0x38 el0_svc+0x4c/0x94 el0t_64_sync_handler+0x118/0x124 el0t_64_sync+0x168/0x16c -> #0 (&ctx->uring_lock){+.+.}-{3:3}: __lock_acquire+0x19a0/0x2d14 lock_acquire+0x2e0/0x44c __mutex_lock+0x118/0x564 mutex_lock_nested+0x20/0x28 io_uring_validate_mmap_request.isra.0+0x4c/0x140 io_uring_mmu_get_unmapped_area+0x3c/0x98 get_unmapped_area+0xa4/0x158 do_mmap+0xec/0x5b4 vm_mmap_pgoff+0x158/0x264 ksys_mmap_pgoff+0x1d4/0x254 __arm64_sys_mmap+0x80/0x9c invoke_syscall+0x5c/0x17c el0_svc_common.constprop.0+0x108/0x130 do_el0_svc+0x2c/0x38 el0_svc+0x4c/0x94 el0t_64_sync_handler+0x118/0x124 el0t_64_sync+0x168/0x16c From that mmap(2) path, we really just need to ensure that the buffer list doesn't go away from underneath us. For the lower indexed entries, they never go away until the ring is freed and we can always sanely reference those as long as the caller has a file reference. For the higher indexed ones in our xarray, we just need to ensure that the buffer list remains valid while we return the address of it. Free the higher indexed io_buffer_list entries via RCU. With that we can avoid needing ->uring_lock inside mmap(2), and simply hold the RCU read lock around the buffer list lookup and address check. To ensure that the arrayed lookup either returns a valid fully formulated entry via RCU lookup, add an 'is_ready' flag that we access with store and release memory ordering. This isn't needed for the xarray lookups, but doesn't hurt either. Since this isn't a fast path, retain it across both types. Similarly, for the allocated array inside the ctx, ensure we use the proper load/acquire as setup could in theory be running in parallel with mmap. While in there, add a few lockdep checks for documentation purposes. Cc: stable@vger.kernel.org Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring/kbuf: prune deferred locked cache when tearing downJens Axboe
We used to just use our page list for final teardown, which would ensure that we got all the buffers, even the ones that were not on the normal cached list. But while moving to slab for the io_buffers, we know only prune this list, not the deferred locked list that we have. This can cause a leak of memory, if the workload ends up using the intermediate locked list. Fix this by always pruning both lists when tearing down. Fixes: b3a4dbc89d40 ("io_uring/kbuf: Use slab for struct io_buffer objects") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring/kbuf: recycle freed mapped buffer ring entriesJens Axboe
Right now we stash any potentially mmap'ed provided ring buffer range for freeing at release time, regardless of when they get unregistered. Since we're keeping track of these ranges anyway, keep track of their registration state as well, and use that to recycle ranges when appropriate rather than always allocate new ones. The lookup is a basic scan of entries, checking for the best matching free entry. Fixes: c392cbecd8ec ("io_uring/kbuf: defer release of mapped buffer rings") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring/kbuf: defer release of mapped buffer ringsJens Axboe
If a provided buffer ring is setup with IOU_PBUF_RING_MMAP, then the kernel allocates the memory for it and the application is expected to mmap(2) this memory. However, io_uring uses remap_pfn_range() for this operation, so we cannot rely on normal munmap/release on freeing them for us. Stash an io_buf_free entry away for each of these, if any, and provide a helper to free them post ->release(). Cc: stable@vger.kernel.org Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28eventfd: simplify eventfd_signal_mask()Christian Brauner
The eventfd_signal_mask() helper was introduced for io_uring and similar to eventfd_signal() it always passed 1 for @n. So don't bother with that argument at all. Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-3-bd549b14ce0c@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-27io_uring: enable io_mem_alloc/free to be used in other partsJens Axboe
In preparation for using these helpers, make them non-static and add them to our internal header. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-27io_uring: don't guard IORING_OFF_PBUF_RING with SETUP_NO_MMAPJens Axboe
This flag only applies to the SQ and CQ rings, it's perfectly valid to use a mmap approach for the provided ring buffers. Move the check into where it belongs. Cc: stable@vger.kernel.org Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-27io_uring: don't allow discontig pages for IORING_SETUP_NO_MMAPJens Axboe
io_sqes_map() is used rather than io_mem_alloc(), if the application passes in memory for mapping rather than have the kernel allocate it and then mmap(2) the ranges. This then calls __io_uaddr_map() to perform the page mapping and pinning, which checks if we end up with the same pages, if more than one page is mapped. But this check is incorrect and only checks if the first and last pages are the same, where it really should be checking if the mapped pages are contigous. This allows mapping a single normal page, or a huge page range. Down the line we can add support for remapping pages to be virtually contigous, which is really all that io_uring cares about. Cc: stable@vger.kernel.org Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20io_uring: fix off-by one bvec indexKeith Busch
If the offset equals the bv_len of the first registered bvec, then the request does not include any of that first bvec. Skip it so that drivers don't have to deal with a zero length bvec, which was observed to break NVMe's PRP list creation. Cc: stable@vger.kernel.org Fixes: bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed buffers") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231120221831.2646460-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20io_uring/fs: consider link->flags when getting path for LINKATCharles Mirabile
In order for `AT_EMPTY_PATH` to work as expected, the fact that the user wants that behavior needs to make it to `getname_flags` or it will return ENOENT. Fixes: cf30da90bc3a ("io_uring: add support for IORING_OP_LINKAT") Cc: <stable@vger.kernel.org> Link: https://github.com/axboe/liburing/issues/995 Signed-off-by: Charles Mirabile <cmirabil@redhat.com> Link: https://lore.kernel.org/r/20231120105545.1209530-1-cmirabil@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-15io_uring/fdinfo: remove need for sqpoll lock for thread/pid retrievalJens Axboe
A previous commit added a trylock for getting the SQPOLL thread info via fdinfo, but this introduced a regression where we often fail to get it if the thread is busy. For that case, we end up not printing the current CPU and PID info. Rather than rely on this lock, just print the pid we already stored in the io_sq_data struct, and ensure we update the current CPU every time we've slept or potentially rescheduled. The latter won't potentially be 100% accurate, but that wasn't the case before either as the task can get migrated at any time unless it has been pinned at creation time. We retain keeping the io_sq_data dereference inside the ctx->uring_lock, as it has always been, as destruction of the thread and data happen below that. We could make this RCU safe, but there's little point in doing that. With this, we always print the last valid information we had, rather than have spurious outputs with missing information. Fixes: 7644b1a1c9a7 ("io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06io_uring: do not clamp read length for multishot readDylan Yudaken
When doing a multishot read, the code path reuses the old read paths. However this breaks an assumption built into those paths, namely that struct io_rw::len is available for reuse by __io_import_iovec. For multishot this results in len being set for the first receive call, and then subsequent calls are clamped to that buffer length incorrectly. Instead keep len as zero after recycling buffers, to reuse the full buffer size of the next selected buffer. Fixes: fc68fcda0491 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Dylan Yudaken <dyudaken@gmail.com> Link: https://lore.kernel.org/r/20231106203909.197089-4-dyudaken@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06io_uring: do not allow multishot read to set addr or lenDylan Yudaken
For addr: this field is not used, since buffer select is forced. But by forcing it to be zero it leaves open future uses of the field. len is actually usable, you could imagine that you want to receive multishot up to a certain length. However right now this is not how it is implemented, and it seems safer to force this to be zero. Fixes: fc68fcda0491 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Dylan Yudaken <dyudaken@gmail.com> Link: https://lore.kernel.org/r/20231106203909.197089-3-dyudaken@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06io_uring: indicate if io_kbuf_recycle did recycle anythingDylan Yudaken
It can be useful to know if io_kbuf_recycle did actually recycle the buffer on the request, or if it left the request alone. Signed-off-by: Dylan Yudaken <dyudaken@gmail.com> Link: https://lore.kernel.org/r/20231106203909.197089-2-dyudaken@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06io_uring/rw: add separate prep handler for fixed read/writeJens Axboe
Rather than sprinkle opcode checks in the generic read/write prep handler, have a separate prep handler for the vectored readv/writev operation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06io_uring/rw: add separate prep handler for readv/writevJens Axboe
Rather than sprinkle opcode checks in the generic read/write prep handler, have a separate prep handler for the vectored readv/writev operation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-03io_uring/net: ensure socket is marked connected on connect retryJens Axboe
io_uring does non-blocking connection attempts, which can yield some unexpected results if a connect request is re-attempted by an an application. This is equivalent to the following sync syscall sequence: sock = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP); connect(sock, &addr, sizeof(addr); ret == -1 and errno == EINPROGRESS expected here. Now poll for POLLOUT on sock, and when that returns, we expect the socket to be connected. But if we follow that procedure with: connect(sock, &addr, sizeof(addr)); you'd expect ret == -1 and errno == EISCONN here, but you actually get ret == 0. If we attempt the connection one more time, then we get EISCON as expected. io_uring used to do this, but turns out that bluetooth fails with EBADFD if you attempt to re-connect. Also looks like EISCONN _could_ occur with this sequence. Retain the ->in_progress logic, but work-around a potential EISCONN or EBADFD error and only in those cases look at the sock_error(). This should work in general and avoid the odd sequence of a repeated connect request returning success when the socket is already connected. This is all a side effect of the socket state being in a CONNECTING state when we get EINPROGRESS, and only a re-connect or other related operation will turn that into CONNECTED. Cc: stable@vger.kernel.org Fixes: 3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT") Link: https://github.com/axboe/liburing/issues/980 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-03io_uring/rw: don't attempt to allocate async data if opcode doesn't need itJens Axboe
The new read multishot method doesn't need to allocate async data ever, as it doesn't do vectored IO and it must only be used with provided buffers. While it doesn't have ->prep_async() set, it also sets ->async_size to 0, which is different from any other read/write type we otherwise support. If it's used on a file type that isn't pollable, we do try and allocate this async data, and then try and use that data. But since we passed in a size of 0 for the data, we get a NULL back on data allocation. We then proceed to dereference that to copy state, and that obviously won't end well. Add a check in io_setup_async_rw() for this condition, and avoid copying state. Also add a check for whether or not buffer selection is specified in prep while at it. Fixes: fc68fcda0491 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Link: https://bugzilla.kernel.org/show_bug.cgi?id=218101 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-01Merge tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring futex support from Jens Axboe: "This adds support for using futexes through io_uring - first futex wake and wait, and then the vectored variant of waiting, futex waitv. For both wait/wake/waitv, we support the bitset variant, as the 'normal' variants can be easily implemented on top of that. PI and requeue are not supported through io_uring, just the above mentioned parts. This may change in the future, but in the spirit of keeping this small (and based on what people have been asking for), this is what we currently have. Wake support is pretty straight forward, most of the thought has gone into the wait side to avoid needing to offload wait operations to a blocking context. Instead, we rely on the usual callbacks to retry and post a completion event, when appropriate. As far as I can recall, the first request for futex support with io_uring came from Andres Freund, working on postgres. His aio rework of postgres was one of the early adopters of io_uring, and futex support was a natural extension for that. This is relevant from both a usability point of view, as well as for effiency and performance. In Andres's words, for the former: Futex wait support in io_uring makes it a lot easier to avoid deadlocks in concurrent programs that have their own buffer pool: Obviously pages in the application buffer pool have to be locked during IO. If the initiator of IO A needs to wait for a held lock B, the holder of lock B might wait for the IO A to complete. The ability to wait for a lock and IO completions at the same time provides an efficient way to avoid such deadlocks and in terms of effiency, even without unlocking the full potential yet, Andres says: Futex wake support in io_uring is useful because it allows for more efficient directed wakeups. For some "locks" postgres has queues implemented in userspace, with wakeup logic that cannot easily be implemented with FUTEX_WAKE_BITSET on a single "futex word" (imagine waiting for journal flushes to have completed up to a certain point). Thus a "lock release" sometimes need to wake up many processes in a row. A quick-and-dirty conversion to doing these wakeups via io_uring lead to a 3% throughput increase, with 12% fewer context switches, albeit in a fairly extreme workload" * tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux: io_uring: add support for vectored futex waits futex: make the vectored futex operations available futex: make futex_parse_waitv() available as a helper futex: add wake_data to struct futex_q io_uring: add support for futex wake and wait futex: abstract out a __futex_wake_mark() helper futex: factor out the futex wake handling futex: move FUTEX2_VALID_MASK to futex.h
2023-11-01Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring {get,set}sockopt support from Jens Axboe: "This adds support for using getsockopt and setsockopt via io_uring. The main use cases for this is to enable use of direct descriptors, rather than first instantiating a normal file descriptor, doing the option tweaking needed, then turning it into a direct descriptor. With this support, we can avoid needing a regular file descriptor completely. The net and bpf bits have been signed off on their side" * tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux: selftests/bpf/sockopt: Add io_uring support io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT io_uring/cmd: return -EOPNOTSUPP if net is disabled selftests/net: Extract uring helpers to be reusable tools headers: Grab copy of io_uring.h io_uring/cmd: Pass compat mode in issue_flags net/socket: Break down __sys_getsockopt net/socket: Break down __sys_setsockopt bpf: Add sockptr support for setsockopt bpf: Add sockptr support for getsockopt
2023-11-01Merge tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: "This contains the core io_uring updates, of which there are not many, and adds support for using WAITID through io_uring and hence not needing to block on these kinds of events. Outside of that, tweaks to the legacy provided buffer handling and some cleanups related to cancelations for uring_cmd support" * tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux: io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups io_uring/kbuf: Use slab for struct io_buffer objects io_uring/kbuf: Allow the full buffer id space for provided buffers io_uring/kbuf: Fix check of BID wrapping in provided buffers io_uring/rsrc: cleanup io_pin_pages() io_uring: cancelable uring_cmd io_uring: retain top 8bits of uring_cmd flags for kernel internal use io_uring: add IORING_OP_WAITID support exit: add internal include file with helpers exit: add kernel_waitid_prepare() helper exit: move core of do_wait() into helper exit: abstract out should_wake helper for child_wait_callback() io_uring/rw: add support for IORING_OP_READ_MULTISHOT io_uring/rw: mark readv/writev as vectored in the opcode definition io_uring/rw: split io_read() into a helper
2023-10-30Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds
Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Rename and export helpers that get write access to a mount. They are used in overlayfs to get write access to the upper mount. - Print the pretty name of the root device on boot failure. This helps in scenarios where we would usually only print "unknown-block(1,2)". - Add an internal SB_I_NOUMASK flag. This is another part in the endless POSIX ACL saga in a way. When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip the umask because if the relevant inode has POSIX ACLs set it might take the umask from there. But if the inode doesn't have any POSIX ACLs set then we apply the umask in the filesytem itself. So we end up with: (1) no SB_POSIXACL -> strip umask in vfs (2) SB_POSIXACL -> strip umask in filesystem The umask semantics associated with SB_POSIXACL allowed filesystems that don't even support POSIX ACLs at all to raise SB_POSIXACL purely to avoid umask stripping. That specifically means NFS v4 and Overlayfs. NFS v4 does it because it delegates this to the server and Overlayfs because it needs to delegate umask stripping to the upper filesystem, i.e., the filesystem used as the writable layer. This went so far that SB_POSIXACL is raised eve on kernels that don't even have POSIX ACL support at all. Stop this blatant abuse and add SB_I_NOUMASK which is an internal superblock flag that filesystems can raise to opt out of umask handling. That should really only be the two mentioned above. It's not that we want any filesystems to do this. Ideally we have all umask handling always in the vfs. - Make overlayfs use SB_I_NOUMASK too. - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in IS_POSIXACL() if the kernel doesn't have support for it. This is a very old patch but it's only possible to do this now with the wider cleanup that was done. - Follow-up work on fake path handling from last cycle. Citing mostly from Amir: When overlayfs was first merged, overlayfs files of regular files and directories, the ones that are installed in file table, had a "fake" path, namely, f_path is the overlayfs path and f_inode is the "real" inode on the underlying filesystem. In v6.5, we took another small step by introducing of the backing_file container and the file_real_path() helper. This change allowed vfs and filesystem code to get the "real" path of an overlayfs backing file. With this change, we were able to make fsnotify work correctly and report events on the "real" filesystem objects that were accessed via overlayfs. This method works fine, but it still leaves the vfs vulnerable to new code that is not aware of files with fake path. A recent example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version"). This commit uses direct referencing to f_path in IMA code that otherwise uses file_inode() and file_dentry() to reference the filesystem objects that it is measuring. This contains work to switch things around: instead of having filesystem code opt-in to get the "real" path, have generic code opt-in for the "fake" path in the few places that it is needed. Is it far more likely that new filesystems code that does not use the file_dentry() and file_real_path() helpers will end up causing crashes or averting LSM/audit rules if we keep the "fake" path exposed by default. This change already makes file_dentry() moot, but for now we did not change this helper just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions. After the dust settles on this change, we can make file_dentry() a plain accessor and we can drop the inode argument to ->d_real(). - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small change but it really isn't and I would like to see everyone on their tippie toes for any possible bugs from this work. Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for files since a very long time because of the nasty interactions between the SCM_RIGHTS file descriptor garbage collection. So extending it makes a lot of sense but it is a subtle change. There are almost no places that fiddle with file rcu semantics directly and the ones that did mess around with struct file internal under rcu have been made to stop doing that because it really was always dodgy. I forgot to put in the link tag for this change and the discussion in the commit so adding it into the merge message: https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com Cleanups: - Various smaller pipe cleanups including the removal of a spin lock that was only used to protect against writes without pipe_lock() from O_NOTIFICATION_PIPE aka watch queues. As that was never implemented remove the additional locking from pipe_write(). - Annotate struct watch_filter with the new __counted_by attribute. - Clarify do_unlinkat() cleanup so that it doesn't look like an extra iput() is done that would cause issues. - Simplify file cleanup when the file has never been opened. - Use module helper instead of open-coding it. - Predict error unlikely for stale retry. - Use WRITE_ONCE() for mount expiry field instead of just commenting that one hopes the compiler doesn't get smart. Fixes: - Fix readahead on block devices. - Fix writeback when layztime is enabled and inodes whose timestamp is the only thing that changed reside on wb->b_dirty_time. This caused excessively large zombie memory cgroup when lazytime was enabled as such inodes weren't handled fast enough. - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()" * tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits) file, i915: fix file reference for mmap_singleton() vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs chardev: Simplify usage of try_module_get() ovl: rely on SB_I_NOUMASK fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n fs: store real path instead of fake path in backing file f_path fs: create helper file_user_path() for user displayed mapped file path fs: get mnt_writers count for an open backing file's real path vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked vfs: predict the error in retry_estale as unlikely backing file: free directly vfs: fix readahead(2) on block devices io_uring: use files_lookup_fd_locked() file: convert to SLAB_TYPESAFE_BY_RCU vfs: shave work on failed file open fs: simplify misleading code to remove ambiguity regarding ihold()/iput() watch_queue: Annotate struct watch_filter with __counted_by fs/pipe: use spinlock in pipe_read() only if there is a watch_queue fs/pipe: remove unnecessary spinlock from pipe_write() ...
2023-10-27Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc filesystem fixes from Al Viro: "Assorted fixes all over the place: literally nothing in common, could have been three separate pull requests. All are simple regression fixes, but not for anything from this cycle" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed sparc32: fix a braino in fault handling in csum_and_copy_..._user()
2023-10-27io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() ↵Al Viro
failed ->ki_pos value is unreliable in such cases. For an obvious example, consider O_DSYNC write - we feed the data to page cache and start IO, then we make sure it's completed. Update of ->ki_pos is dealt with by the first part; failure in the second ends up with negative value returned _and_ ->ki_pos left advanced as if sync had been successful. In the same situation write(2) does not advance the file position at all. Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-10-25io_uring/rw: disable IOCB_DIO_CALLER_COMPJens Axboe
If an application does O_DIRECT writes with io_uring and the file system supports IOCB_DIO_CALLER_COMP, then completions of the dio write side is done from the task_work that will post the completion event for said write as well. Whenever a dio write is done against a file, the inode i_dio_count is elevated. This enables other callers to use inode_dio_wait() to wait for previous writes to complete. If we defer the full dio completion to task_work, we are dependent on that task_work being run before the inode i_dio_count can be decremented. If the same task that issues io_uring dio writes with IOCB_DIO_CALLER_COMP performs a synchronous system call that calls inode_dio_wait(), then we can deadlock as we're blocked sleeping on the event to become true, but not processing the completions that will result in the inode i_dio_count being decremented. Until we can guarantee that this is the case, then disable the deferred caller completions. Fixes: 099ada2c8726 ("io_uring/rw: add write support for IOCB_DIO_CALLER_COMP") Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-25io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pidJens Axboe
We could race with SQ thread exit, and if we do, we'll hit a NULL pointer dereference when the thread is cleared. Grab the SQPOLL data lock before attempting to get the task cpu and pid for fdinfo, this ensures we have a stable view of it. Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=218032 Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPTBreno Leitao
Add initial support for SOCKET_URING_OP_SETSOCKOPT. This new command is similar to setsockopt. This implementation leverages the function do_sock_setsockopt(), which is shared with the setsockopt() system call path. Important to say that userspace needs to keep the pointer's memory alive until the operation is completed. I.e, the memory could not be deallocated before the CQE is returned to userspace. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231016134750.1381153-11-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPTBreno Leitao
Add support for getsockopt command (SOCKET_URING_OP_GETSOCKOPT), where level is SOL_SOCKET. This is leveraging the sockptr_t infrastructure, where a sockptr_t is either userspace or kernel space, and handled as such. Differently from the getsockopt(2), the optlen field is not a userspace pointers. In getsockopt(2), userspace provides optlen pointer, which is overwritten by the kernel. In this implementation, userspace passes a u32, and the new value is returned in cqe->res. I.e., optlen is not a pointer. Important to say that userspace needs to keep the pointer alive until the CQE is completed. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231016134750.1381153-10-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19io_uring/cmd: return -EOPNOTSUPP if net is disabledBreno Leitao
Protect io_uring_cmd_sock() to be called if CONFIG_NET is not set. If network is not enabled, but io_uring is, then we want to return -EOPNOTSUPP for any possible socket operation. This is helpful because io_uring_cmd_sock() can now call functions that only exits if CONFIG_NET is enabled without having #ifdef CONFIG_NET inside the function itself. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20231016134750.1381153-9-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19io_uring/cmd: Pass compat mode in issue_flagsBreno Leitao
Create a new flag to track if the operation is running compat mode. This basically check the context->compat and pass it to the issue_flags, so, it could be queried later in the callbacks. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231016134750.1381153-6-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeupsJens Axboe
With poll triggered retries, each event trigger will cause a task_work item to be added for processing. If the ring is setup with IORING_SETUP_DEFER_TASKRUN and a task is waiting on multiple events to complete, any task_work addition will wake the task for processing these items. This can cause more context switches than we would like, if the application is deliberately waiting on multiple items to increase efficiency. For example, if an application has receive multishot armed for sockets and wants to wait for N to complete within M usec of time, we should not be waking up and processing these items until we have all the events we asked for. By switching the poll trigger to lazy wake, we'll process them when they are all ready, in one swoop, rather than wake multiple times only to process one and then go back to sleep. At some point we probably want to look at just making the lazy wake the default, but for now, let's just selectively enable it where it makes sense. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19io_uring: use files_lookup_fd_locked()Christian Brauner
While valid we don't need to open-code rcu dereferences if we're acquiring file_lock anyway. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20231010030615.GO800259@ZenIV Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18io_uring: fix crash with IORING_SETUP_NO_MMAP and invalid SQ ring addressJens Axboe
If we specify a valid CQ ring address but an invalid SQ ring address, we'll correctly spot this and free the allocated pages and clear them to NULL. However, we don't clear the ring page count, and hence will attempt to free the pages again. We've already cleared the address of the page array when freeing them, but we don't check for that. This causes the following crash: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 Oops [#1] Modules linked in: CPU: 0 PID: 20 Comm: kworker/u2:1 Not tainted 6.6.0-rc5-dirty #56 Hardware name: ucbbar,riscvemu-bare (DT) Workqueue: events_unbound io_ring_exit_work epc : io_pages_free+0x2a/0x58 ra : io_rings_free+0x3a/0x50 epc : ffffffff808811a2 ra : ffffffff80881406 sp : ffff8f80000c3cd0 status: 0000000200000121 badaddr: 0000000000000000 cause: 000000000000000d [<ffffffff808811a2>] io_pages_free+0x2a/0x58 [<ffffffff80881406>] io_rings_free+0x3a/0x50 [<ffffffff80882176>] io_ring_exit_work+0x37e/0x424 [<ffffffff80027234>] process_one_work+0x10c/0x1f4 [<ffffffff8002756e>] worker_thread+0x252/0x31c [<ffffffff8002f5e4>] kthread+0xc4/0xe0 [<ffffffff8000332a>] ret_from_fork+0xa/0x1c Check for a NULL array in io_pages_free(), but also clear the page counts when we free them to be on the safer side. Reported-by: rtm@csail.mit.edu Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes") Cc: stable@vger.kernel.org Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-05io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls()Jeff Moyer
I received a bug report with the following signature: [ 1759.937637] BUG: unable to handle page fault for address: ffffffffffffffe8 [ 1759.944564] #PF: supervisor read access in kernel mode [ 1759.949732] #PF: error_code(0x0000) - not-present page [ 1759.954901] PGD 7ab615067 P4D 7ab615067 PUD 7ab617067 PMD 0 [ 1759.960596] Oops: 0000 1 PREEMPT SMP PTI [ 1759.964804] CPU: 15 PID: 109 Comm: cpuhp/15 Kdump: loaded Tainted: G X ------- — 5.14.0-362.3.1.el9_3.x86_64 #1 [ 1759.976609] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018 [ 1759.985181] RIP: 0010:io_wq_for_each_worker.isra.0+0x24/0xa0 [ 1759.990877] Code: 90 90 90 90 90 90 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d 6f 78 53 48 8b 47 78 48 39 c5 74 4f 49 89 f5 49 89 d4 48 8d 58 e8 <8b> 13 85 d2 74 32 8d 4a 01 89 d0 f0 0f b1 0b 75 5c 09 ca 78 3d 48 [ 1760.009758] RSP: 0000:ffffb6f403603e20 EFLAGS: 00010286 [ 1760.015013] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: 0000000000000000 [ 1760.022188] RDX: ffffb6f403603e50 RSI: ffffffffb11e95b0 RDI: ffff9f73b09e9400 [ 1760.029362] RBP: ffff9f73b09e9478 R08: 000000000000000f R09: 0000000000000000 [ 1760.036536] R10: ffffffffffffff00 R11: ffffb6f403603d80 R12: ffffb6f403603e50 [ 1760.043712] R13: ffffffffb11e95b0 R14: ffffffffb28531e8 R15: ffff9f7a6fbdf548 [ 1760.050887] FS: 0000000000000000(0000) GS:ffff9f7a6fbc0000(0000) knlGS:0000000000000000 [ 1760.059025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1760.064801] CR2: ffffffffffffffe8 CR3: 00000007ab610002 CR4: 00000000007706e0 [ 1760.071976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1760.079150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1760.086325] PKRU: 55555554 [ 1760.089044] Call Trace: [ 1760.091501] <TASK> [ 1760.093612] ? show_trace_log_lvl+0x1c4/0x2df [ 1760.097995] ? show_trace_log_lvl+0x1c4/0x2df [ 1760.102377] ? __io_wq_cpu_online+0x54/0xb0 [ 1760.106584] ? __die_body.cold+0x8/0xd [ 1760.110356] ? page_fault_oops+0x134/0x170 [ 1760.114479] ? kernelmode_fixup_or_oops+0x84/0x110 [ 1760.119298] ? exc_page_fault+0xa8/0x150 [ 1760.123247] ? asm_exc_page_fault+0x22/0x30 [ 1760.127458] ? __pfx_io_wq_worker_affinity+0x10/0x10 [ 1760.132453] ? __pfx_io_wq_worker_affinity+0x10/0x10 [ 1760.137446] ? io_wq_for_each_worker.isra.0+0x24/0xa0 [ 1760.142527] __io_wq_cpu_online+0x54/0xb0 [ 1760.146558] cpuhp_invoke_callback+0x109/0x460 [ 1760.151029] ? __pfx_io_wq_cpu_offline+0x10/0x10 [ 1760.155673] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 1760.160320] cpuhp_thread_fun+0x8d/0x140 [ 1760.164266] smpboot_thread_fn+0xd3/0x1a0 [ 1760.168297] kthread+0xdd/0x100 [ 1760.171457] ? __pfx_kthread+0x10/0x10 [ 1760.175225] ret_from_fork+0x29/0x50 [ 1760.178826] </TASK> [ 1760.181022] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common isst_if_common ipmi_ssif nfit libnvdimm mgag200 i2c_algo_bit ioatdma drm_shmem_helper drm_kms_helper acpi_ipmi syscopyarea x86_pkg_temp_thermal sysfillrect ipmi_si intel_powerclamp sysimgblt ipmi_devintf coretemp acpi_power_meter ipmi_msghandler rapl pcspkr dca intel_pch_thermal intel_cstate ses lpc_ich intel_uncore enclosure hpilo mei_me mei acpi_tad fuse drm xfs sd_mod sg bnx2x nvme nvme_core crct10dif_pclmul crc32_pclmul nvme_common ghash_clmulni_intel smartpqi tg3 t10_pi mdio uas libcrc32c crc32c_intel scsi_transport_sas usb_storage hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod [ 1760.248623] CR2: ffffffffffffffe8 A cpu hotplug callback was issued before wq->all_list was initialized. This results in a null pointer dereference. The fix is to fully setup the io_wq before calling cpuhp_state_add_instance_nocalls(). Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/x49y1ghnecs.fsf@segfault.boston.devel.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-05io_uring/kbuf: Use slab for struct io_buffer objectsGabriel Krisman Bertazi
The allocation of struct io_buffer for metadata of provided buffers is done through a custom allocator that directly gets pages and fragments them. But, slab would do just fine, as this is not a hot path (in fact, it is a deprecated feature) and, by keeping a custom allocator implementation we lose benefits like tracking, poisoning, sanitizers. Finally, the custom code is more complex and requires keeping the list of pages in struct ctx for no good reason. This patch cleans this path up and just uses slab. I microbenchmarked it by forcing the allocation of a large number of objects with the least number of io_uring commands possible (keeping nbufs=USHRT_MAX), with and without the patch. There is a slight increase in time spent in the allocation with slab, of course, but even when allocating to system resources exhaustion, which is not very realistic and happened around 1/2 billion provided buffers for me, it wasn't a significant hit in system time. Specially if we think of a real-world scenario, an application doing register/unregister of provided buffers will hit ctx->io_buffers_cache more often than actually going to slab. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231005000531.30800-4-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>