summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-01-15io_uring: Factor out a function to parse restrictionsJosh Triplett
Preparation for subsequent work on inherited restrictions. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9bac2b4d1b9b9ab41c55ea3816021be847f354df.1736932318.git.josh@joshtriplett.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-14io_uring/rsrc: require cloned buffers to share accounting contextsJann Horn
When IORING_REGISTER_CLONE_BUFFERS is used to clone buffers from uring instance A to uring instance B, where A and B use different MMs for accounting, the accounting can go wrong: If uring instance A is closed before uring instance B, the pinned memory counters for uring instance B will be decremented, even though the pinned memory was originally accounted through uring instance A; so the MM of uring instance B can end up with negative locked memory. Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/r/CAG48ez1zez4bdhmeGLEFxtbFADY4Czn3CV0u9d_TMcbvRA01bg@mail.gmail.com Fixes: 7cc2a6eadcd7 ("io_uring: add IORING_REGISTER_COPY_BUFFERS method") Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20250114-uring-check-accounting-v1-1-42e4145aa743@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13io_uring: simplify the SQPOLL thread check when cancelling requestsBui Quang Minh
In io_uring_try_cancel_requests, we check whether sq_data->thread == current to determine if the function is called by the SQPOLL thread to do iopoll when IORING_SETUP_SQPOLL is set. This check can race with the SQPOLL thread termination. io_uring_cancel_generic is used in 2 places: io_uring_cancel_generic and io_ring_exit_work. In io_uring_cancel_generic, we have the information whether the current is SQPOLL thread already. And the SQPOLL thread never reaches io_ring_exit_work. So to avoid the racy check, this commit adds a boolean flag to io_uring_try_cancel_requests to determine if the caller is SQPOLL thread. Reported-by: syzbot+3c750be01dab672c513d@syzkaller.appspotmail.com Reported-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Li Zetao <lizetao1@huawei.com> Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20250113160331.44057-1-minhquangbui99@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10io_uring: expose read/write attribute capabilityAnuj Gupta
After commit 9a213d3b80c0, we can pass additional attributes along with read/write. However, userspace doesn't know that. Add a new feature flag IORING_FEAT_RW_ATTR, to notify the userspace that the kernel has this ability. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Tested-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20241205062109.1788-1-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10io_uring/rw: don't gate retry on completion contextJens Axboe
nvme multipath reports that they see spurious -EAGAIN bubbling back to userspace, which is caused by how they handle retries internally through a kworker. However, any data that needs preserving or importing for a read/write request has always been done so at prep time, and we can sanely skip this check. Reported-by: "Haeuptle, Michael" <michael.haeuptle@hpe.com> Link: https://lore.kernel.org/io-uring/DS7PR84MB31105C2C63CFA47BE8CBD6EE95102@DS7PR84MB3110.NAMPRD84.PROD.OUTLOOK.COM/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10io_uring/rw: handle -EAGAIN retry at IO completion timeJens Axboe
Rather than try and have io_read/io_write turn REQ_F_REISSUE into -EAGAIN, catch the REQ_F_REISSUE when the request is otherwise considered as done. This is saner as we know this isn't happening during an actual submission, and it removes the need to randomly check REQ_F_REISSUE after read/write submission. If REQ_F_REISSUE is set, __io_submit_flush_completions() will skip over this request in terms of posting a CQE, and the regular request cleaning will ensure that it gets reissued via io-wq. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10io_uring/rw: use io_rw_recycle() from cleanup pathJens Axboe
Cleanup should always have the uring lock held, it's safe to recycle from here. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-03io_uring/rsrc: simplify the bvec iter count calculationBui Quang Minh
As we don't use iov_iter_advance() but our own logic in io_import_fixed(), we can remove the logic that over-sets the iter's count to len + offset then adjusts it later to len. This helps to make the code cleaner. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://lore.kernel.org/r/20250103150412.12549-1-minhquangbui99@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-30io_uring: ensure io_queue_deferred() is out-of-lineJens Axboe
This is not the hot path, it's a slow path. Yet the locking for it is in the hot path, and __cold does not prevent it from being inlined. Move the locking to the function itself, and mark it noinline as well to avoid it polluting the icache of the hot path. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring/rw: always clear ->bytes_done on io_async_rw setupJens Axboe
A previous commit mistakenly moved the clearing of the in-progress byte count into the section that's dependent on having a cached iovec or not, but it should be cleared for any IO. If not, then extra bytes may be added at IO completion time, causing potentially weird behavior like over-reporting the amount of IO done. Fixes: d7f11616edf5 ("io_uring/rw: Allocate async data through helper") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202412271132.a09c3500-lkp@intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring/rw: use NULL for rw->free_iovec assigmentJens Axboe
It's a pointer, don't use 0 for that. sparse throws a warning for that, as the kernel test robot noticed. Fixes: d7f11616edf5 ("io_uring/rw: Allocate async data through helper") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202412180253.YML3qN4d-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring/rw: don't mask in f_iocb_flagsJens Axboe
A previous commit changed overwriting kiocb->ki_flags with ->f_iocb_flags with masking it in. This breaks for retry situations, where we don't necessarily want to retain previously set flags, like IOCB_NOWAIT. The use case needs IOCB_HAS_METADATA to be persistent, but the change makes all flags persistent, which is an issue. Add a request flag to track whether the request has metadata or not, as that is persistent across issues. Fixes: 59a7d12a7fb5 ("io_uring: introduce attributes for read/write and PI support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring/msg_ring: Drop custom destructorGabriel Krisman Bertazi
kfree can handle slab objects nowadays. Drop the extra callback and just use kfree. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-10-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring: Move old async data allocation helper to headerGabriel Krisman Bertazi
There are two remaining uses of the old async data allocator that do not rely on the alloc cache. I don't want to make them use the new allocator helper because that would require a if(cache) check, which will result in dead code for the cached case (for callers passing a cache, gcc can't prove the cache isn't NULL, and will therefore preserve the check. Since this is an inline function and just a few lines long, keep a second helper to deal with cases where we don't have an async data cache. No functional change intended here. This is just moving the helper around and making it inline. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-9-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring/rw: Allocate async data through helperGabriel Krisman Bertazi
This abstract away the cache details. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-8-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring/net: Allocate msghdr async data through helperGabriel Krisman Bertazi
This abstracts away the cache details. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-7-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring/uring_cmd: Allocate async data through generic helperGabriel Krisman Bertazi
This abstracts away the cache details and simplify the code. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-6-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring/poll: Allocate apoll with generic alloc_cache helperGabriel Krisman Bertazi
This abstracts away the cache details to simplify the code. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-5-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring/futex: Allocate ifd with generic alloc_cache helperGabriel Krisman Bertazi
Instead of open-coding the allocation, use the generic alloc_cache helper. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-4-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-27io_uring: Add generic helper to allocate async dataGabriel Krisman Bertazi
This helper replaces io_alloc_async_data by using the folded allocation. Do it in a header to allow the compiler to decide whether to inline. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring: Fold allocation into alloc_cache helperGabriel Krisman Bertazi
The allocation paths that use alloc_cache duplicate the same code pattern, sometimes in a quite convoluted way. Fold the allocation into the cache code itself, making it just an allocator function, and keeping the cache policy invisible to callers. Another justification for doing this, beyond code simplicity, is that it makes it trivial to test the impact of disabling the cache and using slab directly, which I've used for slab improvement experiments. One relevant detail is that we provide a callback to optionally initialize memory only when we actually reach slab. This allows us to avoid blindly executing the allocation with GFP_ZERO and only clean fields when they matter. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20241216204615.759089-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring: prevent reg-wait speculationsPavel Begunkov
With *ENTER_EXT_ARG_REG instead of passing a user pointer with arguments for the waiting loop the user can specify an offset into a pre-mapped region of memory, in which case the [offset, offset + sizeof(io_uring_reg_wait)) will be intepreted as the argument. As we address a kernel array using a user given index, it'd be a subject to speculation type of exploits. Use array_index_nospec() to prevent that. Make sure to pass not the full region size but truncate by the maximum offset allowed considering the structure size. Fixes: d617b3147d54c ("io_uring: restore back registered wait arguments") Fixes: aa00f67adc2c0 ("io_uring: add support for fixed wait regions") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e3d9da7c43d619de7bcf41d1cd277ab2688c443.1733694126.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring: don't vmap single page regionsPavel Begunkov
When io_check_coalesce_buffer() meets a single page buffer it bails out and tells that it can be coalesced. That's fine for registered buffers as io_coalesce_buffer() wouldn't change anything, but the region code now uses the function to decided on whether to vmap the buffer or not. Report that a single page buffer is trivially coalescable and let io_sqe_buffer_register() to filter them. Fixes: c4d0ac1c1567 ("io_uring/memmap: optimise single folio regions") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cb83e053f318857068447d40c95becebcd8aeced.1733689833.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring: clean up io_prep_rw_setup()David Wei
Remove unnecessary call to iov_iter_save_state() in io_prep_rw_setup() as io_import_iovec() already does this. Then the result from io_import_iovec() can be returned directly. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Tested-by: Li Zetao <lizetao1@huawei.com> Link: https://lore.kernel.org/r/20241207004144.783631-1-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/kbuf: fix unintentional sign extension on shift of reg.bgidColin Ian King
Shifting reg.bgid << IORING_OFF_PBUF_SHIFT results in a promotion from __u16 to a 32 bit signed integer, this is then sign extended to a 64 bit unsigned long on 64 bit architectures. If reg.bgid is greater than 0x7fff then this leads to a sign extended result where all the upper 32 bits of mmap_offset are set to 1. Fix this by casting reg.bgid to the same type as mmap_offset before performing the shift. Fixes: ef62de3c4ad5 ("io_uring/kbuf: use region api for pbuf rings") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20241204153923.401674-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: make bio_integrity_map_user() static inlineJens Axboe
If CONFIG_BLK_DEV_INTEGRITY isn't set, then the dummy helper must be static inline to avoid complaints about the function being unused. Fixes: fe8f4ca7107e ("block: modify bio_integrity_map_user to accept iov_iter as argument") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202411300229.y7h60mDg-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: add support to pass user meta bufferKanchan Joshi
If an iocb contains metadata, extract that and prepare the bip. Based on flags specified by the user, set corresponding guard/app/ref tags to be checked in bip. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20241128112240.8867-11-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23scsi: add support for user-meta interfaceAnuj Gupta
Add support for sending user-meta buffer. Set tags to be checked using flags specified by user/block-layer. With this change, BIP_CTRL_NOCHECK becomes unused. Remove it. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241128112240.8867-10-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23nvme: add support for passing on the application tagKanchan Joshi
With user integrity buffer, there is a way to specify the app_tag. Set the corresponding protocol specific flags and send the app_tag down. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20241128112240.8867-9-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flagsAnuj Gupta
This patch introduces BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags which indicate how the hardware should check the integrity payload. BIP_CHECK_GUARD/REFTAG are conversion of existing semantics, while BIP_CHECK_APPTAG is a new flag. The driver can now just rely on block layer flags, and doesn't need to know the integrity source. Submitter of PI decides which tags to check. This would also give us a unified interface for user and kernel generated integrity. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20241128112240.8867-8-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring: introduce attributes for read/write and PI supportAnuj Gupta
Add the ability to pass additional attributes along with read/write. Application can prepare attibute specific information and pass its address using the SQE field: __u64 attr_ptr; Along with setting a mask indicating attributes being passed: __u64 attr_type_mask; Overall 64 attributes are allowed and currently one attribute 'IORING_RW_ATTR_FLAG_PI' is supported. With PI attribute, userspace can pass following information: - flags: integrity check flags IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG} - len: length of PI/metadata buffer - addr: address of metadata buffer - seed: seed value for reftag remapping - app_tag: application defined 16b value Process this information to prepare uio_meta_descriptor and pass it down using kiocb->private. PI attribute is supported only for direct IO. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20241128112240.8867-7-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23fs: introduce IOCB_HAS_METADATA for metadataAnuj Gupta
Introduce an IOCB_HAS_METADATA flag for the kiocb struct, for handling requests containing meta payload. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241128112240.8867-6-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23fs, iov_iter: define meta io descriptorAnuj Gupta
Add flags to describe checks for integrity meta buffer. Also, introduce a new 'uio_meta' structure that upper layer can use to pass the meta/integrity information. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241128112240.8867-5-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: modify bio_integrity_map_user to accept iov_iter as argumentAnuj Gupta
This patch refactors bio_integrity_map_user to accept iov_iter as argument. This is a prep patch. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20241128112240.8867-4-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: copy back bounce buffer to user-space correctly in case of splitChristoph Hellwig
Copy back the bounce buffer to user-space in entirety when the parent bio completes. The existing code uses bip_iter.bi_size for sizing the copy, which can be modified. So move away from that and fetch it from the vector passed to the block layer. While at it, switch to using better variable names. Fixes: 492c5d455969f ("block: bio-integrity: directly map user buffers") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20241128112240.8867-3-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: define set of integrity flags to be inherited by cloned bipAnuj Gupta
Introduce BIP_CLONE_FLAGS describing integrity flags that should be inherited in the cloned bip from the parent. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20241128112240.8867-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/memmap: unify io_uring mmap'ing codePavel Begunkov
All mapped memory is now backed by regions and we can unify and clean up io_region_validate_mmap() and io_uring_mmap(). Extract a function looking up a region, the rest of the handling should be generic and just needs the region. There is one more ring type specific code, i.e. the mmaping size truncation quirk for IORING_OFF_[S,C]Q_RING, which is left as is. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f5e1eda1562bfd34276de07465525ae5f10e1e84.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/kbuf: use region api for pbuf ringsPavel Begunkov
Convert internal parts of the provided buffer ring managment to the region API. It's the last non-region mapped ring we have, so it also kills a bunch of now unused memmap.c helpers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6c40cf7beaa648558acd4d84bc0fb3279a35d74b.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/kbuf: remove pbuf ring refcountingPavel Begunkov
struct io_buffer_list refcounting was needed for RCU based sync with mmap, now we can kill it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4a9cc54bf0077bb2bf2f3daf917549ddd41080da.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/kbuf: use mmap_lock to sync with mmapPavel Begunkov
A preparation / cleanup patch simplifying the buf ring - mmap synchronisation. Instead of relying on RCU, which is trickier, do it by grabbing the mmap_lock when when anyone tries to publish or remove a registered buffer to / from ->io_bl_xa. Modifications of the xarray should always be protected by both ->uring_lock and ->mmap_lock, while lookups should hold either of them. While a struct io_buffer_list is in the xarray, the mmap related fields like ->flags and ->buf_pages should stay stable. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/af13bde56ee1a26bcaefaa9aad37a9ea318a590e.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring: use region api for CQPavel Begunkov
Convert internal parts of the CQ/SQ array managment to the region API. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/46fc3c801290d6b1ac16023d78f6b8e685c87fd6.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring: use region api for SQPavel Begunkov
Convert internal parts of the SQ managment to the region API. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1fb73ced6b835cb319ab0fe1dc0b2e982a9a5650.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring: pass ctx to io_register_free_ringsPavel Begunkov
A preparation patch, pass the context to io_register_free_rings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c1865fd2b3d4db22d1a1aac7dd06ea22cb990834.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/memmap: implement mmap for regionsPavel Begunkov
The patch implements mmap for the param region and enables the kernel allocation mode. Internally it uses a fixed mmap offset, however the user has to use the offset returned in struct io_uring_region_desc::mmap_offset. Note, mmap doesn't and can't take ->uring_lock and the region / ring lookup is protected by ->mmap_lock, and it's directly peeking at ctx->param_region. We can't protect io_create_region() with the mmap_lock as it'd deadlock, which is why io_create_region_mmap_safe() initialises it for us in a temporary variable and then publishes it with the lock taken. It's intentionally decoupled from main region helpers, and in the future we might want to have a list of active regions, which then could be protected by the ->mmap_lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0f1212bd6af7fb39b63514b34fae8948014221d1.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/memmap: implement kernel allocated regionsPavel Begunkov
Allow the kernel to allocate memory for a region. That's the classical way SQ/CQ are allocated. It's not yet useful to user space as there is no way to mmap it, which is why it's explicitly disabled in io_register_mem_region(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7b8c40e6542546bbf93f4842a9a42a7373b81e0d.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/memmap: add IO_REGION_F_SINGLE_REFPavel Begunkov
Kernel allocated compound pages will have just one reference for the entire page array, add a flag telling io_free_region about that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a7abfa7535e9728d5fcade29a1ea1605ec2c04ce.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/memmap: helper for pinning region pagesPavel Begunkov
In preparation to adding kernel allocated regions extract a new helper that pins user pages. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a17d7c39c3de4266b66b75b2dcf768150e1fc618.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/memmap: optimise single folio regionsPavel Begunkov
We don't need to vmap if memory is already physically contiguous. There are two important cases it covers: PAGE_SIZE regions and huge pages. Use io_check_coalesce_buffer() to get the number of contiguous folios. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d5240af23064a824c29d14d2406f1ae764bf4505.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/memmap: reuse io_free_region for failure pathPavel Begunkov
Regions are going to become more complex with allocation options and optimisations, I want to split initialisation into steps and for that it needs a sane fail path. Reuse io_free_region(), it's smart enough to undo only what's needed and leaves the structure in a consistent state. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b853b4ec407cc80d033d021bdd2c14e22378fc78.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/memmap: account memory before pinningPavel Begunkov
Move memory accounting before page pinning. It shouldn't even try to pin pages if it's not allowed, and accounting is also relatively inexpensive. It also give a better code structure as we do generic accounting and then can branch for different mapping types. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e242b8038411a222e8b269d35e021fa5015289f.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>