path: root/fs/netfs/read_collect.c
2025-07-01  netfs: Update tracepoints in a number of ways  (David Howells)

Make a number of updates to the netfs tracepoints:

 (1) Remove a duplicate trace from netfs_unbuffered_write_iter_locked().

 (2) Move the trace in netfs_wake_rreq_flag() to after the flag is cleared
     so that the change appears in the trace.

 (3) Differentiate the use of netfs_rreq_trace_wait/woke_queue symbols.

 (4) Don't do so many trace emissions in the wait functions as some of them
     are redundant.

 (5) In netfs_collect_read_results(), differentiate a subreq that's being
     abandoned vs one that has been consumed in a regular way.

 (6) Add a tracepoint to indicate the call to ->ki_complete().

 (7) Don't double-increment the subreq_counter when retrying a write.

 (8) Move the netfs_sreq_trace_io_progress tracepoint within cifs code to
     just MID_RESPONSE_RECEIVED and add different tracepoints for other MID
     states and note check failure.

Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Paulo Alcantara <pc@manguebit.org>
Signed-off-by: Paulo Alcantara <pc@manguebit.org>
Link: https://lore.kernel.org/20250701163852.2171681-14-dhowells@redhat.com
cc: Steve French <sfrench@samba.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01  netfs: Provide helpers to perform NETFS_RREQ_IN_PROGRESS flag wangling  (David Howells)
Provide helpers to clear and test the NETFS_RREQ_IN_PROGRESS and to insert the appropriate barrierage. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250701163852.2171681-4-dhowells@redhat.com Tested-by: Steve French <sfrench@samba.org> Reviewed-by: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
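To illustrate the kind of helper this describes, here is a minimal sketch, assuming the flag lives in the request's unsigned long flags word; the helper names are illustrative stand-ins, not the actual netfslib API:

	/* Illustrative sketch only - not the actual netfslib helpers. */
	static inline bool example_rreq_in_progress(const struct netfs_io_request *rreq)
	{
		/* Acquire ordering pairs with the release in the clear helper. */
		return test_bit_acquire(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
	}

	static inline void example_rreq_clear_in_progress(struct netfs_io_request *rreq)
	{
		/* clear_and_wake_up_bit() provides the release barrier between
		 * clearing the flag and waking anyone waiting on it.
		 */
		clear_and_wake_up_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
	}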
2025-05-23  netfs: Fix undifferentiation of DIO reads from unbuffered reads  (David Howells)
On cifs, "DIO reads" (specified by O_DIRECT) need to be differentiated from "unbuffered reads" (specified by cache=none in the mount parameters). The difference is flagged in the protocol and the server may behave differently: Windows Server will, for example, mandate that DIO reads are block aligned. Fix this by adding a NETFS_UNBUFFERED_READ to differentiate this from NETFS_DIO_READ, parallelling the write differentiation that already exists. cifs will then do the right thing. Fixes: 016dc8516aec ("netfs: Implement unbuffered/DIO read support") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/3444961.1747987072@warthog.procyon.org.uk Reviewed-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> cc: Steve French <sfrench@samba.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
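As a rough illustration of the distinction (a sketch assuming the caller can see IOCB_DIRECT on the kiocb; not the literal netfslib code), the origin of a non-buffered read might be chosen like this:

	/* Sketch: pick the request origin for a non-buffered read. */
	enum netfs_io_origin origin;

	if (iocb->ki_flags & IOCB_DIRECT)
		origin = NETFS_DIO_READ;		/* O_DIRECT from userspace */
	else
		origin = NETFS_UNBUFFERED_READ;		/* e.g. cifs mounted with cache=none */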
2025-05-21  Merge patch series "netfs: Miscellaneous fixes"  (Christian Brauner)

David Howells <dhowells@redhat.com> says:

Here are some miscellaneous fixes and changes for netfslib, if you could pull
them:

 (1) Fix an oops in write-retry due to mis-resetting the I/O iterator.

 (2) Fix the recording of transferred bytes for short DIO reads.

 (3) Fix a request's work item to not require a reference, thereby avoiding
     the need to get rid of it in BH/IRQ context.

 (4) Fix waiting and waking to be consistent about the waitqueue used.

* patches from https://lore.kernel.org/20250519090707.2848510-1-dhowells@redhat.com:
  netfs: Fix wait/wake to be consistent about the waitqueue used
  netfs: Fix the request's work item to not require a ref
  netfs: Fix setting of transferred bytes with short DIO reads
  netfs: Fix oops in write-retry from mis-resetting the subreq iterator

Link: https://lore.kernel.org/20250519090707.2848510-1-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21  netfs: Fix wait/wake to be consistent about the waitqueue used  (David Howells)
Fix further inconsistencies in the use of waitqueues (clear_and_wake_up_bit() vs private waitqueue). Move some of this stuff from the read and write sides into common code so that it can be done in fewer places. To make this work, async I/O needs to set NETFS_RREQ_OFFLOAD_COLLECTION to indicate that a workqueue will do the collecting and places that call the wait function need to deal with it returning the amount transferred. Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250519090707.2848510-5-dhowells@redhat.com cc: Marc Dionne <marc.dionne@auristor.com> cc: Steve French <stfrench@microsoft.com> cc: Ihor Solodrai <ihor.solodrai@pm.me> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: v9fs@lists.linux.dev cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21  netfs: Fix the request's work item to not require a ref  (David Howells)
When the netfs_io_request struct's work item is queued, it must be supplied with a ref to the work item struct to prevent it being deallocated whilst on the queue or whilst it is being processed. This is tricky to manage as we have to get a ref before we try and queue it and then we may find it's already queued and is thus already holding a ref - in which case we have to try and get rid of the ref again. The problem comes if we're in BH or IRQ context and need to drop the ref: if netfs_put_request() reduces the count to 0, we have to do the cleanup - but the cleanup may need to wait. Fix this by adding a new work item to the request, ->cleanup_work, and dispatching that when the refcount hits zero. That can then synchronously cancel any outstanding work on the main work item before doing the cleanup. Adding a new work item also deals with another problem upstream where it's sometimes changing the work func in the put function and requeuing it - which has occasionally in the past caused the cleanup to happen incorrectly. As a bonus, this allows us to get rid of the 'was_async' parameter from a bunch of functions. This indicated whether the put function might not be permitted to sleep. Fixes: 3d3c95046742 ("netfs: Provide readahead and readpage netfs helpers") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250519090707.2848510-4-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Steve French <stfrench@microsoft.com> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
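The pattern described above - handing the final teardown to a dedicated work item so it never runs in BH/IRQ context - can be sketched roughly as follows (the function names are illustrative, not the exact netfslib code):

	/* Sketch only: drop a ref; defer the cleanup to process context. */
	static void example_put_request(struct netfs_io_request *rreq)
	{
		if (refcount_dec_and_test(&rreq->ref))
			queue_work(system_unbound_wq, &rreq->cleanup_work);
	}

	static void example_cleanup_worker(struct work_struct *work)
	{
		struct netfs_io_request *rreq =
			container_of(work, struct netfs_io_request, cleanup_work);

		/* Now in a worker, so it is safe to sleep: synchronously cancel
		 * any outstanding collection work before freeing the request.
		 */
		cancel_work_sync(&rreq->work);
		/* ... release resources and free rreq ... */
	}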
2025-05-21  netfs: Fix setting of transferred bytes with short DIO reads  (Paulo Alcantara)

A netfslib request comprises an ordered stream of subrequests that, when doing
an unbuffered/DIO read, are contiguous. The subrequests may be performed in
parallel, but may not be fully completed.

For instance, if we try and make a 256KiB DIO read from a 3-byte file with a
64KiB rsize and 256KiB bsize, netfslib will attempt to make a read of 256KiB,
broken up into four 64KiB subreads, with the expectation that the first will
be short and the subsequent three will return no data - but we do all four on
the basis that the file may have been changed by a third party.

The read-collection code, however, walks through all the subreqs and advances
the notion of how much data has been read in the stream to the start of each
subreq plus its amount transferred (which are 3, 0, 0, 0 for the example
above) - which gives an amount apparently read of 3*64KiB - which is
incorrect.

Fix the collection code to cut short the calculation of the transferred
amount with the first short subrequest in an unbuffered read; everything
beyond that must be ignored as there's a hole that cannot be filled. This
applies both to shortness due to hitting the EOF and shortness due to an
error.

This is achieved by setting a flag on the request when we collect the first
short subrequest (collection is done in ascending order).

This can be tested by mounting a cifs volume with rsize=65536,bsize=262144
and doing a 256k DIO read of a very small file (e.g. 3 bytes). read() should
return 3, not >3.

This problem came in when netfs_read_collection() set rreq->transferred to
stream->transferred, even for DIO. Prior to that, netfs_rreq_assess_dio()
just went over the list and added up the subreqs till it met a short one -
but now the subreqs are discarded earlier.

Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Nicolas Baranger <nicolas.baranger@3xo.fr>
Closes: https://lore.kernel.org/all/10bec2430ed4df68bde10ed95295d093@3xo.fr/
Signed-off-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519090707.2848510-3-dhowells@redhat.com
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
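The corrected accumulation rule can be sketched like this (a simplified illustration assuming the stream's subrequest list and the subreq->len/->transferred fields; not the exact collector code):

	/* Sketch: stop counting transferred bytes at the first short subreq. */
	struct netfs_io_subrequest *subreq;
	unsigned long long transferred = 0;

	list_for_each_entry(subreq, &stream->subrequests, rreq_link) {
		transferred += subreq->transferred;
		if (subreq->transferred < subreq->len)
			break;	/* short due to EOF or error: ignore the rest */
	}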
2025-05-21  fs/netfs: remove unused flag NETFS_RREQ_DONT_UNLOCK_FOLIOS  (Max Kellermann)
NETFS_RREQ_DONT_UNLOCK_FOLIOS has never been used ever since it was added by commit 3d3c95046742 ("netfs: Provide readahead and readpage netfs helpers"). Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250519134813.2975312-11-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.com> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19  netfs: Fix collection of results during pause when collection offloaded  (David Howells)

A netfs read request can run in one of two modes: for synchronous reads, the
app thread does the collection of results and for asynchronous reads, this is
offloaded to a worker thread. This is controlled by the
NETFS_RREQ_OFFLOAD_COLLECTION flag.

Now, if a subrequest incurs an error, the NETFS_RREQ_PAUSE flag is set to
stop the issuing loop temporarily from issuing more subrequests until a retry
is successful or the request is abandoned.

When the issuing loop sees NETFS_RREQ_PAUSE, it jumps to
netfs_wait_for_pause() which will wait for the PAUSE flag to be cleared - and
whilst it is waiting, it will call out to the collector as more results
accrue... But this is the wrong thing to do if OFFLOAD_COLLECTION is set as
we can then end up with both the app thread and the work item collecting
results simultaneously.

This manifests itself occasionally when running the generic/323 xfstest
against multichannel cifs as an oops that's a bit random but frequently
involving io_submit() (the test does lots of simultaneous async DIO reads).

Fix this by only doing the collection in netfs_wait_for_pause() if the
NETFS_RREQ_OFFLOAD_COLLECTION flag is not set.

Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Steve French <stfrench@microsoft.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250314164201.1993231-2-dhowells@redhat.com
Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
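A minimal sketch of that guard (illustrative only; the collector call shown is a hypothetical stand-in, not the real function name):

	/* Sketch: while waiting for NETFS_RREQ_PAUSE to clear, only help with
	 * collection if no worker has been put in charge of it.
	 */
	if (!test_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &rreq->flags))
		example_collect_results(rreq);	/* hypothetical collector call */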
2025-02-13  netfs: Fix a number of read-retry hangs  (David Howells)

Fix a number of hangs in the netfslib read-retry code, including:

 (1) netfs_reissue_read() doubles up the getting of references on
     subrequests, thereby leaking the subrequest and causing inode eviction
     to wait indefinitely. This can lead to the kernel reporting a hang in
     the filesystem's evict_inode().

     Fix this by removing the get from netfs_reissue_read() and adding one
     to netfs_retry_read_subrequests() to deal with the one place that
     didn't double up.

 (2) The loop in netfs_retry_read_subrequests() that retries a sequence of
     failed subrequests doesn't record whether or not it retried the one
     that the "subreq" pointer points to when it leaves the loop. It may
     not if renegotiation/repreparation of the subrequests means that fewer
     subrequests are needed to span the cumulative range of the sequence.

     Because it doesn't record this, the piece of code that discards
     now-superfluous subrequests doesn't know whether it should discard the
     one "subreq" points to - and so it doesn't.

     Fix this by noting whether the last subreq it examines is superfluous
     and, if it is, getting rid of it and all subsequent subrequests. If
     that one wasn't superfluous, then we would have tried to go round the
     previous loop again and so there can be no further unretried
     subrequests in the sequence.

 (3) netfs_retry_read_subrequests() gets yet an extra ref on any additional
     subrequests it has to get because it ran out of ones it could reuse due
     to renegotiation/repreparation shrinking the subrequests.

     Fix this by removing that extra ref.

 (4) In netfs_retry_reads(), it was using wait_on_bit() to wait for
     NETFS_SREQ_IN_PROGRESS to be cleared on all subrequests in the
     sequence - but netfs_read_subreq_terminated() is now using a wait queue
     on the request instead and so this wait will never finish.

     Fix this by waiting on the wait queue instead. To make this work, a
     new flag, NETFS_RREQ_RETRYING, is now set around the wait loop to tell
     the wake-up code to wake up the wait queue rather than requeuing the
     request's work item.

     Note that this flag replaces the NETFS_RREQ_NEED_RETRY flag which is no
     longer used.

 (5) Whilst not strictly anything to do with the hang,
     netfs_retry_read_subrequests() was also doubly incrementing the
     subreq_counter and re-setting the debug index, leaving a gap in the
     trace. This is also fixed.

One of these hangs was observed with 9p and with cifs. Others were forced by
manual code injection into fs/afs/file.c. Firstly, afs_prepare_read() was
created to provide a changing pattern of maximum subrequest sizes:

	static int afs_prepare_read(struct netfs_io_subrequest *subreq)
	{
		struct netfs_io_request *rreq = subreq->rreq;

		if (!S_ISREG(subreq->rreq->inode->i_mode))
			return 0;
		if (subreq->retry_count < 20)
			rreq->io_streams[0].sreq_max_len =
				umax(200, 2222 - subreq->retry_count * 40);
		else
			rreq->io_streams[0].sreq_max_len = 3333;
		return 0;
	}

and pointed to by afs_req_ops. Then the following:

	struct netfs_io_subrequest *subreq = op->fetch.subreq;

	if (subreq->error == 0 &&
	    S_ISREG(subreq->rreq->inode->i_mode) &&
	    subreq->retry_count < 20) {
		subreq->transferred = subreq->already_done;
		__clear_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
		__set_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
		afs_fetch_data_notify(op);
		return;
	}

was inserted into afs_fetch_data_success() at the beginning and struct
netfs_io_subrequest given an extra field, "already_done", that was set to the
value in "subreq->transferred" by netfs_reissue_read().

When reading a 4K file, the subrequests would get gradually smaller, a new
subrequest would be allocated around the 3rd retry and then eventually be
rendered superfluous when the 20th retry was hit and the limit on the first
subrequest was eased.

Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250212222402.3618494-2-dhowells@redhat.com
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Steve French <stfrench@microsoft.com>
cc: Ihor Solodrai <ihor.solodrai@pm.me>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20  netfs: Change the read result collector to only use one work item  (David Howells)

Change the way netfslib collects read results to do all the collection for a
particular read request using a single work item that walks along the
subrequest queue as subrequests make progress or complete, unlocking folios
progressively rather than doing the unlock in parallel as parallel requests
come in.

The code is remodelled to be more like the write-side code, though only using
a single stream. This makes it more directly comparable and thus easier to
duplicate fixes between the two sides.

This has a number of advantages:

 (1) It's simpler. There doesn't need to be a complex donation mechanism to
     handle mismatches between the size and alignment of subrequests and
     folios. The collector unlocks folios as the subrequests covering each
     complete.

 (2) It should cause less scheduler overhead as there's a single work item
     in play unlocking pages, instead of one work item per subrequest, when
     a read gets split up into a lot of subrequests.

     Whilst the parallelism is nice in theory, in practice, the vast
     majority of loads are sequential reads of the whole file, so committing
     a bunch of threads to unlocking folios out of order doesn't help in
     those cases.

 (3) It should make it easier to implement content decryption. A folio
     cannot be decrypted until all the requests that contribute to it have
     completed - and, again, most loads are sequential and so, most of the
     time, we want to begin decryption sequentially (though it's great if
     the decryption can happen in parallel).

There is a disadvantage in that we're losing the ability to decrypt and
unlock things on an as-things-arrive basis which may affect some
applications.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-28-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20  netfs: Add support for caching single monolithic objects such as AFS dirs  (David Howells)
Add support for caching the content of a file that contains a single monolithic object that must be read/written with a single I/O operation, such as an AFS directory. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20241216204124.3752367-20-dhowells@redhat.com cc: Jeff Layton <jlayton@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: netfs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20  netfs: Don't use bh spinlock  (David Howells)
All the accessing of the subrequest lists is now done in process context, possibly in a workqueue, but not now in a BH context, so we don't need the lock against BH interference when taking the netfs_io_request::lock spinlock. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20241216204124.3752367-11-dhowells@redhat.com cc: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20  netfs: Drop the was_async arg from netfs_read_subreq_terminated()  (David Howells)

Drop the was_async argument from netfs_read_subreq_terminated(). Almost every
caller is in process context and passes false. Some filesystems delegate the
call to a workqueue to avoid doing the work in their network message queue
parsing thread.

The only exception is netfs_cache_read_terminated() which handles completion
in the cache - which is usually a callback from the backing filesystem in
softirq context, though it can be from process context if an error occurred.
In this case, delegate to a workqueue.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wiVC5Cgyz6QKXFu6fTaA6h4CjexDR-OV9kL6Vo5x9v8=A@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-10-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20  netfs: Drop the error arg from netfs_read_subreq_terminated()  (David Howells)
Drop the error argument from netfs_read_subreq_terminated() in favour of passing the value in subreq->error. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20241216204124.3752367-9-dhowells@redhat.com cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20  netfs: Add a tracepoint to log the lifespan of folio_queue structs  (David Howells)
Add a tracepoint to log the lifespan of folio_queue structs. For tracing illustrative purposes, folio_queues are tagged with the debug ID of whatever they're related to (typically a netfs_io_request) and a debug ID of their own. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20241216204124.3752367-5-dhowells@redhat.com cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20  netfs: Fix ceph copy to cache on write-begin  (David Howells)

At the end of netfs_unlock_read_folio() in which folios are marked
appropriately for copying to the cache (either by being marked dirty and
having their private data set or by having PG_private_2 set) and then
unlocked, the folio_queue struct has the entry pointing to the folio cleared.

This presents a problem for netfs_pgpriv2_write_to_the_cache(), which is used
to write folios marked with PG_private_2 to the cache as it expects to be
able to trawl the folio_queue list thereafter to find the relevant folios,
leading to a hang.

Fix this by not clearing the folio_queue entry if we're going to do the
deprecated copy-to-cache. The clearance will be done instead as the folios
are written to the cache.

This can be reproduced by starting cachefiles, mounting a ceph filesystem
with "-o fsc" and writing to it.

Fixes: 796a4049640b ("netfs: In readahead, put the folio refs as soon extracted")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Closes: https://lore.kernel.org/r/CAKPOu+_4m80thNy5_fvROoxBm689YtA0dZ-=gcmkzwYSY4syqw@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-10-dhowells@redhat.com
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
cc: Jeff Layton <jlayton@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20  netfs: Work around recursion by abandoning retry if nothing read  (David Howells)

syzkaller reported recursion with a loop of three calls (netfs_rreq_assess,
netfs_retry_reads and netfs_rreq_terminated) hitting the limit of the stack
during an unbuffered or direct I/O read.

There are a number of issues:

 (1) There is no limit on the number of retries.

 (2) A subrequest is supposed to be abandoned if it does not transfer
     anything (NETFS_SREQ_NO_PROGRESS), but that isn't checked under all
     circumstances.

 (3) The actual root cause, which is this:

	if (atomic_dec_and_test(&rreq->nr_outstanding))
		netfs_rreq_terminated(rreq, ...);

     When we do a retry, we bump the rreq->nr_outstanding counter to prevent
     the final cleanup phase running before we've finished dispatching the
     retries. The problem is if we hit 0, we have to do the cleanup phase -
     but we're in the cleanup phase and end up repeating the retry cycle,
     hence the recursion.

Work around the problem by limiting the number of retries. This is based on
Lizhi Xu's patch[1], and makes the following changes:

 (1) Replace NETFS_SREQ_NO_PROGRESS with NETFS_SREQ_MADE_PROGRESS and make
     the filesystem set it if it managed to read or write at least one byte
     of data. Clear this bit before issuing a subrequest.

 (2) Add a ->retry_count member to the subrequest and increment it any time
     we do a retry.

 (3) Remove the NETFS_SREQ_RETRYING flag as it is superfluous with
     ->retry_count. If the latter is non-zero, we're doing a retry.

 (4) Abandon a subrequest if retry_count is non-zero and we made no
     progress.

 (5) Use ->retry_count in both the write-side and the read-side.

[?] Question: Should I set a hard limit on retry_count in both read and
write? Say it hits 50, we always abandon it.

The problem is that these changes only mitigate the issue. As long as it
made at least one byte of progress, the recursion is still an issue. This
patch mitigates the problem, but does not fix the underlying cause. I have
patches that will do that, but it's an intrusive fix that's currently pending
for the next merge window.

The oops generated by KASAN looks something like:

   BUG: TASK stack guard page was hit at ffffc9000482ff48 (stack is ffffc90004830000..ffffc90004838000)
   Oops: stack guard page: 0000 [#1] PREEMPT SMP KASAN NOPTI
   ...
   RIP: 0010:mark_lock+0x25/0xc60 kernel/locking/lockdep.c:4686
   ...
    mark_usage kernel/locking/lockdep.c:4646 [inline]
    __lock_acquire+0x906/0x3ce0 kernel/locking/lockdep.c:5156
    lock_acquire.part.0+0x11b/0x380 kernel/locking/lockdep.c:5825
    local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
    ___slab_alloc+0x123/0x1880 mm/slub.c:3695
    __slab_alloc.constprop.0+0x56/0xb0 mm/slub.c:3908
    __slab_alloc_node mm/slub.c:3961 [inline]
    slab_alloc_node mm/slub.c:4122 [inline]
    kmem_cache_alloc_noprof+0x2a7/0x2f0 mm/slub.c:4141
    radix_tree_node_alloc.constprop.0+0x1e8/0x350 lib/radix-tree.c:253
    idr_get_free+0x528/0xa40 lib/radix-tree.c:1506
    idr_alloc_u32+0x191/0x2f0 lib/idr.c:46
    idr_alloc+0xc1/0x130 lib/idr.c:87
    p9_tag_alloc+0x394/0x870 net/9p/client.c:321
    p9_client_prepare_req+0x19f/0x4d0 net/9p/client.c:644
    p9_client_zc_rpc.constprop.0+0x105/0x880 net/9p/client.c:793
    p9_client_read_once+0x443/0x820 net/9p/client.c:1570
    p9_client_read+0x13f/0x1b0 net/9p/client.c:1534
    v9fs_issue_read+0x115/0x310 fs/9p/vfs_addr.c:74
    netfs_retry_read_subrequests fs/netfs/read_retry.c:60 [inline]
    netfs_retry_reads+0x153a/0x1d00 fs/netfs/read_retry.c:232
    netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
    netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
    netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
    netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
    netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
    netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
    netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
    ...
    netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
    netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
    netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
    netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
    netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
    netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
    netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
    netfs_dispatch_unbuffered_reads fs/netfs/direct_read.c:103 [inline]
    netfs_unbuffered_read fs/netfs/direct_read.c:127 [inline]
    netfs_unbuffered_read_iter_locked+0x12f6/0x19b0 fs/netfs/direct_read.c:221
    netfs_unbuffered_read_iter+0xc5/0x100 fs/netfs/direct_read.c:256
    v9fs_file_read_iter+0xbf/0x100 fs/9p/vfs_file.c:361
    do_iter_readv_writev+0x614/0x7f0 fs/read_write.c:832
    vfs_readv+0x4cf/0x890 fs/read_write.c:1025
    do_preadv fs/read_write.c:1142 [inline]
    __do_sys_preadv fs/read_write.c:1192 [inline]
    __se_sys_preadv fs/read_write.c:1187 [inline]
    __x64_sys_preadv+0x22d/0x310 fs/read_write.c:1187
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83

Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Closes: https://syzkaller.appspot.com/bug?extid=1fc6f64c40a9d143cfb6
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241108034020.3695718-1-lizhi.xu@windriver.com/ [1]
Link: https://lore.kernel.org/r/20241213135013.2964079-9-dhowells@redhat.com
Tested-by: syzbot+885c03ad650731743489@syzkaller.appspotmail.com
Suggested-by: Lizhi Xu <lizhi.xu@windriver.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Reported-by: syzbot+885c03ad650731743489@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
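The core of the mitigation - abandon a retried subrequest that made no progress - can be sketched like so (an illustrative fragment using the flags named above, not the exact retry-loop code):

	/* Sketch: don't retry a subrequest again if the previous retry moved
	 * no data at all.
	 */
	if (subreq->retry_count &&
	    !test_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags)) {
		__set_bit(NETFS_SREQ_FAILED, &subreq->flags);	/* abandon it */
	} else {
		subreq->retry_count++;
		__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
		/* ... reissue the subrequest ... */
	}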
2024-12-20  netfs: Fix missing barriers by using clear_and_wake_up_bit()  (David Howells)

Use clear_and_wake_up_bit() rather than something like:

	clear_bit_unlock(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
	wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);

as there needs to be a barrier inserted between the clear and the wake-up,
and clear_and_wake_up_bit() provides that barrier.

Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-8-dhowells@redhat.com
Reviewed-by: Akira Yokosawa <akiyks@gmail.com>
cc: Zilin Guan <zilin@seu.edu.cn>
cc: Akira Yokosawa <akiyks@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
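For comparison, the replacement is the single call described above, which folds the barrier in (the surrounding context is illustrative):

	/* Clears the flag, inserts the needed barrier, then wakes waiters. */
	clear_and_wake_up_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags);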
2024-12-20  netfs: Fix non-contiguous donation between completed reads  (David Howells)

When a read subrequest finishes, if it doesn't have sufficient coverage to
complete the folio(s) covering either side of it, it will donate the excess
coverage to the adjacent subrequests on either side, offloading
responsibility for unlocking the folio(s) covered to them.

Now, preference is given to donating down to a lower file offset over
donating up because that check is done first - but there's no check that the
lower subreq is actually contiguous, and so we can end up donating
incorrectly.

The scenario seen[1] is that an 8MiB readahead request spanning four 2MiB
folios is split into eight 1MiB subreqs (numbered 1 through 8). These
terminate in the order 1,6,2,5,3,7,4,8. What happens is:

 - 1 donates to 2
 - 6 donates to 5
 - 2 completes, unlocking the first folio (with 1).
 - 5 completes, unlocking the third folio (with 6).
 - 3 donates to 4
 - 7 donates to 4 incorrectly
 - 4 completes, unlocking the second folio (with 3), but can't use the
   excess from 7.
 - 8 donates to 4, also incorrectly.

Fix this by preventing downward donation if the subreqs are not contiguous
(in the example above, 7 donates to 4 across the gap left by 5 and 6).

Reported-by: Shyam Prasad N <nspmangalore@gmail.com>
Closes: https://lore.kernel.org/r/CANT5p=qBwjBm-D8soFVVtswGEfmMtQXVW83=TNfUtvyHeFQZBA@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/526707.1733224486@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/20241213135013.2964079-3-dhowells@redhat.com
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
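The contiguity check can be sketched roughly as follows (a simplified illustration using the subrequest start/len fields; not the exact donation code):

	/* Sketch: only donate down if the previous subreq ends exactly where
	 * this one starts; otherwise donate upwards instead.
	 */
	struct netfs_io_subrequest *prev =
		list_prev_entry(subreq, rreq_link);

	if (prev->start + prev->len == subreq->start) {
		/* contiguous: excess may be donated to the lower subrequest */
	} else {
		/* a gap exists (e.g. subreqs already removed in between): donate up */
	}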
2024-10-07  netfs: In readahead, put the folio refs as soon extracted  (David Howells)
netfslib currently defers dropping the ref on the folios it obtains during readahead to after it has started I/O on the basis that we can do it whilst we wait for the I/O to complete, but this runs the risk of the I/O collection racing with this in future. Furthermore, Matthew Wilcox strongly suggests that the refs should be dropped immediately, as readahead_folio() does (netfslib is using __readahead_batch() which doesn't drop the refs). Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/3771538.1728052438@warthog.procyon.org.uk cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
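For context, this is the kind of usage pattern readahead_folio() encourages (an illustrative fragment assuming a struct readahead_control *ractl; not the netfslib extraction code itself):

	/* Sketch: readahead_folio() consumes the readahead ref on each folio
	 * as it hands it out, so no later folio_put() is needed here, unlike
	 * __readahead_batch() which leaves the refs for the caller to drop.
	 */
	struct folio *folio;

	while ((folio = readahead_folio(ractl))) {
		/* queue the (still locked) folio for I/O */
	}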
2024-09-12  netfs: Speed up buffered reading  (David Howells)

Improve the efficiency of buffered reads in a number of ways:

 (1) Overhaul the algorithm in general so that it's a lot more compact and
     split the read submission code between buffered and unbuffered
     versions. The unbuffered version can be vastly simplified.

 (2) Read-result collection is handed off to a work queue rather than being
     done in the I/O thread. Multiple subrequests can be processed
     simultaneously.

 (3) When a subrequest is collected, any folios it fully spans are collected
     and "spare" data on either side is donated to either the previous or
     the next subrequest in the sequence.

Notes:

 (*) Readahead expansion massively slows down fio, presumably because it
     causes a load of extra allocations, both folio and xarray, up front
     before RPC requests can be transmitted.

 (*) RDMA with cifs does appear to work, both with SIW and RXE.

 (*) PG_private_2-based reading and copy-to-cache is split out into its own
     file and altered to use folio_queue. Note that the copy to the cache
     now creates a new write transaction against the cache and adds the
     folios to be copied into it. This allows it to use part of the
     writeback I/O code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-20-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>