Age | Commit message (Collapse) | Author |
|
An issue can occur between write-streaming (storing dirty data in partial
non-uptodate pages) and a cachefiles object being culled to make space.
The problem occurs because the cache object is only marked in use while
there are files open using it. Once it has been released, it can be culled
and the cookie marked disabled.
At this point, a streaming write is permitted to occur (if the cache is
active, we require pages to be prefetched and cached), but the cache can
become active again before this gets flushed out - and then two effects can
occur:
(1) The cache may be asked to write out a region that's less than its DIO
block size (assumed by cachefiles to be PAGE_SIZE) - and this causes
one of two debugging statements to be emitted.
(2) netfs_how_to_modify() gets confused because it sees a page that isn't
allowed to be non-uptodate being uptodate and tries to prefetch it -
leading to a warning that PG_fscache is set twice.
Fix this by the following means:
(1) Add a netfs_inode flag to disallow write-streaming to an inode and set
it if we ever do local caching of that inode. It remains set for the
lifetime of that inode - even if the cookie becomes disabled.
(2) If the no-write-streaming flag is set, then make netfs_how_to_modify()
always want to prefetch instead.
(3) If netfs_how_to_modify() decides it wants to prefetch a folio, but
that folio has write-streamed data in it, then it requires the folio
be flushed first.
(4) Export a counter of the number of times we wanted to prefetch a
non-uptodate page, but found it had write-streamed data in it.
(5) Export a counter of the number of times we cancelled a write to the
cache because it didn't DIO align and remove the debug statements.
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-erofs@lists.ozlabs.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Rearrange the netfs_io_subrequest struct to put the netfs_io_request
pointer (rreq) first. This then allows netfs_io_subrequest to be put in a
union with a pointer to a wrapper around netfs_io_request. This will be
useful in the future for cifs and maybe ceph.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Track the file position above which the server is not expected to have any
data (the "zero point") and preemptively assume that we can satisfy
requests by filling them with zeroes locally rather than attempting to
download them if they're over that line - even if we've written data back
to the server. Assume that any data that was written back above that
position is held in the local cache. Note that we have to split requests
that straddle the line.
Make use of this to optimise away some reads from the server. We need to
set the zero point in the following circumstances:
(1) When we see an extant remote inode and have no cache for it, we set
the zero_point to i_size.
(2) On local inode creation, we set zero_point to 0.
(3) On local truncation down, we reduce zero_point to the new i_size if
the new i_size is lower.
(4) On local truncation up, we don't change zero_point.
(5) On local modification, we don't change zero_point.
(6) On remote invalidation, we set zero_point to the new i_size.
(7) If stored data is discarded from the pagecache or culled from fscache,
we must set zero_point above that if the data also got written to the
server.
(8) If dirty data is written back to the server, but not fscache, we must
set zero_point above that.
(9) If a direct I/O write is made, set zero_point above that.
Assuming the above, any read from the server at or above the zero_point
position will return all zeroes.
The zero_point value can be stored in the cache, provided the above rules
are applied to it by any code that culls part of the local cache.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a flag whereby a filesystem may request that cifs_perform_write()
perform write-through caching. This involves putting pages directly into
writeback rather than dirty and attaching them to a write operation as we
go.
Further, the writes being made are limited to the byte range being written
rather than whole folios being written. This can be used by cifs, for
example, to deal with strict byte-range locking.
This can't be used with content encryption as that may require expansion of
the write RPC beyond the write being made.
This doesn't affect writes via mmap - those are written back in the normal
way; similarly failed writethrough writes are marked dirty and left to
writeback to retry. Another option would be to simply invalidate them, but
the contents can be simultaneously accessed by read() and through mmap.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a launder_folio implementation for netfslib.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide an implementation of writepages for network filesystems to delegate
to.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Make netfslib pass the maximum length to the ->prepare_write() op to tell
the cache how much it can expand the length of a write to. This allows a
write to the server at the end of a file to be limited to a few bytes
whilst writing an entire block to the cache (something required by direct
I/O).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a top-level-ish function that can be pointed to directly by
->read_iter file op.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide an entry point to delegate a filesystem's ->page_mkwrite() to.
This checks for conflicting writes, then attached any netfs-specific group
marking (e.g. ceph snap) to the page to be considered dirty.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Institute a netfs write helper, netfs_file_write_iter(), to be pointed at
by the network filesystem ->write_iter() call. Make it handled buffered
writes by calling the previously defined netfs_perform_write() to copy the
source data into the pagecache.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Implement support for unbuffered writes and direct I/O writes. If the
write is misaligned with respect to the fscrypt block size, then RMW cycles
are performed if necessary. DIO writes are a special case of unbuffered
writes with extra restriction imposed, such as block size alignment
requirements.
Also provide a field that can tell the code to add some extra space onto
the bounce buffer for use by the filesystem in the case of a
content-encrypted file.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Implement support for unbuffered and DIO reads in the netfs library,
utilising the existing read helper code to do block splitting and
individual queuing. The code also handles extraction of the destination
buffer from the supplied iterator, allowing async unbuffered reads to take
place.
The read will be split up according to the rsize setting and, if supplied,
the ->clamp_length() method. Note that the next subrequest will be issued
as soon as issue_op returns, without waiting for previous ones to finish.
The network filesystem needs to pause or handle queuing them if it doesn't
want to fire them all at the server simultaneously.
Once all the subrequests have finished, the state will be assessed and the
amount of data to be indicated as having being obtained will be
determined. As the subrequests may finish in any order, if an intermediate
subrequest is short, any further subrequests may be copied into the buffer
and then abandoned.
In the future, this will also take care of doing an unbuffered read from
encrypted content, with the decryption being done by the library.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a netfs write helper, netfs_perform_write() to buffer data to be
written in the pagecache and mark the modified folios dirty.
It will perform "streaming writes" for folios that aren't currently
resident, if possible, storing data in partially modified folios that are
marked dirty, but not uptodate. It will also tag pages as belonging to
fs-specific write groups if so directed by the filesystem.
This is derived from generic_perform_write(), but doesn't use
->write_begin() and ->write_end(), having that logic rolled in instead.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Dispatch one or more write reqeusts to process a writeback slice, where a
slice is tailored more to logical block divisions within the file (such as
crypto blocks, an object layout or cache granules) than the protocol RPC
maximum capacity.
The dispatch doesn't happen until throttling allows, at which point the
entire writeback slice is processed and queued. A slice may be written to
multiple destinations (one or more servers and the local cache) and the
writes to each destination might be split up along different lines.
The writeback slice holds the required folios pinned. An iov_iter is
provided in netfs_write_request that describes the buffer to be used. This
may be part of the pagecache, may have auxiliary padding pages attached or
may be a bounce buffer resulting from crypto or compression. Consequently,
the filesystem must not twiddle the folio markings directly.
The following API is available to the filesystem:
(1) The ->create_write_requests() method is called to ask the filesystem
to create the requests it needs. This is passed the writeback slice
to be processed.
(2) The filesystem should then call netfs_create_write_request() to create
the requests it needs.
(3) Once a request is initialised, netfs_queue_write_request() can be
called to dispatch it asynchronously, if not completed immediately.
(4) netfs_write_request_completed() should be called to note the
completion of a request.
(5) netfs_get_write_request() and netfs_put_write_request() are provided
to refcount a request. These take constants from the netfs_wreq_trace
enum for logging into ftrace.
(6) The ->free_write_request is method is called to ask the filesystem to
clean up a request.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Prepare to use folio->private to hold information write grouping and
streaming write. These are implemented in the same commit as they both
make use of folio->private and will be both checked at the same time in
several places.
"Write grouping" involves ordering the writeback of groups of writes, such
as is needed for ceph snaps. A group is represented by a
filesystem-supplied object which must contain a netfs_group struct. This
contains just a refcount and a pointer to a destructor.
"Streaming write" is the storage of data in folios that are marked dirty,
but not uptodate, to avoid unnecessary reads of data. This is represented
by a netfs_folio struct. This contains the offset and length of the
modified region plus the otherwise displaced write grouping pointer.
The way folio->private is multiplexed is:
(1) If private is NULL then neither is in operation on a dirty folio.
(2) If private is set, with bit 0 clear, then this points to a group.
(3) If private is set, with bit 0 set, then this points to a netfs_folio
struct (with bit 0 AND'ed out).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a hook for netfslib's write helpers to call to tell the network
filesystem that it should update its i_size.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Modify the netfs_io_request struct to act as a point around which writes
can be coordinated. It represents and pins a range of pages that need
writing and a list of regions of dirty data in that range of pages.
If RMW is required, the original data can be downloaded into the bounce
buffer, decrypted if necessary, the modifications made, then the modified
data can be reencrypted/recompressed and sent back to the server.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Limit a subrequest to a maximum size and/or a maximum number of contiguous
physical regions. This permits, for instance, an subreq's iterator to be
limited to the number of DMA'able segments that a large RDMA request can
handle.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a function to work out how much of an ITER_BVEC or ITER_XARRAY iterator
we can use in a pagecount-limited and size-limited span. This will be
used, for example, to limit the number of segments in a subrequest to the
maximum number of elements that an RDMA transfer can handle.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide tools to create a buffer in an xarray, with a function to add new
folios with a mark. This will be used to create bounce buffer and can be
used more easily to create a list of folios the span of which would require
more than a page's worth of bio_vec structs.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a bvec array pointer and an iterator to netfs_io_request for either
holding a copy of a DIO iterator or a list of all the bits of buffer
pointed to by a DIO iterator.
There are two problems: Firstly, if an iovec-class iov_iter is passed to
->read_iter() or ->write_iter(), this cannot be passed directly to
kernel_sendmsg() or kernel_recvmsg() as that may cause locking recursion if
a fault is generated, so we need to keep track of the pages involved
separately.
Secondly, if the I/O is asynchronous, we must copy the iov_iter describing
the buffer before returning to the caller as it may be immediately
deallocated.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add three iov_iter structs:
(1) Add an iov_iter (->iter) to the I/O request to describe the
unencrypted-side buffer.
(2) Add an iov_iter (->io_iter) to the I/O request to describe the
encrypted-side I/O buffer. This may be a different size to the buffer
in (1).
(3) Add an iov_iter (->io_iter) to the I/O subrequest to describe the part
of the I/O buffer for that subrequest.
This will allow future patches to point to a bounce buffer instead for
purposes of handling oversize writes, decryption (where we want to save the
encrypted data to the cache) and decompression.
These iov_iters persist for the lifetime of the (sub)request, and so can be
accessed multiple times without worrying about them being deallocated upon
return to the caller.
The network filesystem must appropriately advance the iterator before
terminating the request.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Borrow NFS's direct-vs-buffered I/O locking into netfslib. Similar code is
also used in ceph.
Modify it to have the correct checker annotations for i_rwsem lock
acquisition/release and to return -ERESTARTSYS if waits are interrupted.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide default invalidate_folio and release_folio calls. These will need
to interact with invalidation correctly at some point. They will be needed
if netfslib is to make use of folio->private for its own purposes.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a ->free_subrequest() op so that the netfs can clean up data attached
to a subrequest.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Allow the network filesystem to specify extra space to be allocated on the
end of the io (sub)request. This allows cifs, for example, to use this
space rather than allocating its own cifs_readdata struct.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a procfile, /proc/fs/netfs/requests, to list in-progress netfslib I/O
requests.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Move the resource pinning-for-writeback from fscache code to netfslib code.
This is used to keep a cache backing object pinned whilst we have dirty
pages on the netfs inode in the pagecache such that VM writeback will be
able to reach it.
Whilst we're at it, switch the parameters of netfs_unpin_writeback() to
match ->write_inode() so that it can be used for that directly.
Note that this mechanism could be more generically useful than that for
network filesystems. Quite often they have to keep around other resources
(e.g. authentication tokens or network connections) until the writeback is
complete.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Rename /proc/fs/fscache to "netfs" and make a symlink from fscache to that.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Christian Brauner <christian@brauner.io>
cc: linux-fsdevel@vger.kernel.org
cc: linux-cachefs@redhat.com
|
|
Remove ->begin_cache_operation() in favour of just calling fscache directly.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Christian Brauner <christian@brauner.io>
cc: linux-fsdevel@vger.kernel.org
cc: linux-cachefs@redhat.com
|
|
Move netfs_extract_iter_to_sg() to lib/scatterlist.c as it's going to be
used by more than just network filesystems (AF_ALG, for example).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-crypto@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Rename netfs_extract_iter_to_sg() and its auxiliary functions to drop the
netfs_ prefix.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-crypto@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Provide a function for filling in a scatterlist from the list of pages
contained in an iterator.
If the iterator is UBUF- or IOBUF-type, the pages have a pin taken on them
(as FOLL_PIN).
If the iterator is BVEC-, KVEC- or XARRAY-type, no pin is taken on the
pages and it is left to the caller to manage their lifetime. It cannot be
assumed that a ref can be validly taken, particularly in the case of a KVEC
iterator.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add a function to extract the pages from a user-space supplied iterator
(UBUF- or IOVEC-type) into a BVEC-type iterator, retaining the pages by
getting a pin on them (as FOLL_PIN) as we go.
This is useful in three situations:
(1) A userspace thread may have a sibling that unmaps or remaps the
process's VM during the operation, changing the assignment of the
pages and potentially causing an error. Retaining the pages keeps
some pages around, even if this occurs; futher, we find out at the
point of extraction if EFAULT is going to be incurred.
(2) Pages might get swapped out/discarded if not retained, so we want to
retain them to avoid the reload causing a deadlock due to a DIO
from/to an mmapped region on the same file.
(3) The iterator may get passed to sendmsg() by the filesystem. If a
fault occurs, we may get a short write to a TCP stream that's then
tricky to recover from.
We don't deal with other types of iterator here, leaving it to other
mechanisms to retain the pages (eg. PG_locked, PG_writeback and the pipe
lock).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add prepare_ondemand_read() callback dedicated for the on-demand read
scenario, so that callers from this scenario can be decoupled from
netfs_io_subrequest.
The original cachefiles_prepare_read() is now refactored to a generic
routine accepting a parameter list instead of netfs_io_subrequest.
There's no logic change, except that the debug id of subrequest and
request is removed from trace_cachefiles_prep_read().
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20221124034212.81892-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Pull folio updates from Matthew Wilcox:
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into
their own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
fs: remove the NULL get_block case in mpage_writepages
fs: don't call ->writepage from __mpage_writepage
fs: remove the nobh helpers
jfs: stop using the nobh helper
ext2: remove nobh support
ntfs3: refactor ntfs_writepages
mm/folio-compat: Remove migration compatibility functions
fs: Remove aops->migratepage()
secretmem: Convert to migrate_folio
hugetlb: Convert to migrate_folio
aio: Convert to migrate_folio
f2fs: Convert to filemap_migrate_folio()
ubifs: Convert to filemap_migrate_folio()
btrfs: Convert btrfs_migratepage to migrate_folio
mm/migrate: Add filemap_migrate_folio()
mm/migrate: Convert migrate_page() to migrate_folio()
nfs: Convert to migrate_folio
btrfs: Convert btree_migratepage to migrate_folio
mm/migrate: Convert expected_page_refs() to folio_expected_refs()
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
...
|
|
check_write_begin() will unlock and put the folio when return
non-zero. So we should avoid unlocking and putting it twice in
netfs layer.
Change the way ->check_write_begin() works in the following two ways:
(1) Pass it a pointer to the folio pointer, allowing it to unlock and put
the folio prior to doing the stuff it wants to do, provided it clears
the folio pointer.
(2) Change the return values such that 0 with folio pointer set means
continue, 0 with folio pointer cleared means re-get and all error
codes indicating an error (no special treatment for -EAGAIN).
[ bagasdotme: use Sphinx code text syntax for *foliop pointer ]
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/56423
Link: https://lore.kernel.org/r/cf169f43-8ee7-8697-25da-0204d1b4343e@redhat.com
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The 'extern' keyword is not necessary and removing it lets us shorten
some lines.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
Commit e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode
wrapper introduced") changed the argument types and names, and actually
updated the comment too (although that was thanks to David Howells, not
me: my original patch only changed the code).
But the comment fixup didn't go quite far enough, and didn't change the
argument name in the comment, resulting in
include/linux/netfs.h:314: warning: Function parameter or member 'ctx' not described in 'netfs_inode_init'
include/linux/netfs.h:314: warning: Excess function parameter 'inode' description in 'netfs_inode_init'
during htmldoc generation.
Fixes: e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode wrapper introduced")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The netfs_io_request cleanup op is now always in a position to be given a
pointer to a netfs_io_request struct, so this can be passed in instead of
the mapping and private data arguments (both of which are included in the
struct).
So rename the ->cleanup op to ->free_request (to match ->init_request) and
pass in the I/O pointer.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
|
|
Change the signature of netfs helper functions to take a struct netfs_inode
pointer rather than a struct inode pointer where appropriate, thereby
relieving the need for the network filesystem to convert its internal inode
format down to the VFS inode only for netfslib to bounce it back up. For
type safety, it's better not to do that (and it's less typing too).
Give netfs_write_begin() an extra argument to pass in a pointer to the
netfs_inode struct rather than deriving it internally from the file
pointer. Note that the ->write_begin() and ->write_end() ops are intended
to be replaced in the future by netfslib code that manages this without the
need to call in twice for each page.
netfs_readpage() and similar are intended to be pointed at directly by the
address_space_operations table, so must stick to the signature dictated by
the function pointers there.
Changes
=======
- Updated the kerneldoc comments and documentation [DH].
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=wgkwKyNmNdKpQkqZ6DnmUL-x9hp0YBnUGjaPFEAdxDTbw@mail.gmail.com/
|
|
While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:
In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.
Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).
Most of the changes were done with:
perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`
Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.
Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].
Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename nfs_i_context to nfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.
[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]
Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull page cache updates from Matthew Wilcox:
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first
argument like ->read_folio
* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
nilfs2: Fix some kernel-doc comments
Appoint myself page cache maintainer
fs: Remove aops->freepage
secretmem: Convert to free_folio
nfs: Convert to free_folio
orangefs: Convert to free_folio
fs: Add free_folio address space operation
fs: Convert drop_buffers() to use a folio
fs: Change try_to_free_buffers() to take a folio
jbd2: Convert release_buffer_page() to use a folio
jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
reiserfs: Convert release_buffer_page() to use a folio
fs: Remove last vestiges of releasepage
ubifs: Convert to release_folio
reiserfs: Convert to release_folio
orangefs: Convert to release_folio
ocfs2: Convert to release_folio
nilfs2: Remove comment about releasepage
nfs: Convert to release_folio
jfs: Convert to release_folio
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs (and fscache) updates from Gao Xiang:
"After working on it on the mailing list for more than half a year, we
finally form 'erofs over fscache' feature into shape. Hopefully it
could bring more possibility to the communities.
The story mainly started from a new project what we called "RAFS v6" [1]
for Nydus image service almost a year ago, which enhances EROFS to be
a new form of one bootstrap (which includes metadata representing the
whole fs tree) + several data-deduplicated content addressable blobs
(actually treated as multiple devices). Each blob can represent one
container image layer but not quite exactly since all new data can be
fully existed in the previous blobs so no need to introduce another
new blob.
It is actually not a new idea (at least on my side it's much like a
simpilied casync [2] for now) and has many benefits over per-file
blobs or some other exist ways since typically each RAFS v6 image only
has dozens of device blobs instead of thousands of per-file blobs.
It's easy to be signed with user keys as a golden image, transfered
untouchedly with minimal overhead over the network, kept in some type
of storage conveniently, and run with (optional) runtime verification
but without involving too many irrelevant features crossing the system
beyond EROFS itself. At least it's our final goal and we're keeping
working on it. There was also a good summary of this approach from the
casync author [3].
Regardless further optimizations, this work is almost done in the
previous Linux release cycles. In this round, we'd like to introduce
on-demand load for EROFS with the fscache/cachefiles infrastructure,
considering the following advantages:
- Introduce new file-based backend to EROFS. Although each image only
contains dozens of blobs but in densely-deployed runC host for
example, there could still be massive blobs on a machine, which is
messy if each blob is treated as a device. In contrast, fscache and
cachefiles are really great interfaces for us to make them work.
- Introduce on-demand load to fscache and EROFS. Previously, fscache
is mainly used to caching network-likewise filesystems, now it can
support on-demand downloading for local fses too with the exact
localfs on-disk format. It has many advantages which we're been
described in the latest patchset cover letter [4]. In addition to
that, most importantly, the cached data is still stored in the
original local fs on-disk format so that it's still the one signed
with private keys but only could be partially available. Users can
fully trust it during running. Later, users can also back up
cachefiles easily to another machine.
- More reliable on-demand approach in principle. After data is all
available locally, user daemon can be no longer online in some use
cases, which helps daemon crash recovery (filesystems can still in
service) and hot-upgrade (user daemon can be upgraded more
frequently due to new features or protocols introduced.)
- Other format can also be converted to EROFS filesystem format over
the internet on the fly with the new on-demand load feature and
mounted. That is entirely possible with on-demand load feature as
long as such archive format metadata can be fetched in advance like
stargz.
In addition, although currently our target user is Nydus image service [5],
but laterly, it can be used for other use cases like on-demand system
booting, etc. As for the fscache on-demand load feature itself,
strictly it can be used for other local fses too. Laterly we could
promote most code to the iomap infrastructure and also enhance it in
the read-write way if other local fses are interested.
Thanks David Howells for taking so much time and patience on this
these months, many thanks with great respect here again! Thanks Jeffle
for working on this feature and Xin Yin from Bytedance for
asynchronous I/O implementation as well as Zichen Tian, Jia Zhu, and
Yan Song for testing, much appeciated. We're also exploring more
possibly over fscache cache management over FSDAX for secure
containers and working on more improvements and useful features for
fscache, cachefiles, and on-demand load.
In addition to "erofs over fscache", NFS export and idmapped mount are
also completed in this cycle for container use cases as well.
Summary:
- Add erofs on-demand load support over fscache
- Support NFS export for erofs
- Support idmapped mounts for erofs
- Don't prompt for risk any more when using big pcluster
- Fix buffer copy overflow of ztailpacking feature
- Several minor cleanups"
[1] https://lore.kernel.org/r/20210730194625.93856-1-hsiangkao@linux.alibaba.com
[2] https://github.com/systemd/casync
[3] http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html
[4] https://lore.kernel.org/r/20220509074028.74954-1-jefflexu@linux.alibaba.com
[5] https://github.com/dragonflyoss/image-service
* tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (29 commits)
erofs: scan devices from device table
erofs: change to use asynchronous io for fscache readpage/readahead
erofs: add 'fsid' mount option
erofs: implement fscache-based data readahead
erofs: implement fscache-based data read for inline layout
erofs: implement fscache-based data read for non-inline layout
erofs: implement fscache-based metadata read
erofs: register fscache context for extra data blobs
erofs: register fscache context for primary data blob
erofs: add erofs_fscache_read_folios() helper
erofs: add anonymous inode caching metadata for data blobs
erofs: add fscache context helper functions
erofs: register fscache volume
erofs: add fscache mode check helper
erofs: make erofs_map_blocks() generally available
cachefiles: document on-demand read mode
cachefiles: add tracepoints for on-demand read mode
cachefiles: enable on-demand read mode
cachefiles: implement on-demand read
cachefiles: notify the user daemon when withdrawing cookie
...
|
|
Implement the data plane of on-demand read mode.
The early implementation [1] place the entry to
cachefiles_ondemand_read() in fscache_read(). However, fscache_read()
can only detect if the requested file range is fully cache miss, whilst
we need to notify the user daemon as long as there's a hole inside the
requested file range.
Thus the entry is now placed in cachefiles_prepare_read(). When working
in on-demand read mode, once a hole detected, the read routine will send
a READ request to the user daemon. The user daemon needs to fetch the
data and write it to the cache file. After sending the READ request, the
read routine will hang there, until the READ request is handled by the
user daemon. Then it will retry to read from the same file range. If no
progress encountered, the read routine will fail then.
A new NETFS_SREQ_ONDEMAND flag is introduced to indicate that on-demand
read should be done when a cache miss encountered.
[1] https://lore.kernel.org/all/20220406075612.60298-6-jefflexu@linux.alibaba.com/ #v8
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220425122143.56815-6-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
This is straightforward because netfs already worked in terms of folios.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
There are no more aop flags left, so remove the parameter.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Clang's structure layout randomization feature gets upset when it sees
struct inode (which is randomized) cast to struct netfs_i_context. This
is due to seeing the inode pointer as being treated as an array of inodes,
rather than "something else, following struct inode".
Since netfs can't use container_of() (since it doesn't know what the
true containing struct is), it uses this direct offset instead. Adjust
the code to better reflect what is happening: an arbitrary pointer is
being adjusted and cast to something else: use a "void *" for the math.
The resulting binary output is the same, but Clang no longer sees an
unexpected cross-structure cast:
In file included from ../fs/nfs/inode.c:50:
In file included from ../fs/nfs/fscache.h:15:
In file included from ../include/linux/fscache.h:18:
../include/linux/netfs.h:298:9: error: casting from randomized structure pointer type 'struct inode *' to 'struct netfs_i_context *'
return (struct netfs_i_context *)(inode + 1);
^
1 error generated.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220503205503.3054173-2-keescook@chromium.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/lkml/7562f8eccd7cc0e447becfe9912179088784e3b9.camel@kernel.org
|
|
Provide a place in which to keep track of the actual remote file size in
the netfs context. This is needed because inode->i_size will be updated as
we buffer writes in the pagecache, but the server file size won't get
updated until we flush them back.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164623013727.3564931.17659955636985232717.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678219305.1200972.6459431995188365134.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692921865.2099075.5310757978508056134.stgit@warthog.procyon.org.uk/ # v3
|
|
Add a netfs_i_context struct that should be included in the network
filesystem's own inode struct wrapper, directly after the VFS's inode
struct, e.g.:
struct my_inode {
struct {
/* These must be contiguous */
struct inode vfs_inode;
struct netfs_i_context netfs_ctx;
};
};
The netfs_i_context struct so far contains a single field for the network
filesystem to use - the cache cookie:
struct netfs_i_context {
...
struct fscache_cookie *cache;
};
Three functions are provided to help with this:
(1) void netfs_i_context_init(struct inode *inode,
const struct netfs_request_ops *ops);
Initialise the netfs context and set the operations.
(2) struct netfs_i_context *netfs_i_context(struct inode *inode);
Find the netfs context from the VFS inode.
(3) struct inode *netfs_inode(struct netfs_i_context *ctx);
Find the VFS inode from the netfs context.
Changes
=======
ver #4)
- Fix netfs_is_cache_enabled() to check cookie->cache_priv to see if a
cache is present[3].
- Fix netfs_skip_folio_read() to zero out all of the page, not just some
of it[3].
ver #3)
- Split out the bit to move ceph cap-getting on readahead into
ceph_init_request()[1].
- Stick in a comment to the netfs inode structs indicating the contiguity
requirements[2].
ver #2)
- Adjust documentation to match.
- Use "#if IS_ENABLED()" in netfs_i_cookie(), not "#ifdef".
- Move the cap check from ceph_readahead() to ceph_init_request() to be
called from netfslib.
- Remove ceph_readahead() and use netfs_readahead() directly instead.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/8af0d47f17d89c06bbf602496dd845f2b0bf25b3.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/beaf4f6a6c2575ed489adb14b257253c868f9a5c.camel@kernel.org/ [2]
Link: https://lore.kernel.org/r/3536452.1647421585@warthog.procyon.org.uk/ [3]
Link: https://lore.kernel.org/r/164622984545.3564931.15691742939278418580.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678213320.1200972.16807551936267647470.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692909854.2099075.9535537286264248057.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/306388.1647595110@warthog.procyon.org.uk/ # v4
|