From 000dbe0bec058cbf2ca9e156e4a5584f5158b0f9 Mon Sep 17 00:00:00 2001
From: Dave Wysochanski
Date: Mon, 20 Feb 2023 08:43:06 -0500
Subject: NFS: Convert buffered read paths to use netfs when fscache is enabled

Convert the NFS buffered read code paths to corresponding netfs APIs,
but only when fscache is configured and enabled.

The netfs API defines struct netfs_request_ops which must be filled
in by the network filesystem.  For NFS, we only need to define 5 of
the functions, the main one being the issue_read() function.
The issue_read() function is called by the netfs layer when a read
cannot be fulfilled locally, and must be sent to the server (either
the cache is not active, or it is active but the data is not
available).  Once the read from the server is complete, netfs requires
a call to netfs_subreq_terminated() which conveys either how many
bytes were read successfully, or an error.  Note that issue_read() is
called with a structure, netfs_io_subrequest, which defines the IO
requested, and contains a start and a length (both in bytes), and
assumes the underlying netfs will return either an error on the whole
region, or the number of bytes successfully read.

The NFS IO path is page based and the main APIs are the pgio APIs
defined in pagelist.c.  For the pgio APIs, there is no way for the
caller to know how many RPCs will be sent and how the pages will be
broken up into underlying RPCs, each of which will have its own
completion and return code.  In contrast, netfs is subrequest based:
a single subrequest may contain multiple pages, and a single
subrequest is initiated with issue_read() and terminated with
netfs_subreq_terminated().  Thus, to utilize the netfs APIs, NFS needs
some way to accommodate the netfs API requirement of a single response
to the whole subrequest, while also minimizing disruptive changes to
the NFS pgio layer.

The approach taken with this patch is to allocate a small structure
for each nfs_netfs_issue_read() call, store the final error and number
of bytes successfully transferred in the structure, and update these
values as each RPC completes.  The refcount on the structure is used
as a marker for the last RPC completion: it is incremented in
nfs_netfs_read_initiate(), and decremented inside
nfs_netfs_read_completion(), when an nfs_pgio_header contains a valid
pointer to the data.  On the final put (which signals the final
outstanding RPC is complete) in nfs_netfs_read_completion(), call
netfs_subreq_terminated() with either the final error value (if one or
more READs complete with an error) or the number of bytes successfully
transferred (if all RPCs complete successfully).  Note that when all
RPCs complete successfully, the number of bytes transferred is capped
to the length of the subrequest.  Capping the transferred length to
the subrequest length prevents "Subreq overread" warnings from netfs.
These warnings are due to the "aligned_len" in nfs_pageio_add_page(),
and the corner case where NFS requests a full page at the end of the
file, even when i_size reflects only a partial page (NFS overread).

Signed-off-by: Dave Wysochanski
Tested-by: Daire Byrne
Signed-off-by: Anna Schumaker
---
 fs/nfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'fs/nfs/inode.c')

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 222a28320e1c..5c8027e3c961 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -2254,6 +2254,8 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
 #ifdef CONFIG_NFS_V4_2
 	nfsi->xattr_cache = NULL;
 #endif
+	nfs_netfs_inode_init(nfsi);
+
 	return &nfsi->vfs_inode;
 }
 EXPORT_SYMBOL_GPL(nfs_alloc_inode);
-- 
cgit

From 3db63daabe210af32a09533fe7d8d47c711a103c Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Wed, 22 Mar 2023 09:27:04 +1100
Subject: NFSv3: handle out-of-order write replies.

NFSv3 includes pre/post wcc attributes which allow the client to
determine if all changes to the file have been made by the client
itself, or if any might have been made by some other client.

If there are gaps in the pre/post ctime sequence it must be assumed
that some other client changed the file in that gap and the local
cache must be treated as suspect.  The next time the file is opened
the cache should be invalidated.

Since Commit 1c341b777501 ("NFS: Add deferred cache invalidation for
close-to-open consistency violations") in Linux 5.3 the Linux client
has been triggering this invalidation.  The chunk in nfs_update_inode()
in particular triggers it.

Unfortunately Linux NFS assumes that all requests will be processed in
the order sent, and that their replies will arrive in the order
processed.  This is not true in general.  Consequently Linux NFS might
ignore the wcc info in a WRITE reply because the reply is in response
to a WRITE that was sent before some other request for which a reply
has already been seen.  This is detected by Linux using the gencount
tests in nfs_inode_attr_cmp().

Also, when the gencount tests pass it is still possible that the
requests were processed on the server in a different order, and a gap
seen in the ctime sequence might be filled in by a subsequent reply,
so gaps should not immediately trigger delayed invalidation.

The net result is that writing to a server and then reading the file
back can result in going to the server for the read rather than
serving it from cache - all because a couple of replies arrived
out-of-order.  This is a performance regression over kernels before
5.3, though the change in 5.3 is a correctness improvement.

This has been seen with Linux writing to a NetApp server which
occasionally re-orders requests.  In testing the majority of requests
were in-order, but a few (maybe two or three at a time) could be
re-ordered.

This patch addresses the problem by recording any gaps seen in the
pre/post ctime sequence and not triggering invalidation until either
there are too many gaps to fit in the table, or until there are no
more active writes and the remaining gaps cannot be resolved.

We allocate a table of 16 gaps on demand.  If the allocation fails we
revert to current behaviour, which is of little cost as we are
unlikely to be able to cache the writes anyway.

In the table we store a "start->end" pair when iversion is updated and
"end<-start" pairs for the pre/post pairs reported by the server.
Usually these exactly cancel out and so nothing is stored.  When there
are out-of-order replies we do store gaps and these will eventually be
cancelled against later replies when this client is the only writer.

If the final write is out-of-order there may be one gap remaining when
the file is closed.  This will be noticed, and if there is precisely
one gap and the iversion can be advanced to match it, then we do so.

This patch makes no attempt to handle directories correctly.  The same
problem potentially exists there, in that out-of-order replies to
create/unlink requests can cause future lookup requests to be sent to
the server unnecessarily.  A similar scheme using the same primitives
could be used to notice and handle out-of-order replies.

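As an illustration only (not part of this patch), the cancellation of
"start->end" and "end<-start" pairs described above can be sketched in
userspace C.  The names ooo_table, ooo_merge(), ooo_reply() and
ooo_iversion_advance() are hypothetical, and the "overflow" flag is
only an assumed stand-in for NFS_INO_DATA_INVAL_DEFER.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define OOO_MAX_GAPS 16	/* table size taken from the description above */

/* Hypothetical userspace stand-in for the per-inode gap table. */
struct ooo_table {
	size_t cnt;
	bool overflow;	/* assumed stand-in for NFS_INO_DATA_INVAL_DEFER */
	struct { uint64_t start, end; } gap[OOO_MAX_GAPS];
};

/* Add the range start->end, joining it with any stored range it touches. */
static void ooo_merge(struct ooo_table *t, uint64_t start, uint64_t end)
{
	size_t i = 0;

	if (t->overflow)
		return;
	while (i < t->cnt) {
		if (end == t->gap[i].start) {
			end = t->gap[i].end;
		} else if (start == t->gap[i].end) {
			start = t->gap[i].start;
		} else {
			i++;
			continue;
		}
		/* Drop the absorbed entry and rescan with the joined range. */
		t->cnt--;
		t->gap[i] = t->gap[t->cnt];
		i = 0;
	}
	if (start == end)
		return;			/* everything cancelled out */
	if (t->cnt == OOO_MAX_GAPS) {
		t->overflow = true;	/* too many gaps: give up */
		t->cnt = 0;
		return;
	}
	t->gap[t->cnt].start = start;
	t->gap[t->cnt].end = end;
	t->cnt++;
}

/* The local version counter moved from 'from' to 'to'. */
static void ooo_iversion_advance(struct ooo_table *t, uint64_t from,
				 uint64_t to)
{
	ooo_merge(t, from, to);
}

/* A reply reported the change id going from 'pre' to 'post'. */
static void ooo_reply(struct ooo_table *t, uint64_t pre, uint64_t post)
{
	ooo_merge(t, post, pre);	/* stored reversed so that it cancels */
}

#ifdef OOO_DEMO_MAIN
int main(void)
{
	struct ooo_table t = { 0 };

	/* Server applied W1 (10->11) then W2 (11->12); W2 is seen first. */
	ooo_reply(&t, 11, 12);
	ooo_iversion_advance(&t, 10, 12);
	assert(t.cnt == 1);		/* 10->11 is still unexplained */

	ooo_reply(&t, 10, 11);		/* late W1 reply fills the gap */
	assert(t.cnt == 0);
	puts("gaps resolved");
	return 0;
}
#endif
```

Building with -DOOO_DEMO_MAIN runs the two-write scenario.  In this
sketch a change made by another client would leave an entry that no
later reply cancels, which is the case that still has to lead to
invalidation.
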
Signed-off-by: NeilBrown
Signed-off-by: Anna Schumaker
---
 fs/nfs/inode.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 97 insertions(+), 15 deletions(-)

(limited to 'fs/nfs/inode.c')

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 5c8027e3c961..eb8af1e404d9 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -208,11 +208,12 @@ void nfs_set_cache_invalid(struct inode *inode, unsigned long flags)
 	nfsi->cache_validity |= flags;
-	if (inode->i_mapping->nrpages == 0)
-		nfsi->cache_validity &= ~(NFS_INO_INVALID_DATA |
-					  NFS_INO_DATA_INVAL_DEFER);
-	else if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
-		nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
+	if (inode->i_mapping->nrpages == 0) {
+		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+		nfs_ooo_clear(nfsi);
+	} else if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
+		nfs_ooo_clear(nfsi);
+	}
 	trace_nfs_set_cache_invalid(inode, 0);
 }
 EXPORT_SYMBOL_GPL(nfs_set_cache_invalid);
@@ -677,9 +678,10 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
 	trace_nfs_size_truncate(inode, offset);
 	i_size_write(inode, offset);
 	/* Optimisation */
-	if (offset == 0)
-		NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_DATA |
-				NFS_INO_DATA_INVAL_DEFER);
+	if (offset == 0) {
+		NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_DATA;
+		nfs_ooo_clear(NFS_I(inode));
+	}
 	NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_SIZE;
 	spin_unlock(&inode->i_lock);
@@ -1109,7 +1111,7 @@ void nfs_inode_attach_open_context(struct nfs_open_context *ctx)
 	spin_lock(&inode->i_lock);
 	if (list_empty(&nfsi->open_files) &&
-	    (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
+	    nfs_ooo_test(nfsi))
 		nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA |
 						     NFS_INO_REVAL_FORCED);
 	list_add_tail_rcu(&ctx->list, &nfsi->open_files);
@@ -1353,8 +1355,8 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)
 	set_bit(NFS_INO_INVALIDATING, bitlock);
 	smp_wmb();
-	nfsi->cache_validity &=
-		~(NFS_INO_INVALID_DATA | NFS_INO_DATA_INVAL_DEFER);
+	nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+	nfs_ooo_clear(nfsi);
 	spin_unlock(&inode->i_lock);
 	trace_nfs_invalidate_mapping_enter(inode);
 	ret = nfs_invalidate_mapping(inode, mapping);
@@ -1816,6 +1818,66 @@ static int nfs_inode_finish_partial_attr_update(const struct nfs_fattr *fattr,
 	return 0;
 }
+static void nfs_ooo_merge(struct nfs_inode *nfsi,
+			  u64 start, u64 end)
+{
+	int i, cnt;
+
+	if (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)
+		/* No point merging anything */
+		return;
+
+	if (!nfsi->ooo) {
+		nfsi->ooo = kmalloc(sizeof(*nfsi->ooo), GFP_ATOMIC);
+		if (!nfsi->ooo) {
+			nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+			return;
+		}
+		nfsi->ooo->cnt = 0;
+	}
+
+	/* add this range, merging if possible */
+	cnt = nfsi->ooo->cnt;
+	for (i = 0; i < cnt; i++) {
+		if (end == nfsi->ooo->gap[i].start)
+			end = nfsi->ooo->gap[i].end;
+		else if (start == nfsi->ooo->gap[i].end)
+			start = nfsi->ooo->gap[i].start;
+		else
+			continue;
+		/* Remove 'i' from table and loop to insert the new range */
+		cnt -= 1;
+		nfsi->ooo->gap[i] = nfsi->ooo->gap[cnt];
+		i = -1;
+	}
+	if (start != end) {
+		if (cnt >= ARRAY_SIZE(nfsi->ooo->gap)) {
+			nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+			kfree(nfsi->ooo);
+			nfsi->ooo = NULL;
+			return;
+		}
+		nfsi->ooo->gap[cnt].start = start;
+		nfsi->ooo->gap[cnt].end = end;
+		cnt += 1;
+	}
+	nfsi->ooo->cnt = cnt;
+}
+
+static void nfs_ooo_record(struct nfs_inode *nfsi,
+			   struct nfs_fattr *fattr)
+{
+	/* This reply was out-of-order, so record in the
+	 * pre/post change id, possibly cancelling
+	 * gaps created when iversion was jumpped forward.
+	 */
+	if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) &&
+	    (fattr->valid & NFS_ATTR_FATTR_PRECHANGE))
+		nfs_ooo_merge(nfsi,
+			      fattr->change_attr,
+			      fattr->pre_change_attr);
+}
+
 static int nfs_refresh_inode_locked(struct inode *inode,
 				    struct nfs_fattr *fattr)
 {
@@ -1826,8 +1888,12 @@ static int nfs_refresh_inode_locked(struct inode *inode,
 	if (attr_cmp > 0 || nfs_inode_finish_partial_attr_update(fattr, inode))
 		ret = nfs_update_inode(inode, fattr);
-	else if (attr_cmp == 0)
-		ret = nfs_check_inode_attributes(inode, fattr);
+	else {
+		nfs_ooo_record(NFS_I(inode), fattr);
+
+		if (attr_cmp == 0)
+			ret = nfs_check_inode_attributes(inode, fattr);
+	}
 	trace_nfs_refresh_inode_exit(inode, ret);
 	return ret;
@@ -1918,6 +1984,8 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa
 	if (attr_cmp < 0)
 		return 0;
 	if ((fattr->valid & NFS_ATTR_FATTR) == 0 || !attr_cmp) {
+		/* Record the pre/post change info before clearing PRECHANGE */
+		nfs_ooo_record(NFS_I(inode), fattr);
 		fattr->valid &= ~(NFS_ATTR_FATTR_PRECHANGE
 				| NFS_ATTR_FATTR_PRESIZE
 				| NFS_ATTR_FATTR_PREMTIME
@@ -2072,6 +2140,15 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 	/* More cache consistency checks */
 	if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
+		if (!have_writers && nfsi->ooo && nfsi->ooo->cnt == 1 &&
+		    nfsi->ooo->gap[0].end == inode_peek_iversion_raw(inode)) {
+			/* There is one remaining gap that hasn't been
+			 * merged into iversion - do that now.
+			 */
+			inode_set_iversion_raw(inode, nfsi->ooo->gap[0].start);
+			kfree(nfsi->ooo);
+			nfsi->ooo = NULL;
+		}
 		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
 			/* Could it be a race with writeback? */
 			if (!(have_writers || have_delegation)) {
@@ -2093,8 +2170,11 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 				dprintk("NFS: change_attr change on server for file %s/%ld\n",
 						inode->i_sb->s_id,
 						inode->i_ino);
-			} else if (!have_delegation)
-				nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+			} else if (!have_delegation) {
+				nfs_ooo_record(nfsi, fattr);
+				nfs_ooo_merge(nfsi, inode_peek_iversion_raw(inode),
+					      fattr->change_attr);
+			}
 			inode_set_iversion_raw(inode, fattr->change_attr);
 		}
 	} else {
@@ -2248,6 +2328,7 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
 		return NULL;
 	nfsi->flags = 0UL;
 	nfsi->cache_validity = 0UL;
+	nfsi->ooo = NULL;
 #if IS_ENABLED(CONFIG_NFS_V4)
 	nfsi->nfs4_acl = NULL;
 #endif /* CONFIG_NFS_V4 */
@@ -2262,6 +2343,7 @@ EXPORT_SYMBOL_GPL(nfs_alloc_inode);
 void nfs_free_inode(struct inode *inode)
 {
+	kfree(NFS_I(inode)->ooo);
 	kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
 }
 EXPORT_SYMBOL_GPL(nfs_free_inode);
-- 
cgit