Diffstat (limited to 'Documentation/filesystems/nfs')
22 files changed, 2090 insertions, 1502 deletions
diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX deleted file mode 100644 index 66eb6c8c5334..000000000000 --- a/Documentation/filesystems/nfs/00-INDEX +++ /dev/null @@ -1,24 +0,0 @@ -00-INDEX - - this file (nfs-related documentation). -Exporting - - explanation of how to make filesystems exportable. -fault_injection.txt - - information for using fault injection on the server -knfsd-stats.txt - - statistics which the NFS server makes available to user space. -nfs.txt - - nfs client, and DNS resolution for fs_locations. -nfs41-server.txt - - info on the Linux server implementation of NFSv4 minor version 1. -nfs-rdma.txt - - how to install and setup the Linux NFS/RDMA client and server software -nfsroot.txt - - short guide on setting up a diskless box with NFS root filesystem. -pnfs.txt - - short explanation of some of the internals of the pnfs client code -rpc-cache.txt - - introduction to the caching mechanisms in the sunrpc layer. -idmapper.txt - - information for configuring request-keys to be used by idmapper -knfsd-rpcgss.txt - - Information on GSS authentication support in the NFS Server diff --git a/Documentation/filesystems/nfs/Exporting b/Documentation/filesystems/nfs/Exporting deleted file mode 100644 index 09994c247289..000000000000 --- a/Documentation/filesystems/nfs/Exporting +++ /dev/null @@ -1,154 +0,0 @@ - -Making Filesystems Exportable -============================= - -Overview --------- - -All filesystem operations require a dentry (or two) as a starting -point. Local applications have a reference-counted hold on suitable -dentries via open file descriptors or cwd/root. However remote -applications that access a filesystem via a remote filesystem protocol -such as NFS may not be able to hold such a reference, and so need a -different way to refer to a particular dentry. As the alternative -form of reference needs to be stable across renames, truncates, and -server-reboot (among other things, though these tend to be the most -problematic), there is no simple answer like 'filename'. - -The mechanism discussed here allows each filesystem implementation to -specify how to generate an opaque (outside of the filesystem) byte -string for any dentry, and how to find an appropriate dentry for any -given opaque byte string. -This byte string will be called a "filehandle fragment" as it -corresponds to part of an NFS filehandle. - -A filesystem which supports the mapping between filehandle fragments -and dentries will be termed "exportable". - - - -Dcache Issues -------------- - -The dcache normally contains a proper prefix of any given filesystem -tree. This means that if any filesystem object is in the dcache, then -all of the ancestors of that filesystem object are also in the dcache. -As normal access is by filename this prefix is created naturally and -maintained easily (by each object maintaining a reference count on -its parent). - -However when objects are included into the dcache by interpreting a -filehandle fragment, there is no automatic creation of a path prefix -for the object. This leads to two related but distinct features of -the dcache that are not needed for normal filesystem access. - -1/ The dcache must sometimes contain objects that are not part of the - proper prefix. i.e that are not connected to the root. 
-2/ The dcache must be prepared for a newly found (via ->lookup) directory - to already have a (non-connected) dentry, and must be able to move - that dentry into place (based on the parent and name in the - ->lookup). This is particularly needed for directories as - it is a dcache invariant that directories only have one dentry. - -To implement these features, the dcache has: - -a/ A dentry flag DCACHE_DISCONNECTED which is set on - any dentry that might not be part of the proper prefix. - This is set when anonymous dentries are created, and cleared when a - dentry is noticed to be a child of a dentry which is in the proper - prefix. - -b/ A per-superblock list "s_anon" of dentries which are the roots of - subtrees that are not in the proper prefix. These dentries, as - well as the proper prefix, need to be released at unmount time. As - these dentries will not be hashed, they are linked together on the - d_hash list_head. - -c/ Helper routines to allocate anonymous dentries, and to help attach - loose directory dentries at lookup time. They are: - d_alloc_anon(inode) will return a dentry for the given inode. - If the inode already has a dentry, one of those is returned. - If it doesn't, a new anonymous (IS_ROOT and - DCACHE_DISCONNECTED) dentry is allocated and attached. - In the case of a directory, care is taken that only one dentry - can ever be attached. - d_splice_alias(inode, dentry) will make sure that there is a - dentry with the same name and parent as the given dentry, and - which refers to the given inode. - If the inode is a directory and already has a dentry, then that - dentry is d_moved over the given dentry. - If the passed dentry gets attached, care is taken that this is - mutually exclusive to a d_alloc_anon operation. - If the passed dentry is used, NULL is returned, else the used - dentry is returned. This corresponds to the calling pattern of - ->lookup. - - -Filesystem Issues ------------------ - -For a filesystem to be exportable it must: - - 1/ provide the filehandle fragment routines described below. - 2/ make sure that d_splice_alias is used rather than d_add - when ->lookup finds an inode for a given parent and name. - - If inode is NULL, d_splice_alias(inode, dentry) is eqivalent to - - d_add(dentry, inode), NULL - - Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) - - Typically the ->lookup routine will simply end with a: - - return d_splice_alias(inode, dentry); - } - - - - A file system implementation declares that instances of the filesystem -are exportable by setting the s_export_op field in the struct -super_block. This field must point to a "struct export_operations" -struct which has the following members: - - encode_fh (optional) - Takes a dentry and creates a filehandle fragment which can later be used - to find or create a dentry for the same object. The default - implementation creates a filehandle fragment that encodes a 32bit inode - and generation number for the inode encoded, and if necessary the - same information for the parent. - - fh_to_dentry (mandatory) - Given a filehandle fragment, this should find the implied object and - create a dentry for it (possibly with d_alloc_anon). - - fh_to_parent (optional but strongly recommended) - Given a filehandle fragment, this should find the parent of the - implied object and create a dentry for it (possibly with d_alloc_anon). - May fail if the filehandle fragment is too small. 
- - get_parent (optional but strongly recommended) - When given a dentry for a directory, this should return a dentry for - the parent. Quite possibly the parent dentry will have been allocated - by d_alloc_anon. The default get_parent function just returns an error - so any filehandle lookup that requires finding a parent will fail. - ->lookup("..") is *not* used as a default as it can leave ".." entries - in the dcache which are too messy to work with. - - get_name (optional) - When given a parent dentry and a child dentry, this should find a name - in the directory identified by the parent dentry, which leads to the - object identified by the child dentry. If no get_name function is - supplied, a default implementation is provided which uses vfs_readdir - to find potential names, and matches inode numbers to find the correct - match. - - -A filehandle fragment consists of an array of 1 or more 4byte words, -together with a one byte "type". -The decode_fh routine should not depend on the stated size that is -passed to it. This size may be larger than the original filehandle -generated by encode_fh, in which case it will have been padded with -nuls. Rather, the encode_fh routine should choose a "type" which -indicates the decode_fh how much of the filehandle is valid, and how -it should be interpreted. diff --git a/Documentation/filesystems/nfs/client-identifier.rst b/Documentation/filesystems/nfs/client-identifier.rst new file mode 100644 index 000000000000..4804441155f5 --- /dev/null +++ b/Documentation/filesystems/nfs/client-identifier.rst @@ -0,0 +1,216 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +NFSv4 client identifier +======================= + +This document explains how the NFSv4 protocol identifies client +instances in order to maintain file open and lock state during +system restarts. A special identifier and principal are maintained +on each client. These can be set by administrators, scripts +provided by site administrators, or tools provided by Linux +distributors. + +There are risks if a client's NFSv4 identifier and its principal +are not chosen carefully. + + +Introduction +------------ + +The NFSv4 protocol uses "lease-based file locking". Leases help +NFSv4 servers provide file lock guarantees and manage their +resources. + +Simply put, an NFSv4 server creates a lease for each NFSv4 client. +The server collects each client's file open and lock state under +the lease for that client. + +The client is responsible for periodically renewing its leases. +While a lease remains valid, the server holding that lease +guarantees the file locks the client has created remain in place. + +If a client stops renewing its lease (for example, if it crashes), +the NFSv4 protocol allows the server to remove the client's open +and lock state after a certain period of time. When a client +restarts, it indicates to servers that open and lock state +associated with its previous leases is no longer valid and can be +destroyed immediately. + +In addition, each NFSv4 server manages a persistent list of client +leases. When the server restarts and clients attempt to recover +their state, the server uses this list to distinguish amongst +clients that held state before the server restarted and clients +sending fresh OPEN and LOCK requests. This enables file locks to +persist safely across server restarts. + +NFSv4 client identifiers +------------------------ + +Each NFSv4 client presents an identifier to NFSv4 servers so that +they can associate the client with its lease. 
Each client's +identifier consists of two elements: + + - co_ownerid: An arbitrary but fixed string. + + - boot verifier: A 64-bit incarnation verifier that enables a + server to distinguish successive boot epochs of the same client. + +The NFSv4.0 specification refers to these two items as an +"nfs_client_id4". The NFSv4.1 specification refers to these two +items as a "client_owner4". + +NFSv4 servers tie this identifier to the principal and security +flavor that the client used when presenting it. Servers use this +principal to authorize subsequent lease modification operations +sent by the client. Effectively this principal is a third element of +the identifier. + +As part of the identity presented to servers, a good +"co_ownerid" string has several important properties: + + - The "co_ownerid" string identifies the client during reboot + recovery, therefore the string is persistent across client + reboots. + - The "co_ownerid" string helps servers distinguish the client + from others, therefore the string is globally unique. Note + that there is no central authority that assigns "co_ownerid" + strings. + - Because it often appears on the network in the clear, the + "co_ownerid" string does not reveal private information about + the client itself. + - The content of the "co_ownerid" string is set and unchanging + before the client attempts NFSv4 mounts after a restart. + - The NFSv4 protocol places a 1024-byte limit on the size of the + "co_ownerid" string. + +Protecting NFSv4 lease state +---------------------------- + +NFSv4 servers utilize the "client_owner4" as described above to +assign a unique lease to each client. Under this scheme, there are +circumstances where clients can interfere with each other. This is +referred to as "lease stealing". + +If distinct clients present the same "co_ownerid" string and use +the same principal (for example, AUTH_SYS and UID 0), a server is +unable to tell that the clients are not the same. Each distinct +client presents a different boot verifier, so it appears to the +server as if there is one client that is rebooting frequently. +Neither client can maintain open or lock state in this scenario. + +If distinct clients present the same "co_ownerid" string and use +distinct principals, the server is likely to allow the first client +to operate normally but reject subsequent clients with the same +"co_ownerid" string. + +If a client's "co_ownerid" string or principal are not stable, +state recovery after a server or client reboot is not guaranteed. +If a client unexpectedly restarts but presents a different +"co_ownerid" string or principal to the server, the server orphans +the client's previous open and lock state. This blocks access to +locked files until the server removes the orphaned state. + +If the server restarts and a client presents a changed "co_ownerid" +string or principal to the server, the server will not allow the +client to reclaim its open and lock state, and may give those locks +to other clients in the meantime. This is referred to as "lock +stealing". + +Lease stealing and lock stealing increase the potential for denial +of service and in rare cases even data corruption. + +Selecting an appropriate client identifier +------------------------------------------ + +By default, the Linux NFSv4 client implementation constructs its +"co_ownerid" string starting with the words "Linux NFS" followed by +the client's UTS node name (the same node name, incidentally, that +is used as the "machine name" in an AUTH_SYS credential). 
In small +deployments, this construction is usually adequate. Often, however, +the node name by itself is not adequately unique, and can change +unexpectedly. Problematic situations include: + + - NFS-root (diskless) clients, where the local DHCP server (or + equivalent) does not provide a unique host name. + + - "Containers" within a single Linux host. If each container has + a separate network namespace, but does not use the UTS namespace + to provide a unique host name, then there can be multiple NFS + client instances with the same host name. + + - Clients across multiple administrative domains that access a + common NFS server. If hostnames are not assigned centrally + then uniqueness cannot be guaranteed unless a domain name is + included in the hostname. + +Linux provides two mechanisms to add uniqueness to its "co_ownerid" +string: + + nfs.nfs4_unique_id + This module parameter can set an arbitrary uniquifier string + via the kernel command line, or when the "nfs" module is + loaded. + + /sys/fs/nfs/net/nfs_client/identifier + This virtual file, available since Linux 5.3, is local to the + network namespace in which it is accessed and so can provide + distinction between network namespaces (containers) when the + hostname remains uniform. + +Note that this file is empty on name-space creation. If the +container system has access to some sort of per-container identity +then that uniquifier can be used. For example, a uniquifier might +be formed at boot using the container's internal identifier: + + sha256sum /etc/machine-id | awk '{print $1}' \\ + > /sys/fs/nfs/net/nfs_client/identifier + +Security considerations +----------------------- + +The use of cryptographic security for lease management operations +is strongly encouraged. + +If NFS with Kerberos is not configured, a Linux NFSv4 client uses +AUTH_SYS and UID 0 as the principal part of its client identity. +This configuration is not only insecure, it increases the risk of +lease and lock stealing. However, it might be the only choice for +client configurations that have no local persistent storage. +"co_ownerid" string uniqueness and persistence is critical in this +case. + +When a Kerberos keytab is present on a Linux NFS client, the client +attempts to use one of the principals in that keytab when +identifying itself to servers. The "sec=" mount option does not +control this behavior. Alternately, a single-user client with a +Kerberos principal can use that principal in place of the client's +host principal. + +Using Kerberos for this purpose enables the client and server to +use the same lease for operations covered by all "sec=" settings. +Additionally, the Linux NFS client uses the RPCSEC_GSS security +flavor with Kerberos and the integrity QOS to prevent in-transit +modification of lease modification requests. + +Additional notes +---------------- +The Linux NFSv4 client establishes a single lease on each NFSv4 +server it accesses. NFSv4 mounts from a Linux NFSv4 client of a +particular server then share that lease. + +Once a client establishes open and lock state, the NFSv4 protocol +enables lease state to transition to other servers, following data +that has been migrated. This hides data migration completely from +running applications. The Linux NFSv4 client facilitates state +migration by presenting the same "client_owner4" to all servers it +encounters. + +======== +See Also +======== + + - nfs(5) + - kerberos(7) + - RFC 7530 for the NFSv4.0 specification + - RFC 8881 for the NFSv4.1 specification. 
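As a brief illustration of the two uniquifier mechanisms described above, the sketch below sets nfs.nfs4_unique_id when the "nfs" module is loaded. The modprobe.d file name and the example uniquifier string are illustrative assumptions, not values mandated by the client::

    # Persistently set an arbitrary uniquifier for the "nfs" module
    # (equivalently, pass nfs.nfs4_unique_id=... on the kernel command line)
    echo 'options nfs nfs4_unique_id=3f1b7c9a-example-uniquifier' \
        > /etc/modprobe.d/nfs-client-id.conf

    # If the parameter is exposed under sysfs, the value in effect can be
    # inspected once the module has loaded
    cat /sys/module/nfs/parameters/nfs4_unique_id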
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst new file mode 100644 index 000000000000..de64d2d002a2 --- /dev/null +++ b/Documentation/filesystems/nfs/exporting.rst @@ -0,0 +1,240 @@ +:orphan: + +Making Filesystems Exportable +============================= + +Overview +-------- + +All filesystem operations require a dentry (or two) as a starting +point. Local applications have a reference-counted hold on suitable +dentries via open file descriptors or cwd/root. However remote +applications that access a filesystem via a remote filesystem protocol +such as NFS may not be able to hold such a reference, and so need a +different way to refer to a particular dentry. As the alternative +form of reference needs to be stable across renames, truncates, and +server-reboot (among other things, though these tend to be the most +problematic), there is no simple answer like 'filename'. + +The mechanism discussed here allows each filesystem implementation to +specify how to generate an opaque (outside of the filesystem) byte +string for any dentry, and how to find an appropriate dentry for any +given opaque byte string. +This byte string will be called a "filehandle fragment" as it +corresponds to part of an NFS filehandle. + +A filesystem which supports the mapping between filehandle fragments +and dentries will be termed "exportable". + + + +Dcache Issues +------------- + +The dcache normally contains a proper prefix of any given filesystem +tree. This means that if any filesystem object is in the dcache, then +all of the ancestors of that filesystem object are also in the dcache. +As normal access is by filename this prefix is created naturally and +maintained easily (by each object maintaining a reference count on +its parent). + +However when objects are included into the dcache by interpreting a +filehandle fragment, there is no automatic creation of a path prefix +for the object. This leads to two related but distinct features of +the dcache that are not needed for normal filesystem access. + +1. The dcache must sometimes contain objects that are not part of the + proper prefix. i.e that are not connected to the root. +2. The dcache must be prepared for a newly found (via ->lookup) directory + to already have a (non-connected) dentry, and must be able to move + that dentry into place (based on the parent and name in the + ->lookup). This is particularly needed for directories as + it is a dcache invariant that directories only have one dentry. + +To implement these features, the dcache has: + +a. A dentry flag DCACHE_DISCONNECTED which is set on + any dentry that might not be part of the proper prefix. + This is set when anonymous dentries are created, and cleared when a + dentry is noticed to be a child of a dentry which is in the proper + prefix. If the refcount on a dentry with this flag set + becomes zero, the dentry is immediately discarded, rather than being + kept in the dcache. If a dentry that is not already in the dcache + is repeatedly accessed by filehandle (as NFSD might do), an new dentry + will be a allocated for each access, and discarded at the end of + the access. + + Note that such a dentry can acquire children, name, ancestors, etc. + without losing DCACHE_DISCONNECTED - that flag is only cleared when + subtree is successfully reconnected to root. Until then dentries + in such subtree are retained only as long as there are references; + refcount reaching zero means immediate eviction, same as for unhashed + dentries. 
That guarantees that we won't need to hunt them down upon + umount. + +b. A primitive for creation of secondary roots - d_obtain_root(inode). + Those do _not_ bear DCACHE_DISCONNECTED. They are placed on the + per-superblock list (->s_roots), so they can be located at umount + time for eviction purposes. + +c. Helper routines to allocate anonymous dentries, and to help attach + loose directory dentries at lookup time. They are: + + d_obtain_alias(inode) will return a dentry for the given inode. + If the inode already has a dentry, one of those is returned. + + If it doesn't, a new anonymous (IS_ROOT and + DCACHE_DISCONNECTED) dentry is allocated and attached. + + In the case of a directory, care is taken that only one dentry + can ever be attached. + + d_splice_alias(inode, dentry) will introduce a new dentry into the tree; + either the passed-in dentry or a preexisting alias for the given inode + (such as an anonymous one created by d_obtain_alias), if appropriate. + It returns NULL when the passed-in dentry is used, following the calling + convention of ->lookup. + +Filesystem Issues +----------------- + +For a filesystem to be exportable it must: + + 1. provide the filehandle fragment routines described below. + 2. make sure that d_splice_alias is used rather than d_add + when ->lookup finds an inode for a given parent and name. + + If inode is NULL, d_splice_alias(inode, dentry) is equivalent to:: + + d_add(dentry, inode), NULL + + Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) + + Typically the ->lookup routine will simply end with a:: + + return d_splice_alias(inode, dentry); + } + + + +A file system implementation declares that instances of the filesystem +are exportable by setting the s_export_op field in the struct +super_block. This field must point to a "struct export_operations" +struct which has the following members: + + encode_fh (mandatory) + Takes a dentry and creates a filehandle fragment which may later be used + to find or create a dentry for the same object. + + fh_to_dentry (mandatory) + Given a filehandle fragment, this should find the implied object and + create a dentry for it (possibly with d_obtain_alias). + + fh_to_parent (optional but strongly recommended) + Given a filehandle fragment, this should find the parent of the + implied object and create a dentry for it (possibly with + d_obtain_alias). May fail if the filehandle fragment is too small. + + get_parent (optional but strongly recommended) + When given a dentry for a directory, this should return a dentry for + the parent. Quite possibly the parent dentry will have been allocated + by d_alloc_anon. The default get_parent function just returns an error + so any filehandle lookup that requires finding a parent will fail. + ->lookup("..") is *not* used as a default as it can leave ".." entries + in the dcache which are too messy to work with. + + get_name (optional) + When given a parent dentry and a child dentry, this should find a name + in the directory identified by the parent dentry, which leads to the + object identified by the child dentry. If no get_name function is + supplied, a default implementation is provided which uses vfs_readdir + to find potential names, and matches inode numbers to find the correct + match. + + flags + Some filesystems may need to be handled differently than others. The + export_operations struct also includes a flags field that allows the + filesystem to communicate such information to nfsd. 
See the Export + Operations Flags section below for more explanation. + +A filehandle fragment consists of an array of 1 or more 4byte words, +together with a one byte "type". +The decode_fh routine should not depend on the stated size that is +passed to it. This size may be larger than the original filehandle +generated by encode_fh, in which case it will have been padded with +nuls. Rather, the encode_fh routine should choose a "type" which +indicates the decode_fh how much of the filehandle is valid, and how +it should be interpreted. + +Export Operations Flags +----------------------- +In addition to the operation vector pointers, struct export_operations also +contains a "flags" field that allows the filesystem to communicate to nfsd +that it may want to do things differently when dealing with it. The +following flags are defined: + + EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem + RFC 1813 recommends that servers always send weak cache consistency + (WCC) data to the client after each operation. The server should + atomically collect attributes about the inode, do an operation on it, + and then collect the attributes afterward. This allows the client to + skip issuing GETATTRs in some situations but means that the server + is calling vfs_getattr for almost all RPCs. On some filesystems + (particularly those that are clustered or networked) this is expensive + and atomicity is difficult to guarantee. This flag indicates to nfsd + that it should skip providing WCC attributes to the client in NFSv3 + replies when doing operations on this filesystem. Consider enabling + this on filesystems that have an expensive ->getattr inode operation, + or when atomicity between pre and post operation attribute collection + is impossible to guarantee. + + EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs + Many NFS operations deal with filehandles, which the server must then + vet to ensure that they live inside of an exported tree. When the + export consists of an entire filesystem, this is trivial. nfsd can just + ensure that the filehandle live on the filesystem. When only part of a + filesystem is exported however, then nfsd must walk the ancestors of the + inode to ensure that it's within an exported subtree. This is an + expensive operation and not all filesystems can support it properly. + This flag exempts the filesystem from subtree checking and causes + exportfs to get back an error if it tries to enable subtree checking + on it. + + EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking + On some exportable filesystems (such as NFS) unlinking a file that + is still open can cause a fair bit of extra work. For instance, + the NFS client will do a "sillyrename" to ensure that the file + sticks around while it's still open. When reexporting, that open + file is held by nfsd so we usually end up doing a sillyrename, and + then immediately deleting the sillyrenamed file just afterward when + the link count actually goes to zero. Sometimes this delete can race + with other operations (for instance an rmdir of the parent directory). + This flag causes nfsd to close any open files for this inode _before_ + calling into the vfs to do an unlink or a rename that would replace + an existing file. + + EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote + PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to + write to one bdi (the final bdi) in order to free up writes queued + to another bdi (the client bdi). 
Such threads get a private balance + of dirty pages so that dirty pages for the client bdi do not imact + the daemon writing to the final bdi. For filesystems whose durable + storage is not local (such as exported NFS filesystems), this + constraint has negative consequences. EXPORT_OP_REMOTE_FS enables + an export to disable writeback throttling. + + EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically + EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem + cannot provide the semantics required by the "atomic" boolean in + NFSv4's change_info4. This boolean indicates to a client whether the + returned before and after change attributes were obtained atomically + with the respect to the requested metadata operation (UNLINK, + OPEN/CREATE, MKDIR, etc). + + EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2) + On most filesystems, inodes can remain under writeback after the + file is closed. NFSD relies on client activity or local flusher + threads to handle writeback. Certain filesystems, such as NFS, flush + all of an inode's dirty data on last close. Exports that behave this + way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip + waiting for writeback when closing such files. diff --git a/Documentation/filesystems/nfs/fault_injection.txt b/Documentation/filesystems/nfs/fault_injection.txt deleted file mode 100644 index 426d166089a3..000000000000 --- a/Documentation/filesystems/nfs/fault_injection.txt +++ /dev/null @@ -1,69 +0,0 @@ - -Fault Injection -=============== -Fault injection is a method for forcing errors that may not normally occur, or -may be difficult to reproduce. Forcing these errors in a controlled environment -can help the developer find and fix bugs before their code is shipped in a -production system. Injecting an error on the Linux NFS server will allow us to -observe how the client reacts and if it manages to recover its state correctly. - -NFSD_FAULT_INJECTION must be selected when configuring the kernel to use this -feature. - - -Using Fault Injection -===================== -On the client, mount the fault injection server through NFS v4.0+ and do some -work over NFS (open files, take locks, ...). - -On the server, mount the debugfs filesystem to <debug_dir> and ls -<debug_dir>/nfsd. This will show a list of files that will be used for -injecting faults on the NFS server. As root, write a number n to the file -corresponding to the action you want the server to take. The server will then -process the first n items it finds. So if you want to forget 5 locks, echo '5' -to <debug_dir>/nfsd/forget_locks. A value of 0 will tell the server to forget -all corresponding items. A log message will be created containing the number -of items forgotten (check dmesg). - -Go back to work on the client and check if the client recovered from the error -correctly. - - -Available Faults -================ -forget_clients: - The NFS server keeps a list of clients that have placed a mount call. If - this list is cleared, the server will have no knowledge of who the client - is, forcing the client to reauthenticate with the server. - -forget_openowners: - The NFS server keeps a list of what files are currently opened and who - they were opened by. Clearing this list will force the client to reopen - its files. - -forget_locks: - The NFS server keeps a list of what files are currently locked in the VFS. 
- Clearing this list will force the client to reclaim its locks (files are - unlocked through the VFS as they are cleared from this list). - -forget_delegations: - A delegation is used to assure the client that a file, or part of a file, - has not changed since the delegation was awarded. Clearing this list will - force the client to reaquire its delegation before accessing the file - again. - -recall_delegations: - Delegations can be recalled by the server when another client attempts to - access a file. This test will notify the client that its delegation has - been revoked, forcing the client to reaquire the delegation before using - the file again. - - -tools/nfs/inject_faults.sh script -================================= -This script has been created to ease the fault injection process. This script -will detect the mounted debugfs directory and write to the files located there -based on the arguments passed by the user. For example, running -`inject_faults.sh forget_locks 1` as root will instruct the server to forget -one lock. Running `inject_faults forget_locks` will instruct the server to -forgetall locks. diff --git a/Documentation/filesystems/nfs/idmapper.txt b/Documentation/filesystems/nfs/idmapper.txt deleted file mode 100644 index fe03d10bb79a..000000000000 --- a/Documentation/filesystems/nfs/idmapper.txt +++ /dev/null @@ -1,75 +0,0 @@ - -========= -ID Mapper -========= -Id mapper is used by NFS to translate user and group ids into names, and to -translate user and group names into ids. Part of this translation involves -performing an upcall to userspace to request the information. There are two -ways NFS could obtain this information: placing a call to /sbin/request-key -or by placing a call to the rpc.idmap daemon. - -NFS will attempt to call /sbin/request-key first. If this succeeds, the -result will be cached using the generic request-key cache. This call should -only fail if /etc/request-key.conf is not configured for the id_resolver key -type, see the "Configuring" section below if you wish to use the request-key -method. - -If the call to /sbin/request-key fails (if /etc/request-key.conf is not -configured with the id_resolver key type), then the idmapper will ask the -legacy rpc.idmap daemon for the id mapping. This result will be stored -in a custom NFS idmap cache. - - -=========== -Configuring -=========== -The file /etc/request-key.conf will need to be modified so /sbin/request-key can -direct the upcall. The following line should be added: - -#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ... -#====== ======= =============== =============== =============================== -create id_resolver * * /usr/sbin/nfs.idmap %k %d 600 - -This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap. -The last parameter, 600, defines how many seconds into the future the key will -expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout -is not specified, nfs.idmap will default to 600 seconds. - -id mapper uses for key descriptions: - uid: Find the UID for the given user - gid: Find the GID for the given group - user: Find the user name for the given UID - group: Find the group name for the given GID - -You can handle any of these individually, rather than using the generic upcall -program. If you would like to use your own program for a uid lookup then you -would edit your request-key.conf so it look similar to this: - -#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ... 
-#====== ======= =============== =============== =============================== -create id_resolver uid:* * /some/other/program %k %d 600 -create id_resolver * * /usr/sbin/nfs.idmap %k %d 600 - -Notice that the new line was added above the line for the generic program. -request-key will find the first matching line and corresponding program. In -this case, /some/other/program will handle all uid lookups and -/usr/sbin/nfs.idmap will handle gid, user, and group lookups. - -See <file:Documentation/security/keys-request-key.txt> for more information -about the request-key function. - - -========= -nfs.idmap -========= -nfs.idmap is designed to be called by request-key, and should not be run "by -hand". This program takes two arguments, a serialized key and a key -description. The serialized key is first converted into a key_serial_t, and -then passed as an argument to keyctl_instantiate (both are part of keyutils.h). - -The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap -determines the correct function to call by looking at the first part of the -description string. For example, a uid lookup description will appear as -"uid:user@domain". - -nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise. diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst new file mode 100644 index 000000000000..a29a212b5b4d --- /dev/null +++ b/Documentation/filesystems/nfs/index.rst @@ -0,0 +1,18 @@ +=============================== +NFS +=============================== + + +.. toctree:: + :maxdepth: 1 + + client-identifier + exporting + localio + pnfs + rpc-cache + rpc-server-gss + nfs41-server + nfsd-io-modes + knfsd-stats + reexport diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.rst index 64ced5149d37..80bcf13550de 100644 --- a/Documentation/filesystems/nfs/knfsd-stats.txt +++ b/Documentation/filesystems/nfs/knfsd-stats.rst @@ -1,7 +1,9 @@ - +============================ Kernel NFS Server Statistics ============================ +:Authors: Greg Banks <gnb@sgi.com> - 26 Mar 2009 + This document describes the format and semantics of the statistics which the kernel NFS server makes available to userspace. These statistics are available in several text form pseudo files, each of @@ -18,7 +20,7 @@ by parsing routines. All other lines contain a sequence of fields separated by whitespace. /proc/fs/nfsd/pool_stats ------------------------- +======================== This file is available in kernels from 2.6.30 onwards, if the /proc/fs/nfsd filesystem is mounted (it almost always should be). @@ -68,16 +70,10 @@ sockets-enqueued rate of change for this counter is zero; significantly non-zero values may indicate a performance limitation. - This can happen either because there are too few nfsd threads in the - thread pool for the NFS workload (the workload is thread-limited), - or because the NFS workload needs more CPU time than is available in - the thread pool (the workload is CPU-limited). In the former case, - configuring more nfsd threads will probably improve the performance - of the NFS workload. In the latter case, the sunrpc server layer is - already choosing not to wake idle nfsd threads because there are too - many nfsd threads which want to run but cannot, so configuring more - nfsd threads will make no difference whatsoever. The overloads-avoided - statistic (see below) can be used to distinguish these cases. 
+ This can happen because there are too few nfsd threads in the thread + pool for the NFS workload (the workload is thread-limited), in which + case configuring more nfsd threads will probably improve the + performance of the NFS workload. threads-woken Counts how many times an idle nfsd thread is woken to try to @@ -88,36 +84,6 @@ threads-woken thing. The ideal rate of change for this counter will be close to but less than the rate of change of the packets-arrived counter. -overloads-avoided - Counts how many times the sunrpc server layer chose not to wake an - nfsd thread, despite the presence of idle nfsd threads, because - too many nfsd threads had been recently woken but could not get - enough CPU time to actually run. - - This statistic counts a circumstance where the sunrpc layer - heuristically avoids overloading the CPU scheduler with too many - runnable nfsd threads. The ideal rate of change for this counter - is zero. Significant non-zero values indicate that the workload - is CPU limited. Usually this is associated with heavy CPU usage - on all the CPUs in the nfsd thread pool. - - If a sustained large overloads-avoided rate is detected on a pool, - the top(1) utility should be used to check for the following - pattern of CPU usage on all the CPUs associated with the given - nfsd thread pool. - - - %us ~= 0 (as you're *NOT* running applications on your NFS server) - - - %wa ~= 0 - - - %id ~= 0 - - - %sy + %hi + %si ~= 100 - - If this pattern is seen, configuring more nfsd threads will *not* - improve the performance of the workload. If this patten is not - seen, then something more subtle is wrong. - threads-timedout Counts how many times an nfsd thread triggered an idle timeout, i.e. was not woken to handle any incoming network packets for @@ -145,15 +111,12 @@ this case), or the transport can be enqueued for later attention (sockets-enqueued counts this case), or the packet can be temporarily deferred because the transport is currently being used by an nfsd thread. This last case is not very interesting and is not explicitly -counted, but can be inferred from the other counters thus: +counted, but can be inferred from the other counters thus:: -packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) + packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) More ----- -Descriptions of the other statistics file should go here. - +==== -Greg Banks <gnb@sgi.com> -26 Mar 2009 +Descriptions of the other statistics file should go here. diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst new file mode 100644 index 000000000000..79808b37d745 --- /dev/null +++ b/Documentation/filesystems/nfs/localio.rst @@ -0,0 +1,357 @@ +=========== +NFS LOCALIO +=========== + +Overview +======== + +The LOCALIO auxiliary RPC protocol allows the Linux NFS client and +server to reliably handshake to determine if they are on the same +host. Select "NFS client and server support for LOCALIO auxiliary +protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel +config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled). + +Once an NFS client and server handshake as "local", the client will +bypass the network RPC protocol for read, write and commit operations. +Due to this XDR and RPC bypass, these operations will operate faster. + +The LOCALIO auxiliary protocol's implementation, which uses the same +connection as NFS traffic, follows the pattern established by the NFS +ACL protocol extension. 
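Before relying on the RPC bypass described above, it can help to confirm the prerequisites from this overview. The following is only a sketch: it assumes a distribution that installs the kernel config under /boot, and the localio_enabled parameter is described in the "Module Parameters" section below::

    # Confirm the running kernel was built with LOCALIO support
    # (CONFIG_NFS_FS and CONFIG_NFSD must also be enabled)
    grep -E 'CONFIG_NFS_LOCALIO|CONFIG_NFS_FS=|CONFIG_NFSD=' \
        "/boot/config-$(uname -r)"

    # Confirm LOCALIO has not been administratively disabled (defaults to Y)
    cat /sys/module/nfs/parameters/localio_enabled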
+ +The LOCALIO auxiliary protocol is needed to allow robust discovery of +clients local to their servers. In a private implementation that +preceded use of this LOCALIO protocol, a fragile sockaddr network +address based match against all local network interfaces was attempted. +But unlike the LOCALIO protocol, the sockaddr-based matching didn't +handle use of iptables or containers. + +The robust handshake between local client and server is just the +beginning, the ultimate use case this locality makes possible is the +client is able to open files and issue reads, writes and commits +directly to the server without having to go over the network. The +requirement is to perform these loopback NFS operations as efficiently +as possible, this is particularly useful for container use cases +(e.g. kubernetes) where it is possible to run an IO job local to the +server. + +The performance advantage realized from LOCALIO's ability to bypass +using XDR and RPC for reads, writes and commits can be extreme, e.g.: + +fio for 20 secs with directio, qd of 8, 16 libaio threads: + - With LOCALIO: + 4K read: IOPS=979k, BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec) + 4K write: IOPS=165k, BW=646MiB/s (678MB/s)(12.6GiB/20002msec) + 128K read: IOPS=402k, BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec) + 128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec) + + - Without LOCALIO: + 4K read: IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec) + 4K write: IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec) + 128K read: IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec) + 128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec) + +fio for 20 secs with directio, qd of 8, 1 libaio thread: + - With LOCALIO: + 4K read: IOPS=230k, BW=898MiB/s (941MB/s)(17.5GiB/20001msec) + 4K write: IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec) + 128K read: IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec) + 128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec) + + - Without LOCALIO: + 4K read: IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec) + 4K write: IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec) + 128K read: IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec) + 128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec) + +FAQ +=== + +1. What are the use cases for LOCALIO? + + a. Workloads where the NFS client and server are on the same host + realize improved IO performance. In particular, it is common when + running containerised workloads for jobs to find themselves + running on the same host as the knfsd server being used for + storage. + +2. What are the requirements for LOCALIO? + + a. Bypass use of the network RPC protocol as much as possible. This + includes bypassing XDR and RPC for open, read, write and commit + operations. + b. Allow client and server to autonomously discover if they are + running local to each other without making any assumptions about + the local network topology. + c. Support the use of containers by being compatible with relevant + namespaces (e.g. network, user, mount). + d. Support all versions of NFS. NFSv3 is of particular importance + because it has wide enterprise usage and pNFS flexfiles makes use + of it for the data path. + +3. Why doesn’t LOCALIO just compare IP addresses or hostnames when + deciding if the NFS client and server are co-located on the same + host? + + Since one of the main use cases is containerised workloads, we cannot + assume that IP addresses will be shared between the client and + server. 
This sets up a requirement for a handshake protocol that + needs to go over the same connection as the NFS traffic in order to + identify that the client and the server really are running on the + same host. The handshake uses a secret that is sent over the wire, + and can be verified by both parties by comparing with a value stored + in shared kernel memory if they are truly co-located. + +4. Does LOCALIO improve pNFS flexfiles? + + Yes, LOCALIO complements pNFS flexfiles by allowing it to take + advantage of NFS client and server locality. Policy that initiates + client IO as closely to the server where the data is stored naturally + benefits from the data path optimization LOCALIO provides. + +5. Why not develop a new pNFS layout to enable LOCALIO? + + A new pNFS layout could be developed, but doing so would put the + onus on the server to somehow discover that the client is co-located + when deciding to hand out the layout. + There is value in a simpler approach (as provided by LOCALIO) that + allows the NFS client to negotiate and leverage locality without + requiring more elaborate modeling and discovery of such locality in a + more centralized manner. + +6. Why is having the client perform a server-side file OPEN, without + using RPC, beneficial? Is the benefit pNFS specific? + + Avoiding the use of XDR and RPC for file opens is beneficial to + performance regardless of whether pNFS is used. Especially when + dealing with small files its best to avoid going over the wire + whenever possible, otherwise it could reduce or even negate the + benefits of avoiding the wire for doing the small file I/O itself. + Given LOCALIO's requirements the current approach of having the + client perform a server-side file open, without using RPC, is ideal. + If in the future requirements change then we can adapt accordingly. + +7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)? + + Strong authentication is usually tied to the connection itself. It + works by establishing a context that is cached by the server, and + that acts as the key for discovering the authorisation token, which + can then be passed to rpc.mountd to complete the authentication + process. On the other hand, in the case of AUTH_UNIX, the credential + that was passed over the wire is used directly as the key in the + upcall to rpc.mountd. This simplifies the authentication process, and + so makes AUTH_UNIX easier to support. + +8. How do export options that translate RPC user IDs behave for LOCALIO + operations (eg. root_squash, all_squash)? + + Export options that translate user IDs are managed by nfsd_setuser() + which is called by nfsd_setuser_and_check_port() which is called by + __fh_verify(). So they get handled exactly the same way for LOCALIO + as they do for non-LOCALIO. + +9. How does LOCALIO make certain that object lifetimes are managed + properly given NFSD and NFS operate in different contexts? + + See the detailed "NFS Client and Server Interlock" section below. + +RPC +=== + +The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL" +RPC method that allows the Linux NFS client to verify the local Linux +NFS server can see the nonce (single-use UUID) the client generated and +made available in nfs_common. This protocol isn't part of an IETF +standard, nor does it need to be considering it is Linux-to-Linux +auxiliary RPC protocol that amounts to an implementation detail. + +The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of +the fixed UUID_SIZE (16 bytes). 
The fixed size opaque encode and decode +XDR methods are used instead of the less efficient variable sized +methods. + +The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned +by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ): +Linux Kernel Organization 400122 nfslocalio + +The LOCALIO protocol spec in rpcgen syntax is:: + + /* raw RFC 9562 UUID */ + #define UUID_SIZE 16 + typedef u8 uuid_t<UUID_SIZE>; + + program NFS_LOCALIO_PROGRAM { + version LOCALIO_V1 { + void + NULL(void) = 0; + + void + UUID_IS_LOCAL(uuid_t) = 1; + } = 1; + } = 400122; + +LOCALIO uses the same transport connection as NFS traffic. As such, +LOCALIO is not registered with rpcbind. + +NFS Common and Client/Server Handshake +====================================== + +fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client +to generate a nonce (single-use UUID) and associated short-lived +nfs_uuid_t struct, register it with nfs_common for subsequent lookup and +verification by the NFS server and if matched the NFS server populates +members in the nfs_uuid_t struct. The NFS client then uses nfs_common to +transfer the nfs_uuid_t from its nfs_uuids to the nn->nfsd_serv +clients_list from the nfs_common's uuids_list. See: +fs/nfs/localio.c:nfs_local_probe() + +nfs_common's nfs_uuids list is the basis for LOCALIO enablement, as such +it has members that point to nfsd memory for direct use by the client +(e.g. 'net' is the server's network namespace, through it the client can +access nn->nfsd_serv with proper rcu read access). It is this client +and server synchronization that enables advanced usage and lifetime of +objects to span from the host kernel's nfsd to per-container knfsd +instances that are connected to nfs client's running on the same local +host. + +NFS Client and Server Interlock +=============================== + +LOCALIO provides the nfs_uuid_t object and associated interfaces to +allow proper network namespace (net-ns) and NFSD object refcounting. + +LOCALIO required the introduction and use of NFSD's percpu nfsd_net_ref +to interlock nfsd_shutdown_net() and nfsd_open_local_fh(), to ensure +each net-ns is not destroyed while in use by nfsd_open_local_fh(), and +warrants a more detailed explanation: + + nfsd_open_local_fh() uses nfsd_net_try_get() before opening its + nfsd_file handle and then the caller (NFS client) must drop the + reference for the nfsd_file and associated net-ns using + nfsd_file_put_local() once it has completed its IO. + + This interlock working relies heavily on nfsd_open_local_fh() being + afforded the ability to safely deal with the possibility that the + NFSD's net-ns (and nfsd_net by association) may have been destroyed + by nfsd_destroy_serv() via nfsd_shutdown_net(). + +This interlock of the NFS client and server has been verified to fix an +easy to hit crash that would occur if an NFSD instance running in a +container, with a LOCALIO client mounted, is shutdown. Upon restart of +the container and associated NFSD, the client would go on to crash due +to NULL pointer dereference that occurred due to the LOCALIO client's +attempting to nfsd_open_local_fh() without having a proper reference on +NFSD's net-ns. + +NFS Client issues IO instead of Server +====================================== + +Because LOCALIO is focused on protocol bypass to achieve improved IO +performance, alternatives to the traditional NFS wire protocol (SUNRPC +with XDR) must be provided to access the backing filesystem. 
+ +See fs/nfs/localio.c:nfs_local_open_fh() and +fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes +focused use of select nfs server objects to allow a client local to a +server to open a file pointer without needing to go over the network. + +The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the +server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access +both the associated nfsd network namespace and nn->nfsd_serv in terms of +RCU. If nfsd_open_local_fh() finds that the client no longer sees valid +nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO +to nfs_local_open_fh() and the client will try to reestablish the +LOCALIO resources needed by calling nfs_local_probe() again. This +recovery is needed if/when an nfsd instance running in a container were +to reboot while a LOCALIO client is connected to it. + +Once the client has an open nfsd_file pointer it will issue reads, +writes and commits directly to the underlying local filesystem (normally +done by the nfs server). As such, for these operations, the NFS client +is issuing IO to the underlying local filesystem that it is sharing with +the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and +fs/nfs/localio.c:nfs_local_commit(). + +With normal NFS that makes use of RPC to issue IO to the server, if an +application uses O_DIRECT the NFS client will bypass the pagecache but +the NFS server will not. The NFS server's use of buffered IO affords +applications to be less precise with their alignment when issuing IO to +the NFS client. But if all applications properly align their IO, LOCALIO +can be configured to use end-to-end O_DIRECT semantics from the NFS +client to the underlying local filesystem, that it is sharing with +the NFS server, by setting the 'localio_O_DIRECT_semantics' nfs module +parameter to Y, e.g.: + + echo Y > /sys/module/nfs/parameters/localio_O_DIRECT_semantics + +Once enabled, it will cause LOCALIO to use end-to-end O_DIRECT semantics +(but again, this may cause IO to fail if applications do not properly +align their IO). + +Security +======== + +LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka +AUTH_SYS) is used. + +Care is taken to ensure the same NFS security mechanisms are used +(authentication, etc) regardless of whether LOCALIO or regular NFS +access is used. The auth_domain established as part of the traditional +NFS client access to the NFS server is also used for LOCALIO. + +Relative to containers, LOCALIO gives the client access to the network +namespace the server has. This is required to allow the client to access +the server's per-namespace nfsd_net struct. With traditional NFS, the +client is afforded this same level of access (albeit in terms of the NFS +protocol via SUNRPC). No other namespaces (user, mount, etc) have been +altered or purposely extended from the server to the client. + +Module Parameters +================= + +/sys/module/nfs/parameters/localio_enabled (bool) +controls if LOCALIO is enabled, defaults to Y. If client and server are +local but 'localio_enabled' is set to N then LOCALIO will not be used. + +/sys/module/nfs/parameters/localio_O_DIRECT_semantics (bool) +controls if O_DIRECT extends down to the underlying filesystem, defaults +to N. Application IO must be logical blocksize aligned, otherwise +O_DIRECT will fail. 
+ +/sys/module/nfsv3/parameters/nfs3_localio_probe_throttle (uint) +controls if NFSv3 read and write IOs will trigger (re)enabling of +LOCALIO every N (nfs3_localio_probe_throttle) IOs, defaults to 0 +(disabled). Must be power-of-2, admin keeps all the pieces if they +misconfigure (too low a value or non-power-of-2). + +Testing +======= + +The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write +and commit access have proven stable against various test scenarios: + +- Client and server both on the same host. + +- All permutations of client and server support enablement for both + local and remote client and server. + +- Testing against NFS storage products that don't support the LOCALIO + protocol was also performed. + +- Client on host, server within a container (for both v3 and v4.2). + The container testing was in terms of podman managed containers and + includes successful container stop/restart scenario. + +- Formalizing these test scenarios in terms of existing test + infrastructure is on-going. Initial regular coverage is provided in + terms of ktest running xfstests against a LOCALIO-enabled NFS loopback + mount configuration, and includes lockdep and KASAN coverage, see: + https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next + https://github.com/koverstreet/ktest + +- Various kdevops testing (in terms of "Chuck's BuildBot") has been + performed to regularly verify the LOCALIO changes haven't caused any + regressions to non-LOCALIO NFS use cases. + +- All of Hammerspace's various sanity tests pass with LOCALIO enabled + (this includes numerous pNFS and flexfiles tests). diff --git a/Documentation/filesystems/nfs/nfs-rdma.txt b/Documentation/filesystems/nfs/nfs-rdma.txt deleted file mode 100644 index e386f7e4bcee..000000000000 --- a/Documentation/filesystems/nfs/nfs-rdma.txt +++ /dev/null @@ -1,271 +0,0 @@ -################################################################################ -# # -# NFS/RDMA README # -# # -################################################################################ - - Author: NetApp and Open Grid Computing - Date: May 29, 2008 - -Table of Contents -~~~~~~~~~~~~~~~~~ - - Overview - - Getting Help - - Installation - - Check RDMA and NFS Setup - - NFS/RDMA Setup - -Overview -~~~~~~~~ - - This document describes how to install and setup the Linux NFS/RDMA client - and server software. - - The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server - was first included in the following release, Linux 2.6.25. - - In our testing, we have obtained excellent performance results (full 10Gbit - wire bandwidth at minimal client CPU) under many workloads. The code passes - the full Connectathon test suite and operates over both Infiniband and iWARP - RDMA adapters. - -Getting Help -~~~~~~~~~~~~ - - If you get stuck, you can ask questions on the - - nfs-rdma-devel@lists.sourceforge.net - - mailing list. - -Installation -~~~~~~~~~~~~ - - These instructions are a step by step guide to building a machine for - use with NFS/RDMA. - - - Install an RDMA device - - Any device supported by the drivers in drivers/infiniband/hw is acceptable. - - Testing has been performed using several Mellanox-based IB cards, the - Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter. 
- - - Install a Linux distribution and tools - - The first kernel release to contain both the NFS/RDMA client and server was - Linux 2.6.25 Therefore, a distribution compatible with this and subsequent - Linux kernel release should be installed. - - The procedures described in this document have been tested with - distributions from Red Hat's Fedora Project (http://fedora.redhat.com/). - - - Install nfs-utils-1.1.2 or greater on the client - - An NFS/RDMA mount point can be obtained by using the mount.nfs command in - nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils - version with support for NFS/RDMA mounts, but for various reasons we - recommend using nfs-utils-1.1.2 or greater). To see which version of - mount.nfs you are using, type: - - $ /sbin/mount.nfs -V - - If the version is less than 1.1.2 or the command does not exist, - you should install the latest version of nfs-utils. - - Download the latest package from: - - http://www.kernel.org/pub/linux/utils/nfs - - Uncompress the package and follow the installation instructions. - - If you will not need the idmapper and gssd executables (you do not need - these to create an NFS/RDMA enabled mount command), the installation - process can be simplified by disabling these features when running - configure: - - $ ./configure --disable-gss --disable-nfsv4 - - To build nfs-utils you will need the tcp_wrappers package installed. For - more information on this see the package's README and INSTALL files. - - After building the nfs-utils package, there will be a mount.nfs binary in - the utils/mount directory. This binary can be used to initiate NFS v2, v3, - or v4 mounts. To initiate a v4 mount, the binary must be called - mount.nfs4. The standard technique is to create a symlink called - mount.nfs4 to mount.nfs. - - This mount.nfs binary should be installed at /sbin/mount.nfs as follows: - - $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs - - In this location, mount.nfs will be invoked automatically for NFS mounts - by the system mount command. - - NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed - on the NFS client machine. You do not need this specific version of - nfs-utils on the server. Furthermore, only the mount.nfs command from - nfs-utils-1.1.2 is needed on the client. - - - Install a Linux kernel with NFS/RDMA - - The NFS/RDMA client and server are both included in the mainline Linux - kernel version 2.6.25 and later. This and other versions of the 2.6 Linux - kernel can be found at: - - ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ - - Download the sources and place them in an appropriate location. - - - Configure the RDMA stack - - Make sure your kernel configuration has RDMA support enabled. Under - Device Drivers -> InfiniBand support, update the kernel configuration - to enable InfiniBand support [NOTE: the option name is misleading. Enabling - InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)]. - - Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or - iWARP adapter support (amso, cxgb3, etc.). - - If you are using InfiniBand, be sure to enable IP-over-InfiniBand support. - - - Configure the NFS client and server - - Your kernel configuration must also have NFS file system support and/or - NFS server support enabled. These and other NFS related configuration - options can be found under File Systems -> Network File Systems. - - - Build, install, reboot - - The NFS/RDMA code will be enabled automatically if NFS and RDMA - are turned on. 
The NFS/RDMA client and server are configured via the hidden - SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The - value of SUNRPC_XPRT_RDMA will be: - - - N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client - and server will not be built - - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M, - in this case the NFS/RDMA client and server will be built as modules - - Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client - and server will be built into the kernel - - Therefore, if you have followed the steps above and turned no NFS and RDMA, - the NFS/RDMA client and server will be built. - - Build a new kernel, install it, boot it. - -Check RDMA and NFS Setup -~~~~~~~~~~~~~~~~~~~~~~~~ - - Before configuring the NFS/RDMA software, it is a good idea to test - your new kernel to ensure that the kernel is working correctly. - In particular, it is a good idea to verify that the RDMA stack - is functioning as expected and standard NFS over TCP/IP and/or UDP/IP - is working properly. - - - Check RDMA Setup - - If you built the RDMA components as modules, load them at - this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel - card: - - $ modprobe ib_mthca - $ modprobe ib_ipoib - - If you are using InfiniBand, make sure there is a Subnet Manager (SM) - running on the network. If your IB switch has an embedded SM, you can - use it. Otherwise, you will need to run an SM, such as OpenSM, on one - of your end nodes. - - If an SM is running on your network, you should see the following: - - $ cat /sys/class/infiniband/driverX/ports/1/state - 4: ACTIVE - - where driverX is mthca0, ipath5, ehca3, etc. - - To further test the InfiniBand software stack, use IPoIB (this - assumes you have two IB hosts named host1 and host2): - - host1$ ifconfig ib0 a.b.c.x - host2$ ifconfig ib0 a.b.c.y - host1$ ping a.b.c.y - host2$ ping a.b.c.x - - For other device types, follow the appropriate procedures. - - - Check NFS Setup - - For the NFS components enabled above (client and/or server), - test their functionality over standard Ethernet using TCP/IP or UDP/IP. - -NFS/RDMA Setup -~~~~~~~~~~~~~~ - - We recommend that you use two machines, one to act as the client and - one to act as the server. - - One time configuration: - - - On the server system, configure the /etc/exports file and - start the NFS/RDMA server. - - Exports entries with the following formats have been tested: - - /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash) - /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash) - - The IP address(es) is(are) the client's IPoIB address for an InfiniBand - HCA or the cleint's iWARP address(es) for an RNIC. - - NOTE: The "insecure" option must be used because the NFS/RDMA client does - not use a reserved port. 
- - Each time a machine boots: - - - Load and configure the RDMA drivers - - For InfiniBand using a Mellanox adapter: - - $ modprobe ib_mthca - $ modprobe ib_ipoib - $ ifconfig ib0 a.b.c.d - - NOTE: use unique addresses for the client and server - - - Start the NFS server - - If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in - kernel config), load the RDMA transport module: - - $ modprobe svcrdma - - Regardless of how the server was built (module or built-in), start the - server: - - $ /etc/init.d/nfs start - - or - - $ service nfs start - - Instruct the server to listen on the RDMA transport: - - $ echo rdma 20049 > /proc/fs/nfsd/portlist - - - On the client system - - If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in - kernel config), load the RDMA client module: - - $ modprobe xprtrdma.ko - - Regardless of how the client was built (module or built-in), use this - command to mount the NFS/RDMA server: - - $ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt - - To verify that the mount is using RDMA, run "cat /proc/mounts" and check - the "proto" field for the given mount. - - Congratulations! You're using NFS/RDMA! diff --git a/Documentation/filesystems/nfs/nfs.txt b/Documentation/filesystems/nfs/nfs.txt deleted file mode 100644 index f2571c8bef74..000000000000 --- a/Documentation/filesystems/nfs/nfs.txt +++ /dev/null @@ -1,136 +0,0 @@ - -The NFS client -============== - -The NFS version 2 protocol was first documented in RFC1094 (March 1989). -Since then two more major releases of NFS have been published, with NFSv3 -being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April -2003). - -The Linux NFS client currently supports all the above published versions, -and work is in progress on adding support for minor version 1 of the NFSv4 -protocol. - -The purpose of this document is to provide information on some of the -special features of the NFS client that can be configured by system -administrators. - - -The nfs4_unique_id parameter -============================ - -NFSv4 requires clients to identify themselves to servers with a unique -string. File open and lock state shared between one client and one server -is associated with this identity. To support robust NFSv4 state recovery -and transparent state migration, this identity string must not change -across client reboots. - -Without any other intervention, the Linux client uses a string that contains -the local system's node name. System administrators, however, often do not -take care to ensure that node names are fully qualified and do not change -over the lifetime of a client system. Node names can have other -administrative requirements that require particular behavior that does not -work well as part of an nfs_client_id4 string. - -The nfs.nfs4_unique_id boot parameter specifies a unique string that can be -used instead of a system's node name when an NFS client identifies itself to -a server. Thus, if the system's node name is not unique, or it changes, its -nfs.nfs4_unique_id stays the same, preventing collision with other clients -or loss of state during NFS reboot recovery or transparent state migration. - -The nfs.nfs4_unique_id string is typically a UUID, though it can contain -anything that is believed to be unique across all NFS clients. An -nfs4_unique_id string should be chosen when a client system is installed, -just as a system's root file system gets a fresh UUID in its label at -install time. 
- -The string should remain fixed for the lifetime of the client. It can be -changed safely if care is taken that the client shuts down cleanly and all -outstanding NFSv4 state has expired, to prevent loss of NFSv4 state. - -This string can be stored in an NFS client's grub.conf, or it can be provided -via a net boot facility such as PXE. It may also be specified as an nfs.ko -module parameter. Specifying a uniquifier string is not support for NFS -clients running in containers. - - -The DNS resolver -================ - -NFSv4 allows for one server to refer the NFS client to data that has been -migrated onto another server by means of the special "fs_locations" -attribute. See - http://tools.ietf.org/html/rfc3530#section-6 -and - http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00 - -The fs_locations information can take the form of either an ip address and -a path, or a DNS hostname and a path. The latter requires the NFS client to -do a DNS lookup in order to mount the new volume, and hence the need for an -upcall to allow userland to provide this service. - -Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual -/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps: - - (1) The process checks the dns_resolve cache to see if it contains a - valid entry. If so, it returns that entry and exits. - - (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent' - (may be changed using the 'nfs.cache_getent' kernel boot parameter) - is run, with two arguments: - - the cache name, "dns_resolve" - - the hostname to resolve - - (3) After looking up the corresponding ip address, the helper script - writes the result into the rpc_pipefs pseudo-file - '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel' - in the following (text) format: - - "<ip address> <hostname> <ttl>\n" - - Where <ip address> is in the usual IPv4 (123.456.78.90) or IPv6 - (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format. - <hostname> is identical to the second argument of the helper - script, and <ttl> is the 'time to live' of this cache entry (in - units of seconds). - - Note: If <ip address> is invalid, say the string "0", then a negative - entry is created, which will cause the kernel to treat the hostname - as having no valid DNS translation. - - - - -A basic sample /sbin/nfs_cache_getent -===================================== - -#!/bin/bash -# -ttl=600 -# -cut=/usr/bin/cut -getent=/usr/bin/getent -rpc_pipefs=/var/lib/nfs/rpc_pipefs -# -die() -{ - echo "Usage: $0 cache_name entry_name" - exit 1 -} - -[ $# -lt 2 ] && die -cachename="$1" -cache_path=${rpc_pipefs}/cache/${cachename}/channel - -case "${cachename}" in - dns_resolve) - name="$2" - result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )" - [ -z "${result}" ] && result="0" - ;; - *) - die - ;; -esac -echo "${result} ${name} ${ttl}" >${cache_path} - diff --git a/Documentation/filesystems/nfs/nfs41-server.rst b/Documentation/filesystems/nfs/nfs41-server.rst new file mode 100644 index 000000000000..16b5f02f81c3 --- /dev/null +++ b/Documentation/filesystems/nfs/nfs41-server.rst @@ -0,0 +1,256 @@ +============================= +NFSv4.1 Server Implementation +============================= + +Server support for minorversion 1 can be controlled using the +/proc/fs/nfsd/versions control file. The string output returned +by reading this file will contain either "+4.1" or "-4.1" +correspondingly. + +Currently, server support for minorversion 1 is enabled by default. 
+It can be disabled at run time by writing the string "-4.1" to +the /proc/fs/nfsd/versions control file. Note that to write this +control file, the nfsd service must be taken down. You can use rpc.nfsd +for this; see rpc.nfsd(8). + +(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and +"-4", respectively. Therefore, code meant to work on both new and old +kernels must turn 4.1 on or off *before* turning support for version 4 +on or off; rpc.nfsd does this correctly.) + +The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based +on RFC 5661. + +From the many new features in NFSv4.1 the current implementation +focuses on the mandatory-to-implement NFSv4.1 Sessions, providing +"exactly once" semantics and better control and throttling of the +resources allocated for each client. + +The table below, taken from the NFSv4.1 document, lists +the operations that are mandatory to implement (REQ), optional +(OPT), and NFSv4.0 operations that are required not to implement (MNI) +in minor version 1. The first column indicates the operations that +are not supported yet by the linux server implementation. + +The OPTIONAL features identified and their abbreviations are as follows: + +- **pNFS** Parallel NFS +- **FDELG** File Delegations +- **DDELG** Directory Delegations + +The following abbreviations indicate the linux server implementation status. + +- **I** Implemented NFSv4.1 operations. +- **NS** Not Supported. +- **NS\*** Unimplemented optional feature. + +Operations +========== + ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| Implementation status | Operation | REQ,REC, OPT or NMI | Feature (REQ, REC or OPT) | Definition | ++=======================+======================+=====================+===========================+================+ +| | ACCESS | REQ | | Section 18.1 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | BACKCHANNEL_CTL | REQ | | Section 18.33 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | CLOSE | REQ | | Section 18.2 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | COMMIT | REQ | | Section 18.3 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | CREATE | REQ | | Section 18.4 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | CREATE_SESSION | REQ | | Section 18.36 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | DELEGRETURN | OPT | FDELG, | Section 18.6 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | | | DDELG, pNFS | | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | | | (REQ) | | 
++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | DESTROY_CLIENTID | REQ | | Section 18.50 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | DESTROY_SESSION | REQ | | Section 18.37 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | EXCHANGE_ID | REQ | | Section 18.35 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | FREE_STATEID | REQ | | Section 18.38 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | GETATTR | REQ | | Section 18.7 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | GETFH | REQ | | Section 18.8 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LINK | OPT | | Section 18.9 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOCK | REQ | | Section 18.10 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOCKT | REQ | | Section 18.11 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOCKU | REQ | | Section 18.12 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOOKUP | REQ | | Section 18.13 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOOKUPP | REQ | | Section 18.14 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | NVERIFY | REQ | | Section 18.15 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | OPEN | REQ | | Section 18.16 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | OPENATTR | OPT | | Section 18.17 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | OPEN_CONFIRM | MNI 
| | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | OPEN_DOWNGRADE | REQ | | Section 18.18 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | PUTFH | REQ | | Section 18.19 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | PUTPUBFH | REQ | | Section 18.20 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | PUTROOTFH | REQ | | Section 18.21 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | READ | REQ | | Section 18.22 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | READDIR | REQ | | Section 18.23 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | READLINK | OPT | | Section 18.24 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RECLAIM_COMPLETE | REQ | | Section 18.51 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RELEASE_LOCKOWNER | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | REMOVE | REQ | | Section 18.25 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RENAME | REQ | | Section 18.26 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RENEW | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RESTOREFH | REQ | | Section 18.27 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SAVEFH | REQ | | Section 18.28 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SECINFO | REQ | | Section 18.29 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | | | layout (REQ) | Section 13.12 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | SEQUENCE | REQ | | Section 18.46 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SETATTR | REQ | | Section 18.30 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SETCLIENTID | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SETCLIENTID_CONFIRM | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS | SET_SSV | REQ | | Section 18.47 | 
++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | TEST_STATEID | REQ | | Section 18.48 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | VERIFY | REQ | | Section 18.31 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | WRITE | REQ | | Section 18.32 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ + + +Callback Operations +=================== ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| Implementation status | Operation | REQ,REC, OPT or NMI | Feature (REQ, REC or OPT) | Definition | ++=======================+=========================+=====================+===========================+===============+ +| | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_NOTIFY_LOCK | OPT | | Section 20.11 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | CB_RECALL | OPT | FDELG, | Section 20.2 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_RECALL_ANY | OPT | FDELG, | Section 20.6 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS | CB_RECALL_SLOT | REQ | | Section 20.8 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | 
++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ + + +Implementation notes: +===================== + +SSV: + The spec claims this is mandatory, but we don't actually know of any + implementations, so we're ignoring it for now. The server returns + NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof. + +GSS on the backchannel: + Again, theoretically required but not widely implemented (in + particular, the current Linux client doesn't request it). We return + NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION. + +DELEGPURGE: + mandatory only for servers that support CLAIM_DELEGATE_PREV and/or + CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that + persist across client reboots). Thus we need not implement this for + now. + +EXCHANGE_ID: + implementation ids are ignored + +CREATE_SESSION: + backchannel attributes are ignored + +SEQUENCE: + no support for dynamic slot table renegotiation (optional) + +Nonstandard compound limitations: + No support for a sessions fore channel RPC compound that requires both a + ca_maxrequestsize request and a ca_maxresponsesize reply, so we may + fail to live up to the promise we made in CREATE_SESSION fore channel + negotiation. + +See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues. diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt deleted file mode 100644 index 01c2db769791..000000000000 --- a/Documentation/filesystems/nfs/nfs41-server.txt +++ /dev/null @@ -1,196 +0,0 @@ -NFSv4.1 Server Implementation - -Server support for minorversion 1 can be controlled using the -/proc/fs/nfsd/versions control file. The string output returned -by reading this file will contain either "+4.1" or "-4.1" -correspondingly. - -Currently, server support for minorversion 1 is disabled by default. -It can be enabled at run time by writing the string "+4.1" to -the /proc/fs/nfsd/versions control file. Note that to write this -control file, the nfsd service must be taken down. Use your user-mode -nfs-utils to set this up; see rpc.nfsd(8) - -(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and -"-4", respectively. Therefore, code meant to work on both new and old -kernels must turn 4.1 on or off *before* turning support for version 4 -on or off; rpc.nfsd does this correctly.) - -The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based -on RFC 5661. 
- -From the many new features in NFSv4.1 the current implementation -focuses on the mandatory-to-implement NFSv4.1 Sessions, providing -"exactly once" semantics and better control and throttling of the -resources allocated for each client. - -Other NFSv4.1 features, Parallel NFS operations in particular, -are still under development out of tree. -See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design -for more information. - -The current implementation is intended for developers only: while it -does support ordinary file operations on clients we have tested against -(including the linux client), it is incomplete in ways which may limit -features unexpectedly, cause known bugs in rare cases, or cause -interoperability problems with future clients. Known issues: - - - gss support is questionable: currently mounts with kerberos - from a linux client are possible, but we aren't really - conformant with the spec (for example, we don't use kerberos - on the backchannel correctly). - - We do not support SSV, which provides security for shared - client-server state (thus preventing unauthorized tampering - with locks and opens, for example). It is mandatory for - servers to support this, though no clients use it yet. - -In addition, some limitations are inherited from the current NFSv4 -implementation: - - - Incomplete delegation enforcement: if a file is renamed or - unlinked by a local process, a client holding a delegation may - continue to indefinitely allow opens of the file under the old - name. - -The table below, taken from the NFSv4.1 document, lists -the operations that are mandatory to implement (REQ), optional -(OPT), and NFSv4.0 operations that are required not to implement (MNI) -in minor version 1. The first column indicates the operations that -are not supported yet by the linux server implementation. - -The OPTIONAL features identified and their abbreviations are as follows: - pNFS Parallel NFS - FDELG File Delegations - DDELG Directory Delegations - -The following abbreviations indicate the linux server implementation status. - I Implemented NFSv4.1 operations. - NS Not Supported. - NS* unimplemented optional feature. - P pNFS features implemented out of tree. - PNS pNFS features that are not supported yet (out of tree). 
- -Operations - - +----------------------+------------+--------------+----------------+ - | Operation | REQ, REC, | Feature | Definition | - | | OPT, or | (REQ, REC, | | - | | MNI | or OPT) | | - +----------------------+------------+--------------+----------------+ - | ACCESS | REQ | | Section 18.1 | -I | BACKCHANNEL_CTL | REQ | | Section 18.33 | -I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | - | CLOSE | REQ | | Section 18.2 | - | COMMIT | REQ | | Section 18.3 | - | CREATE | REQ | | Section 18.4 | -I | CREATE_SESSION | REQ | | Section 18.36 | -NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | - | DELEGRETURN | OPT | FDELG, | Section 18.6 | - | | | DDELG, pNFS | | - | | | (REQ) | | -I | DESTROY_CLIENTID | REQ | | Section 18.50 | -I | DESTROY_SESSION | REQ | | Section 18.37 | -I | EXCHANGE_ID | REQ | | Section 18.35 | -I | FREE_STATEID | REQ | | Section 18.38 | - | GETATTR | REQ | | Section 18.7 | -P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | -P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | - | GETFH | REQ | | Section 18.8 | -NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | -P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | -P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | -P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | - | LINK | OPT | | Section 18.9 | - | LOCK | REQ | | Section 18.10 | - | LOCKT | REQ | | Section 18.11 | - | LOCKU | REQ | | Section 18.12 | - | LOOKUP | REQ | | Section 18.13 | - | LOOKUPP | REQ | | Section 18.14 | - | NVERIFY | REQ | | Section 18.15 | - | OPEN | REQ | | Section 18.16 | -NS*| OPENATTR | OPT | | Section 18.17 | - | OPEN_CONFIRM | MNI | | N/A | - | OPEN_DOWNGRADE | REQ | | Section 18.18 | - | PUTFH | REQ | | Section 18.19 | - | PUTPUBFH | REQ | | Section 18.20 | - | PUTROOTFH | REQ | | Section 18.21 | - | READ | REQ | | Section 18.22 | - | READDIR | REQ | | Section 18.23 | - | READLINK | OPT | | Section 18.24 | - | RECLAIM_COMPLETE | REQ | | Section 18.51 | - | RELEASE_LOCKOWNER | MNI | | N/A | - | REMOVE | REQ | | Section 18.25 | - | RENAME | REQ | | Section 18.26 | - | RENEW | MNI | | N/A | - | RESTOREFH | REQ | | Section 18.27 | - | SAVEFH | REQ | | Section 18.28 | - | SECINFO | REQ | | Section 18.29 | -I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | - | | | layout (REQ) | Section 13.12 | -I | SEQUENCE | REQ | | Section 18.46 | - | SETATTR | REQ | | Section 18.30 | - | SETCLIENTID | MNI | | N/A | - | SETCLIENTID_CONFIRM | MNI | | N/A | -NS | SET_SSV | REQ | | Section 18.47 | -I | TEST_STATEID | REQ | | Section 18.48 | - | VERIFY | REQ | | Section 18.31 | -NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | - | WRITE | REQ | | Section 18.32 | - -Callback Operations - - +-------------------------+-----------+-------------+---------------+ - | Operation | REQ, REC, | Feature | Definition | - | | OPT, or | (REQ, REC, | | - | | MNI | or OPT) | | - +-------------------------+-----------+-------------+---------------+ - | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | -P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | -NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | -P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | -NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 | -NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | - | CB_RECALL | OPT | FDELG, | Section 20.2 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS | CB_RECALL_SLOT | REQ | | Section 20.8 | -NS*| CB_RECALLABLE_OBJ_AVAIL | OPT 
| DDELG, pNFS | Section 20.7 | - | | | (REQ) | | -I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | - | | | DDELG, pNFS | | - | | | (REQ) | | - +-------------------------+-----------+-------------+---------------+ - -Implementation notes: - -DELEGPURGE: -* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or - CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that - persist across client reboots). Thus we need not implement this for - now. - -EXCHANGE_ID: -* only SP4_NONE state protection supported -* implementation ids are ignored - -CREATE_SESSION: -* backchannel attributes are ignored - -SEQUENCE: -* no support for dynamic slot table renegotiation (optional) - -Nonstandard compound limitations: -* No support for a sessions fore channel RPC compound that requires both a - ca_maxrequestsize request and a ca_maxresponsesize reply, so we may - fail to live up to the promise we made in CREATE_SESSION fore channel - negotiation. -* No more than one read-like operation allowed per compound; encoding - replies that cross page boundaries (except for read data) not handled. - -See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues. diff --git a/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt b/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt deleted file mode 100644 index 56a96fb08a73..000000000000 --- a/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt +++ /dev/null @@ -1,41 +0,0 @@ -Administrative interfaces for nfsd -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Note that normally these interfaces are used only by the utilities in -nfs-utils. - -nfsd is controlled mainly by pseudofiles under the "nfsd" filesystem, -which is normally mounted at /proc/fs/nfsd/. - -The server is always started by the first write of a nonzero value to -nfsd/threads. - -Before doing that, NFSD can be told which sockets to listen on by -writing to nfsd/portlist; that write may be: - - - an ascii-encoded file descriptor, which should refer to a - bound (and listening, for tcp) socket, or - - "transportname port", where transportname is currently either - "udp", "tcp", or "rdma". - -If nfsd is started without doing any of these, then it will create one -udp and one tcp listener at port 2049 (see nfsd_init_socks). - -On startup, nfsd and lockd grace periods start. - -nfsd is shut down by a write of 0 to nfsd/threads. All locks and state -are thrown away at that point. - -Between startup and shutdown, the number of threads may be adjusted up -or down by additional writes to nfsd/threads or by writes to -nfsd/pool_threads. - -For more detail about files under nfsd/ and what they control, see -fs/nfsd/nfsctl.c; most of them have detailed comments. - -Implementation notes -^^^^^^^^^^^^^^^^^^^^ - -Note that the rpc server requires the caller to serialize addition and -removal of listening sockets, and startup and shutdown of the server. -For nfsd this is done using nfsd_mutex. diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst new file mode 100644 index 000000000000..0fd6e82478fe --- /dev/null +++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst @@ -0,0 +1,153 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +NFSD IO MODES +============= + +Overview +======== + +NFSD has historically always used buffered IO when servicing READ and +WRITE operations. 
BUFFERED is NFSD's default IO mode, but it is possible
+to override that default to use either DONTCACHE or DIRECT IO modes.
+
+Experimental NFSD debugfs interfaces are available to allow the NFSD IO
+mode used for READ and WRITE to be configured independently. See both:
+
+- /sys/kernel/debug/nfsd/io_cache_read
+- /sys/kernel/debug/nfsd/io_cache_write
+
+The default value for both io_cache_read and io_cache_write reflects
+NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
+
+Based on the configured settings, NFSD's IO will either be:
+
+- cached using page cache (NFSD_IO_BUFFERED=0)
+- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
+- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
+
+To set an NFSD IO mode, write a supported value (0 - 2) to the
+corresponding IO operation's debugfs interface, e.g.::
+
+  echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+  echo 2 > /sys/kernel/debug/nfsd/io_cache_write
+
+To check which IO mode NFSD is using for READ or WRITE, simply read the
+corresponding IO operation's debugfs interface, e.g.::
+
+  cat /sys/kernel/debug/nfsd/io_cache_read
+  cat /sys/kernel/debug/nfsd/io_cache_write
+
+If you experiment with NFSD's IO modes on a recent kernel and have
+interesting results, please report them to linux-nfs@vger.kernel.org
+
+NFSD DONTCACHE
+==============
+
+DONTCACHE offers a hybrid approach to servicing IO that aims to offer
+the benefits of using DIRECT IO without any of the strict alignment
+requirements that DIRECT IO imposes. To achieve this, buffered IO is used
+but the IO is flagged to "drop behind" (meaning associated pages are
+dropped from the page cache) when IO completes.
+
+DONTCACHE aims to avoid what has proven to be a fairly significant
+limitation of Linux's memory management subsystem if/when large amounts of
+data are infrequently accessed (e.g. read once _or_ written once but not
+read until much later). Such use-cases are particularly problematic
+because the page cache will eventually become a bottleneck to servicing
+new IO requests.
+
+For more context on DONTCACHE, please see these Linux commit headers:
+
+- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
+  to take a struct kiocb")
+- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
+  RWF_DONTCACHE")
+- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
+
+NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
+filesystem doesn't indicate support by setting FOP_DONTCACHE.
+
+NFSD DIRECT
+===========
+
+DIRECT IO doesn't make use of the page cache, as such it is able to
+avoid the Linux memory management's page reclaim scalability problems
+without resorting to the hybrid use of page cache that DONTCACHE does.
+
+Some workloads benefit from NFSD avoiding the page cache, particularly
+those with a working set that is significantly larger than available
+system memory. The pathological worst-case workload that NFSD DIRECT has
+proven to help most is: NFS client issuing large sequential IO to a file
+that is 2-3 times larger than the NFS server's available system memory.
+The reason for such improvement is NFSD DIRECT eliminates a lot of work
+that the memory management subsystem would otherwise be required to
+perform (e.g. page allocation, dirty writeback, page reclaim). When
+using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
+time trying to find adequate free pages so that forward IO progress can
+be made.
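+
+One lightweight way to observe this effect when comparing IO modes (a
+sketch only, not a formal benchmark; the vmstat counter names and the
+kswapd0/kcompactd0 thread names below assume a recent kernel and a
+single NUMA node) is to snapshot reclaim activity on the NFS server
+before and after a test run, e.g.::
+
+  grep -E 'pgscan_kswapd|pgscan_direct|pgsteal_kswapd' /proc/vmstat
+  ps -o comm,cputime -C kswapd0,kcompactd0
+
+Reclaim counters that climb steadily under NFSD_IO_BUFFERED but stay
+flat under NFSD_IO_DIRECT are a good indication that page reclaim was
+part of the bottleneck.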
+ +The performance win associated with using NFSD DIRECT was previously +discussed on linux-nfs, see: +https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/ + +But in summary: + +- NFSD DIRECT can significantly reduce memory requirements +- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work +- NFSD DIRECT can offer more deterministic IO performance + +As always, your mileage may vary and so it is important to carefully +consider if/when it is beneficial to make use of NFSD DIRECT. When +assessing comparative performance of your workload please be sure to log +relevant performance metrics during testing (e.g. memory usage, cpu +usage, IO performance). Using perf to collect perf data that may be used +to generate a "flamegraph" for work Linux must perform on behalf of your +test is a really meaningful way to compare the relative health of the +system and how switching NFSD's IO mode changes what is observed. + +If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to +NFSD's debugfs interfaces, ideally the IO will be aligned relative to +the underlying block device's logical_block_size. Also the memory buffer +used to store the READ or WRITE payload must be aligned relative to the +underlying block device's dma_alignment. + +But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best +it can: + +Misaligned READ: + If NFSD_IO_DIRECT is used, expand any misaligned READ to the next + DIO-aligned block (on either end of the READ). The expanded READ is + verified to have proper offset/len (logical_block_size) and + dma_alignment checking. + +Misaligned WRITE: + If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start, + middle and end as needed. The large middle segment is DIO-aligned + and the start and/or end are misaligned. Buffered IO is used for the + misaligned segments and O_DIRECT is used for the middle DIO-aligned + segment. DONTCACHE buffered IO is _not_ used for the misaligned + segments because using normal buffered IO offers significant RMW + performance benefit when handling streaming misaligned WRITEs. + +Tracing: + The nfsd_read_direct trace event shows how NFSD expands any + misaligned READ to the next DIO-aligned block (on either end of the + original READ, as needed). + + This combination of trace events is useful for READs:: + + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable + echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable + + The nfsd_write_direct trace event shows how NFSD splits a given + misaligned WRITE into a DIO-aligned middle segment. 
+ + This combination of trace events is useful for WRITEs:: + + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable + echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable diff --git a/Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst b/Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst new file mode 100644 index 000000000000..4d6b57dbab2a --- /dev/null +++ b/Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst @@ -0,0 +1,547 @@ +NFSD Maintainer Entry Profile +============================= + +A Maintainer Entry Profile supplements the top-level process +documents (found in Documentation/process/) with customs that are +specific to a subsystem and its maintainers. A contributor may use +this document to set their expectations and avoid common mistakes. +A maintainer may use these profiles to look across subsystems for +opportunities to converge on best common practices. + +Overview +-------- +The Network File System (NFS) is a standardized family of network +protocols that enable access to files across a set of network- +connected peer hosts. Applications on NFS clients access files that +reside on file systems that are shared by NFS servers. A single +network peer can act as both an NFS client and an NFS server. + +NFSD refers to the NFS server implementation included in the Linux +kernel. An in-kernel NFS server has fast access to files stored +in file systems local to that server. NFSD can share files stored +on most of the file system types native to Linux, including xfs, +ext4, btrfs, and tmpfs. + +Mailing list +------------ +The linux-nfs@vger.kernel.org mailing list is a public list. Its +purpose is to enable collaboration among developers working on the +Linux NFS stack, both client and server. It is not a place for +conversations that are not related directly to the Linux NFS stack. + +The linux-nfs mailing list is archived on `lore.kernel.org <https://lore.kernel.org/linux-nfs/>`_. + +The Linux NFS community does not have any chat room. + +Reporting bugs +-------------- +If you experience an NFSD-related bug on a distribution-built +kernel, please start by working with your Linux distributor. + +Bug reports against upstream Linux code bases are welcome on the +linux-nfs@vger.kernel.org mailing list, where some active triage +can be done. NFSD bugs may also be reported in the Linux kernel +community's bugzilla at: + + https://bugzilla.kernel.org + +Please file NFSD-related bugs under the "Filesystems/NFSD" +component. In general, including as much detail as possible is a +good start, including pertinent system log messages from both +the client and server. + +User space software related to NFSD, such as mountd or the exportfs +command, is contained in the nfs-utils package. Report problems +with those components to linux-nfs@vger.kernel.org. You might be +directed to move the report to a specific bug tracker. + +Contributor's Guide +------------------- + +Standards compliance +~~~~~~~~~~~~~~~~~~~~ +The priority is for NFSD to interoperate fully with the Linux NFS +client. We also test against other popular NFS client implementa- +tions regularly at NFS bake-a-thon events (also known as plug- +fests). Non-Linux NFS clients are not part of upstream NFSD CI/CD. 
+
+The NFSD community strives to provide an NFS server implementation
+that interoperates with all standards-compliant NFS client
+implementations. This is done by staying as close as is sensible to
+the normative mandates in the IETF's published NFS, RPC, and GSS-API
+standards.
+
+It is always useful to reference an RFC and section number in a code
+comment where behavior deviates from the standard (and even when the
+behavior is compliant but the implementation is obfuscatory).
+
+On the rare occasion when a deviation from standard-mandated
+behavior is needed, brief documentation of the use case or
+deficiencies in the standard is a required part of in-code
+documentation.
+
+Care must always be taken to avoid leaking local error codes (ie,
+errnos) to clients of NFSD. A proper NFS status code is always
+required in NFS protocol replies.
+
+NFSD administrative interfaces
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+NFSD administrative interfaces include:
+
+- an NFSD or SUNRPC module parameter
+
+- export options in /etc/exports
+
+- files under /proc/fs/nfsd/ or /proc/sys/sunrpc/
+
+- the NFSD netlink protocol
+
+Frequently, a request is made to introduce or modify one of NFSD's
+traditional administrative interfaces. Certainly it is technically
+easy to introduce a new administrative setting. However, there are
+good reasons why the NFSD maintainers prefer to leave that as a last
+resort:
+
+- As with any API, administrative interfaces are difficult to get
+  right.
+
+- Once they are documented and have a legacy of use, administrative
+  interfaces become difficult to modify or remove.
+
+- Every new administrative setting multiplies the NFSD test matrix.
+
+- The cost of one administrative interface is incremental, but costs
+  add up across all of the existing interfaces.
+
+It is often better for everyone if effort is made up front to
+understand the underlying requirement of the new setting, and
+then trying to make it tune itself (or to become otherwise
+unnecessary).
+
+If a new setting is indeed necessary, first consider adding it to
+the NFSD netlink protocol. Or if it doesn't need to be a reliable
+long term user space feature, it can be added to NFSD's menagerie of
+experimental settings which reside under /sys/kernel/debug/nfsd/ .
+
+Field observability
+~~~~~~~~~~~~~~~~~~~
+NFSD employs several different mechanisms for observing operation,
+including counters, printks, WARNings, and static trace points. Each
+has its strengths and weaknesses. Contributors should select the
+most appropriate tool for their task.
+
+- BUG must be avoided if at all possible, as it will frequently
+  result in a full system crash.
+
+- WARN is appropriate only when a full stack trace is useful.
+
+- printk can show detailed information. These must not be used
+  in code paths where they can be triggered repeatedly by remote
+  users.
+
+- dprintk can show detailed information, but can be enabled only
+  in pre-set groups. The overhead of emitting output makes dprintk
+  inappropriate for frequent operations like I/O.
+
+- Counters are always on, but provide little information about
+  individual events other than how frequently they occur.
+
+- static trace points can be enabled individually or in groups
+  (via a glob). These are generally low overhead, and thus are
+  favored for use in hot paths.
+
+- dynamic tracing, such as kprobes or eBPF, are quite flexible but
+  cannot be used in certain environments (eg, full kernel lock-
+  down).
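+
+As an illustration of the static trace point mechanism (a sketch; it
+assumes tracefs is mounted at /sys/kernel/tracing, and the glob shown
+is only an example; see /sys/kernel/tracing/events/nfsd/ for the
+current list of events)::
+
+  # enable every nfsd trace point
+  echo 1 > /sys/kernel/tracing/events/nfsd/enable
+  # or enable a subset by glob
+  echo 'nfsd:nfsd_read_*' >> /sys/kernel/tracing/set_event
+  cat /sys/kernel/tracing/trace_pipe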
+ +Testing +~~~~~~~ +The kdevops project + + https://github.com/linux-kdevops/kdevops + +contains several NFS-specific workflows, as well as the community +standard fstests suite. These workflows are based on open source +testing tools such as ltp and fio. Contributors are encouraged to +use these tools without kdevops, or contributors should install and +use kdevops themselves to verify their patches before submission. + +Coding style +~~~~~~~~~~~~ +Follow the coding style preferences described in + + Documentation/process/coding-style.rst + +with the following exceptions: + +- Add new local variables to a function in reverse Christmas tree + order + +- Use the kdoc comment style for + + non-static functions + + static inline functions + + static functions that are callbacks/virtual functions + +- All new function names start with ``nfsd_`` for non-NFS-version- + specific functions. + +- New function names that are specific to NFSv2 or NFSv3, or are + used by all minor versions of NFSv4, use ``nfsdN_`` where N is + the version. + +- New function names specific to an NFSv4 minor version can be + named with ``nfsd4M_`` where M is the minor version. + +Patch preparation +~~~~~~~~~~~~~~~~~ +Read and follow all guidelines in + + Documentation/process/submitting-patches.rst + +Use tagging to identify all patch authors. However, reviewers and +testers should be added by replying to the email patch submission. +Email is extensively used in order to publicly archive review and +testing attributions. These tags are automatically inserted into +your patches when they are applied. + +The code in the body of the diff already shows /what/ is being +changed. Thus it is not necessary to repeat that in the patch +description. Instead, the description should contain one or more +of: + +- A brief problem statement ("what is this patch trying to fix?") + with a root-cause analysis. + +- End-user visible symptoms or items that a support engineer might + use to search for the patch, like stack traces. + +- A brief explanation of why the patch is the best way to address + the problem. + +- Any context that reviewers might need to understand the changes + made by the patch. + +- Any relevant benchmarking results, and/or functional test results. + +As detailed in Documentation/process/submitting-patches.rst, +identify the point in history that the issue being addressed was +introduced by using a Fixes: tag. + +Mention in the patch description if that point in history cannot be +determined -- that is, no Fixes: tag can be provided. In this case, +please make it clear to maintainers whether an LTS backport is +needed even though there is no Fixes: tag. + +The NFSD maintainers prefer to add stable tagging themselves, after +public discussion in response to the patch submission. Contributors +may suggest stable tagging, but be aware that many version +management tools add such stable Cc's when you post your patches. +Don't add "Cc: stable" unless you are absolutely sure the patch +needs to go to stable during the initial submission process. + +Patch submission +~~~~~~~~~~~~~~~~ +Patches to NFSD are submitted via the kernel's email-based review +process that is common to most other kernel subsystems. + +Just before each submission, rebase your patch or series on the +nfsd-testing branch at + + https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git + +The NFSD subsystem is maintained separately from the Linux in-kernel +NFS client. 
The NFSD maintainers do not normally take submissions
+for client changes, nor can they respond authoritatively to bug
+reports or feature requests for NFS client code.
+
+This means that contributors might be asked to resubmit patches if
+they were emailed to the incorrect set of maintainers and reviewers.
+This is not a rejection, but simply a correction of the submission
+process.
+
+When in doubt, consult the NFSD entry in the MAINTAINERS file to
+see which files and directories fall under the NFSD subsystem.
+
+The proper set of email addresses for NFSD patches is:
+
+To: the NFSD maintainers and reviewers listed in MAINTAINERS
+Cc: linux-nfs@vger.kernel.org and optionally linux-kernel@
+
+If there are other subsystems involved in the patches (for example
+MM or RDMA) their primary mailing list address can be included in
+the Cc: field. Other contributors and interested parties may be
+included there as well.
+
+In general we prefer that contributors use common patch email tools
+such as "git send-email" or "stg email format/send", which tend to
+get the details right without a lot of fuss.
+
+A series consisting of a single patch is not required to have a
+cover letter. However, a cover letter can be included if there is
+substantial context that is not appropriate to include in the
+patch description.
+
+Please note that, with an e-mail based submission process, series
+cover letters are not part of the work that is committed to the
+kernel source code base or its commit history. Therefore always try
+to keep pertinent information in the patch descriptions.
+
+Design documentation is welcome, but as cover letters are not
+preserved, a perhaps better option is to include a patch that adds
+such documentation under Documentation/filesystems/nfs/.
+
+Reviewers will ask about test coverage and what use cases the
+patches are expected to address. Please be prepared to answer these
+questions.
+
+Review comments from maintainers might be politely stated, but in
+general, these are not optional to address when they are actionable.
+If necessary, the maintainers retain the right to not apply patches
+when contributors refuse to address reasonable requests.
+
+Post changes to kernel source code and user space source code as
+separate series. You can connect the two series with comments in
+your cover letters.
+
+Generally the NFSD maintainers ask for a repost even for simple
+modifications in order to publicly archive the request and the
+resulting repost before it is pulled into the NFSD trees. This
+also enables us to rebuild a patch series quickly without missing
+changes that might have been discussed via email.
+
+Avoid frequently reposting large series with only small changes. As
+a rule of thumb, posting substantial changes more than once a week
+will result in reviewer overload.
+
+Remember, there are only a handful of subsystem maintainers and
+reviewers, but potentially many sources of contributions. The
+maintainers and reviewers, therefore, are always the less scalable
+resource. Be kind to your friendly neighborhood maintainer.
+
+Patch Acceptance
+~~~~~~~~~~~~~~~~
+There isn't a formal review process for NFSD, but we like to see
+at least two Reviewed-by: notices for patches that are more than
+simple clean-ups. Reviews are done in public on
+linux-nfs@vger.kernel.org and are archived on lore.kernel.org.
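+
+To recap the submission mechanics described above, a typical flow might
+look like the following (a sketch only; the patch count is arbitrary
+and the To: list is a placeholder, always take the real addresses from
+the MAINTAINERS file)::
+
+  git fetch https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git nfsd-testing
+  git rebase FETCH_HEAD
+  git format-patch --cover-letter -o outgoing/ -3
+  git send-email --to="<maintainers listed in MAINTAINERS>" \
+      --cc=linux-nfs@vger.kernel.org outgoing/*.patch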
+ +Currently the NFSD patch queues are maintained in branches here: + + https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git + +The NFSD maintainers apply patches initially to the nfsd-testing +branch, which is always open to new submissions. Patches can be +applied while review is ongoing. nfsd-testing is a topic branch, +so it can change frequently, it will be rebased, and your patch +might get dropped if there is a problem with it. + +Generally a script-generated "thank you" email will indicate when +your patch has been added to the nfsd-testing branch. You can track +the progress of your patch using the linux-nfs patchworks instance: + + https://patchwork.kernel.org/project/linux-nfs/list/ + +While your patch is in nfsd-testing, it is exposed to a variety of +test environments, including community zero-day bots, static +analysis tools, and NFSD continuous integration testing. The soak +period is three to four weeks. + +Each patch that survives in nfsd-testing for the soak period without +changes is moved to the nfsd-next branch. + +The nfsd-next branch is automatically merged into linux-next and +fs-next on a nightly basis. + +Patches that survive in nfsd-next are included in the next NFSD +merge window pull request. These windows typically occur once every +63 days (nine weeks). + +When the upstream merge window closes, the nfsd-next branch is +renamed nfsd-fixes, and a new nfsd-next branch is created, based on +the upstream -rc1 tag. + +Fixes that are destined for an upstream -rc release also run the +nfsd-testing gauntlet, but are then applied to the nfsd-fixes +branch. That branch is made available for Linus to pull after a +short time. In order to limit the risk of introducing regressions, +we limit such fixes to emergency situations or fixes to breakage +that occurred during the most recent upstream merge. + +Please make it clear when submitting an emergency patch that +immediate action (either application to -rc or LTS backport) is +needed. + +Sensitive patch submissions and bug reports +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +CVEs are generated by specific members of the Linux kernel community +and several external entities. The Linux NFS community does not emit +or assign CVEs. CVEs are assigned after an issue and its fix are +known. + +However, the NFSD maintainers sometimes receive sensitive security +reports, and at times these are significant enough to need to be +embargoed. In such rare cases, fixes can be developed and reviewed +out of the public eye. + +Please be aware that many version management tools add the stable +Cc's when you post your patches. This is generally a nuisance, but +it can result in outing an embargoed security issue accidentally. +Don't add "Cc: stable" unless you are absolutely sure the patch +needs to go to stable@ during the initial submission process. + +Patches that are merged without ever appearing on any list, and +which carry a Reported-by: or Fixes: tag are detected as suspicious +by security-focused people. We encourage that, after any private +review, security-sensitive patches should be posted to linux-nfs@ +for the usual public review, archiving, and test period. + +LLM-generated submissions +~~~~~~~~~~~~~~~~~~~~~~~~~ +The Linux kernel community as a whole is still exploring the new +world of LLM-generated code. The NFSD maintainers will entertain +submission of patches that are partially or wholly generated by +LLM-based development tools. 
Such submissions are held to the +same standards as submissions created entirely by human authors: + +- The human contributor identifies themselves via a Signed-off-by: + tag. This tag counts as a DoC. + +- The human contributor is solely responsible for code provenance + and any contamination by inadvertently-included code with a + conflicting license, as usual. + +- The human contributor must be able to answer and address review + questions. A patch description such as "This fixed my problem + but I don't know why" is not acceptable. + +- The contribution is subjected to the same test regimen as all + other submissions. + +- An indication (via a Generated-by: tag or otherwise) that the + contribution is LLM-generated is not required. + +It is easy to address review comments and fix requests in LLM +generated code. So easy, in fact, that it becomes tempting to repost +refreshed code immediately. Please resist that temptation. + +As always, please avoid reposting series revisions more than once +every 24 hours. + +Clean-up patches +~~~~~~~~~~~~~~~~ +The NFSD maintainers discourage patches which perform simple clean- +ups, which are not in the context of other work. For example: + +* Addressing ``checkpatch.pl`` warnings after merge +* Addressing :ref:`Local variable ordering<rcs>` issues +* Addressing long-standing whitespace damage + +This is because it is felt that the churn that such changes produce +comes at a greater cost than the value of such clean-ups. + +Conversely, spelling and grammar fixes are encouraged. + +Stable and LTS support +---------------------- +Upstream NFSD continuous integration testing runs against LTS trees +whenever they are updated. + +Please indicate when a patch containing a fix needs to be considered +for LTS kernels, either via a Fixes: tag or explicit mention. + +Feature requests +---------------- +There is no one way to make an official feature request, but +discussion about the request should eventually make its way to +the linux-nfs@vger.kernel.org mailing list for public review by +the community. + +Subsystem boundaries +~~~~~~~~~~~~~~~~~~~~ +NFSD itself is not much more than a protocol engine. This means its +primary responsibility is to translate the NFS protocol into API +calls in the Linux kernel. For example, NFSD is not responsible for +knowing exactly how bytes or file attributes are managed on a block +device. It relies on other kernel subsystems for that. + +If the subsystems on which NFSD relies do not implement a particular +feature, even if the standard NFS protocols do support that feature, +that usually means NFSD cannot provide that feature without +substantial development work in other areas of the kernel. + +Specificity +~~~~~~~~~~~ +Feature requests can come from anywhere, and thus can often be +nebulous. A requester might not understand what a "use case" or +"user story" is. These descriptive paradigms are often used by +developers and architects to understand what is required of a +design, but are terms of art in the software trade, not used in +the everyday world. + +In order to prevent contributors and maintainers from becoming +overwhelmed, we won't be afraid of saying "no" politely to +underspecified requests. + +Community roles and their authority +----------------------------------- +The purpose of Linux subsystem communities is to provide expertise +and active stewardship of a narrow set of source files in the Linux +kernel. This can include managing user space tooling as well. 
+ +To contextualize the structure of the Linux NFS community that +is responsible for stewardship of the NFS server code base, we +define the community roles here. + +- **Contributor** : Anyone who submits a code change, bug fix, + recommendation, documentation fix, and so on. A contributor can + submit regularly or infrequently. + +- **Outside Contributor** : A contributor who is not a regular actor + in the Linux NFS community. This can mean someone who contributes + to other parts of the kernel, or someone who just noticed a + misspelling in a comment and sent a patch. + +- **Reviewer** : Someone who is named in the MAINTAINERS file as a + reviewer is an area expert who can request changes to contributed + code, and expects that contributors will address the request. + +- **External Reviewer** : Someone who is not named in the + MAINTAINERS file as a reviewer, but who is an area expert. + Examples include Linux kernel contributors with networking, + security, or persistent storage expertise, or developers who + contribute primarily to other NFS implementations. + +One or more people will take on the following roles. These people +are often generically referred to as "maintainers", and are +identified in the MAINTAINERS file with the "M:" tag under the NFSD +subsystem. + +- **Upstream Release Manager** : This role is responsible for + curating contributions into a branch, reviewing test results, and + then sending a pull request during merge windows. There is a + trust relationship between the release manager and Linus. + +- **Bug Triager** : Someone who is a first responder to bug reports + submitted to the linux-nfs mailing list or bug trackers, and helps + troubleshoot and identify next steps. + +- **Security Lead** : The security lead handles contacts from the + security community to resolve immediate issues, as well as dealing + with long-term security issues such as supply chain concerns. For + upstream, that's usually whether contributions violate licensing + or other intellectual property agreements. + +- **Testing Lead** : The testing lead builds and runs the test + infrastructure for the subsystem. The testing lead may ask for + patches to be dropped because of ongoing high defect rates. + +- **LTS Maintainer** : The LTS maintainer is responsible for managing + the Fixes: and Cc: stable annotations on patches, and seeing that + patches that cannot be automatically applied to LTS kernels get + proper manual backports as necessary. + +- **Community Manager** : This umpire role can be asked to call balls + and strikes during conflicts, but is also responsible for ensuring + the health of the relationships within the community and for + facilitating discussions on long-term topics such as how to manage + growing technical debt. diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt deleted file mode 100644 index 2d66ed688125..000000000000 --- a/Documentation/filesystems/nfs/nfsroot.txt +++ /dev/null @@ -1,302 +0,0 @@ -Mounting the root filesystem via NFS (nfsroot) -=============================================== - -Written 1996 by Gero Kuhlmann <gero@gkminix.han.de> -Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz> -Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org> -Updated 2006 by Horms <horms@verge.net.au> - - - -In order to use a diskless system, such as an X-terminal or printer server -for example, it is necessary for the root filesystem to be present on a -non-disk device. 
This may be an initramfs (see Documentation/filesystems/ -ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a -filesystem mounted via NFS. The following text describes on how to use NFS -for the root filesystem. For the rest of this text 'client' means the -diskless system, and 'server' means the NFS server. - - - - -1.) Enabling nfsroot capabilities - ----------------------------- - -In order to use nfsroot, NFS client support needs to be selected as -built-in during configuration. Once this has been selected, the nfsroot -option will become available, which should also be selected. - -In the networking options, kernel level autoconfiguration can be selected, -along with the types of autoconfiguration to support. Selecting all of -DHCP, BOOTP and RARP is safe. - - - - -2.) Kernel command line - ------------------- - -When the kernel has been loaded by a boot loader (see below) it needs to be -told what root fs device to use. And in the case of nfsroot, where to find -both the server and the name of the directory on the server to mount as root. -This can be established using the following kernel command line parameters: - - -root=/dev/nfs - - This is necessary to enable the pseudo-NFS-device. Note that it's not a - real device but just a synonym to tell the kernel to use NFS instead of - a real device. - - -nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] - - If the `nfsroot' parameter is NOT given on the command line, - the default "/tftpboot/%s" will be used. - - <server-ip> Specifies the IP address of the NFS server. - The default address is determined by the `ip' parameter - (see below). This parameter allows the use of different - servers for IP autoconfiguration and NFS. - - <root-dir> Name of the directory on the server to mount as root. - If there is a "%s" token in the string, it will be - replaced by the ASCII-representation of the client's - IP address. - - <nfs-options> Standard NFS options. All options are separated by commas. - The following defaults are used: - port = as given by server portmap daemon - rsize = 4096 - wsize = 4096 - timeo = 7 - retrans = 3 - acregmin = 3 - acregmax = 60 - acdirmin = 30 - acdirmax = 60 - flags = hard, nointr, noposix, cto, ac - - -ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: - <dns0-ip>:<dns1-ip> - - This parameter tells the kernel how to configure IP addresses of devices - and also how to set up the IP routing table. It was originally called - `nfsaddrs', but now the boot-time IP configuration works independently of - NFS, so it was renamed to `ip' and the old name remained as an alias for - compatibility reasons. - - If this parameter is missing from the kernel command line, all fields are - assumed to be empty, and the defaults mentioned below apply. In general - this means that the kernel tries to configure everything using - autoconfiguration. - - The <autoconf> parameter can appear alone as the value to the `ip' - parameter (without all the ':' characters before). If the value is - "ip=off" or "ip=none", no autoconfiguration will take place, otherwise - autoconfiguration will take place. The most common way to use this - is "ip=dhcp". - - <client-ip> IP address of the client. - - Default: Determined using autoconfiguration. - - <server-ip> IP address of the NFS server. If RARP is used to determine - the client address and this parameter is NOT empty only - replies from the specified server are accepted. - - Only required for NFS root. 
That is autoconfiguration - will not be triggered if it is missing and NFS root is not - in operation. - - Default: Determined using autoconfiguration. - The address of the autoconfiguration server is used. - - <gw-ip> IP address of a gateway if the server is on a different subnet. - - Default: Determined using autoconfiguration. - - <netmask> Netmask for local network interface. If unspecified - the netmask is derived from the client IP address assuming - classful addressing. - - Default: Determined using autoconfiguration. - - <hostname> Name of the client. May be supplied by autoconfiguration, - but its absence will not trigger autoconfiguration. - If specified and DHCP is used, the user provided hostname will - be carried in the DHCP request to hopefully update DNS record. - - Default: Client IP address is used in ASCII notation. - - <device> Name of network device to use. - - Default: If the host only has one device, it is used. - Otherwise the device is determined using - autoconfiguration. This is done by sending - autoconfiguration requests out of all devices, - and using the device that received the first reply. - - <autoconf> Method to use for autoconfiguration. In the case of options - which specify multiple autoconfiguration protocols, - requests are sent using all protocols, and the first one - to reply is used. - - Only autoconfiguration protocols that have been compiled - into the kernel will be used, regardless of the value of - this option. - - off or none: don't use autoconfiguration - (do static IP assignment instead) - on or any: use any protocol available in the kernel - (default) - dhcp: use DHCP - bootp: use BOOTP - rarp: use RARP - both: use both BOOTP and RARP but not DHCP - (old option kept for backwards compatibility) - - Default: any - - <dns0-ip> IP address of first nameserver. - Value gets exported by /proc/net/pnp which is often linked - on embedded systems by /etc/resolv.conf. - - <dns1-ip> IP address of secound nameserver. - Same as above. - - -nfsrootdebug - - This parameter enables debugging messages to appear in the kernel - log at boot time so that administrators can verify that the correct - NFS mount options, server address, and root path are passed to the - NFS client. - - -rdinit=<executable file> - - To specify which file contains the program that starts system - initialization, administrators can use this command line parameter. - The default value of this parameter is "/init". If the specified - file exists and the kernel can execute it, root filesystem related - kernel command line parameters, including `nfsroot=', are ignored. - - A description of the process of mounting the root file system can be - found in: - - Documentation/early-userspace/README - - - - -3.) Boot Loader - ---------- - -To get the kernel into memory different approaches can be used. -They depend on various facilities being available: - - -3.1) Booting from a floppy using syslinux - - When building kernels, an easy way to create a boot floppy that uses - syslinux is to use the zdisk or bzdisk make targets which use zimage - and bzimage images respectively. Both targets accept the - FDARGS parameter which can be used to set the kernel command line. - - e.g. 
- make bzdisk FDARGS="root=/dev/nfs" - - Note that the user running this command will need to have - access to the floppy drive device, /dev/fd0 - - For more information on syslinux, including how to create bootdisks - for prebuilt kernels, see http://syslinux.zytor.com/ - - N.B: Previously it was possible to write a kernel directly to - a floppy using dd, configure the boot device using rdev, and - boot using the resulting floppy. Linux no longer supports this - method of booting. - -3.2) Booting from a cdrom using isolinux - - When building kernels, an easy way to create a bootable cdrom that - uses isolinux is to use the isoimage target which uses a bzimage - image. Like zdisk and bzdisk, this target accepts the FDARGS - parameter which can be used to set the kernel command line. - - e.g. - make isoimage FDARGS="root=/dev/nfs" - - The resulting iso image will be arch/<ARCH>/boot/image.iso - This can be written to a cdrom using a variety of tools including - cdrecord. - - e.g. - cdrecord dev=ATAPI:1,0,0 arch/x86/boot/image.iso - - For more information on isolinux, including how to create bootdisks - for prebuilt kernels, see http://syslinux.zytor.com/ - -3.2) Using LILO - When using LILO all the necessary command line parameters may be - specified using the 'append=' directive in the LILO configuration - file. - - However, to use the 'root=' directive you also need to create - a dummy root device, which may be removed after LILO is run. - - mknod /dev/boot255 c 0 255 - - For information on configuring LILO, please refer to its documentation. - -3.3) Using GRUB - When using GRUB, kernel parameter are simply appended after the kernel - specification: kernel <kernel> <parameters> - -3.4) Using loadlin - loadlin may be used to boot Linux from a DOS command prompt without - requiring a local hard disk to mount as root. This has not been - thoroughly tested by the authors of this document, but in general - it should be possible configure the kernel command line similarly - to the configuration of LILO. - - Please refer to the loadlin documentation for further information. - -3.5) Using a boot ROM - This is probably the most elegant way of booting a diskless client. - With a boot ROM the kernel is loaded using the TFTP protocol. The - authors of this document are not aware of any no commercial boot - ROMs that support booting Linux over the network. However, there - are two free implementations of a boot ROM, netboot-nfs and - etherboot, both of which are available on sunsite.unc.edu, and both - of which contain everything you need to boot a diskless Linux client. - -3.6) Using pxelinux - Pxelinux may be used to boot linux using the PXE boot loader - which is present on many modern network cards. - - When using pxelinux, the kernel image is specified using - "kernel <relative-path-below /tftpboot>". The nfsroot parameters - are passed to the kernel by adding them to the "append" line. - It is common to use serial console in conjunction with pxeliunx, - see Documentation/serial-console.txt for more information. - - For more information on isolinux, including how to create bootdisks - for prebuilt kernels, see http://syslinux.zytor.com/ - - - - -4.) Credits - ------- - - The nfsroot code in the kernel and the RARP support have been written - by Gero Kuhlmann <gero@gkminix.han.de>. - - The rest of the IP layer autoconfiguration code has been written - by Martin Mares <mj@atrey.karlin.mff.cuni.cz>. 
-
- In order to write the initial version of nfsroot I would like to thank
- Jens-Uwe Mager <jum@anubis.han.de> for his help. diff --git a/Documentation/filesystems/nfs/pnfs.rst b/Documentation/filesystems/nfs/pnfs.rst new file mode 100644 index 000000000000..7c470ecdc3a9 --- /dev/null +++ b/Documentation/filesystems/nfs/pnfs.rst @@ -0,0 +1,78 @@ +==========================
+Reference counting in pnfs
+==========================
+
+There are several inter-related caches. We have layouts which can
+reference multiple devices, each of which can reference multiple data servers.
+Each data server can be referenced by multiple devices. Each device
+can be referenced by multiple layouts. To keep all of this straight,
+we need to reference count.
+
+
+struct pnfs_layout_hdr
+======================
+
+The on-the-wire command LAYOUTGET corresponds to struct
+pnfs_layout_segment, usually referred to by the variable name lseg.
+Each nfs_inode may hold a pointer to a cache of these layout
+segments in nfsi->layout, of type struct pnfs_layout_hdr.
+
+We reference the header for the inode pointing to it, across each
+outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
+LAYOUTCOMMIT), and for each lseg held within.
+
+Each header is also (when non-empty) put on a list associated with
+struct nfs_client (cl_layouts). Being put on this list does not bump
+the reference count, as the layout is kept around by the lseg that
+keeps it in the list.
+
+deviceid_cache
+==============
+
+lsegs reference device ids, which are resolved per nfs_client and
+layout driver type. The device ids are held in an RCU cache (struct
+nfs4_deviceid_cache). The cache itself is referenced across each
+mount. The entries (struct nfs4_deviceid) themselves are held across
+the lifetime of each lseg referencing them.
+
+RCU is used because the deviceid is basically a write once, read many
+data structure. The hlist size of 32 buckets needs better
+justification, but seems reasonable given that we can have multiple
+deviceid's per filesystem, and multiple filesystems per nfs_client.
+
+The hash code is copied from the nfsd code base. A discussion of
+hashing and variations of this algorithm can be found `here
+<http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809>`_.
+
+data server cache
+=================
+
+file driver devices refer to data servers, which are kept in a module
+level cache. Its reference is held over the lifetime of the deviceid
+pointing to it.
+
+lseg
+====
+
+lseg maintains an extra reference corresponding to the NFS_LSEG_VALID
+bit which holds it in the pnfs_layout_hdr's list. When the final lseg
+is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED
+bit is set, preventing any new lsegs from being added.
+
+layout drivers
+==============
+
+PNFS utilizes what is called layout drivers. The STD defines 4 basic
+layout types: "files", "objects", "blocks", and "flexfiles". For each
+of these types there is a layout-driver with a common function-vectors
+table which are called by the nfs-client pnfs-core to implement the
+different layout types.
+
+Files-layout-driver code is in: fs/nfs/filelayout/.. directory
+Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory
+Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/..
directory + +blocks-layout setup +=================== + +TODO: Document the setup needs of the blocks layout driver diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt deleted file mode 100644 index 52ae07f5f578..000000000000 --- a/Documentation/filesystems/nfs/pnfs.txt +++ /dev/null @@ -1,109 +0,0 @@ -Reference counting in pnfs: -========================== - -The are several inter-related caches. We have layouts which can -reference multiple devices, each of which can reference multiple data servers. -Each data server can be referenced by multiple devices. Each device -can be referenced by multiple layouts. To keep all of this straight, -we need to reference count. - - -struct pnfs_layout_hdr ----------------------- -The on-the-wire command LAYOUTGET corresponds to struct -pnfs_layout_segment, usually referred to by the variable name lseg. -Each nfs_inode may hold a pointer to a cache of of these layout -segments in nfsi->layout, of type struct pnfs_layout_hdr. - -We reference the header for the inode pointing to it, across each -outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN, -LAYOUTCOMMIT), and for each lseg held within. - -Each header is also (when non-empty) put on a list associated with -struct nfs_client (cl_layouts). Being put on this list does not bump -the reference count, as the layout is kept around by the lseg that -keeps it in the list. - -deviceid_cache --------------- -lsegs reference device ids, which are resolved per nfs_client and -layout driver type. The device ids are held in a RCU cache (struct -nfs4_deviceid_cache). The cache itself is referenced across each -mount. The entries (struct nfs4_deviceid) themselves are held across -the lifetime of each lseg referencing them. - -RCU is used because the deviceid is basically a write once, read many -data structure. The hlist size of 32 buckets needs better -justification, but seems reasonable given that we can have multiple -deviceid's per filesystem, and multiple filesystems per nfs_client. - -The hash code is copied from the nfsd code base. A discussion of -hashing and variations of this algorithm can be found at: -http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809 - -data server cache ------------------ -file driver devices refer to data servers, which are kept in a module -level cache. Its reference is held over the lifetime of the deviceid -pointing to it. - -lseg ----- -lseg maintains an extra reference corresponding to the NFS_LSEG_VALID -bit which holds it in the pnfs_layout_hdr's list. When the final lseg -is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED -bit is set, preventing any new lsegs from being added. - -layout drivers --------------- - -PNFS utilizes what is called layout drivers. The STD defines 3 basic -layout types: "files" "objects" and "blocks". For each of these types -there is a layout-driver with a common function-vectors table which -are called by the nfs-client pnfs-core to implement the different layout -types. - -Files-layout-driver code is in: fs/nfs/nfs4filelayout.c && nfs4filelayoutdev.c -Objects-layout-deriver code is in: fs/nfs/objlayout/.. directory -Blocks-layout-deriver code is in: fs/nfs/blocklayout/.. directory - -objects-layout setup --------------------- - -As part of the full STD implementation the objlayoutdriver.ko needs, at times, -to automatically login to yet undiscovered iscsi/osd devices. 
For this the
-driver makes up-calles to a user-mode script called *osd_login*
-
-The path_name of the script to use is by default:
- /sbin/osd_login.
-This name can be overridden by the Kernel module parameter:
- objlayoutdriver.osd_login_prog
-
-If Kernel does not find the osd_login_prog path it will zero it out
-and will not attempt farther logins. An admin can then write new value
-to the objlayoutdriver.osd_login_prog Kernel parameter to re-enable it.
-
-The /sbin/osd_login is part of the nfs-utils package, and should usually
-be installed on distributions that support this Kernel version.
-
-The API to the login script is as follows:
- Usage: $0 -u <URI> -o <OSDNAME> -s <SYSTEMID>
- Options:
- -u target uri e.g. iscsi://<ip>:<port>
- (allways exists)
- (More protocols can be defined in the future.
- The client does not interpret this string it is
- passed unchanged as received from the Server)
- -o osdname of the requested target OSD
- (Might be empty)
- (A string which denotes the OSD name, there is a
- limit of 64 chars on this string)
- -s systemid of the requested target OSD
- (Might be empty)
- (This string, if not empty is always an hex
- representation of the 20 bytes osd_system_id)
-
-blocks-layout setup
-------------------
-
-TODO: Document the setup needs of the blocks layout driver diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst new file mode 100644 index 000000000000..044be965d75e --- /dev/null +++ b/Documentation/filesystems/nfs/reexport.rst @@ -0,0 +1,117 @@ +Reexporting NFS filesystems
+===========================
+
+Overview
+--------
+
+It is possible to reexport an NFS filesystem over NFS. However, this
+feature comes with a number of limitations. Before trying it, we
+recommend some careful research to determine whether it will work for
+your purposes.
+
+A discussion of current known limitations follows.
+
+"fsid=" required, crossmnt broken
+---------------------------------
+
+We require the "fsid=" export option on any reexport of an NFS
+filesystem. You can use "uuidgen -r" to generate a unique argument.
+
+The "crossmnt" export option does not propagate "fsid=", so it will not
+allow traversing into further nfs filesystems; if you wish to export nfs
+filesystems mounted under the exported filesystem, you'll need to export
+them explicitly, assigning each its own unique "fsid=" option.
+
+Reboot recovery
+---------------
+
+The NFS protocol's normal reboot recovery mechanisms don't work for the
+case when the reexport server reboots because the source server has not
+rebooted, and so it is not in grace. Since the source server is not in
+grace, it cannot offer any guarantees that the file won't have been
+changed between the locks getting lost and any attempt to recover them.
+The same applies to delegations and any associated locks. Clients are
+not allowed to get file locks or delegations from a reexport server; any
+attempts will fail with operation not supported.
+
+Filehandle limits
+-----------------
+
+If the original server uses an X byte filehandle for a given object, the
+reexport server's filehandle for the reexported object will be X+22
+bytes, rounded up to the nearest multiple of four bytes.
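As a quick worked example of that arithmetic (using the 28-byte ext4
filehandle from the table further below as an assumed input): 28 + 22 =
50 bytes, which rounds up to 52. The same check can be done in the
shell::

    $ X=28; echo $(( (X + 22 + 3) / 4 * 4 ))
    52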
+ +The result must fit into the RFC-mandated filehandle size limits: + ++-------+-----------+ +| NFSv2 | 32 bytes | ++-------+-----------+ +| NFSv3 | 64 bytes | ++-------+-----------+ +| NFSv4 | 128 bytes | ++-------+-----------+ + +So, for example, you will only be able to reexport a filesystem over +NFSv2 if the original server gives you filehandles that fit in 10 +bytes--which is unlikely. + +In general there's no way to know the maximum filehandle size given out +by an NFS server without asking the server vendor. + +But the following table gives a few examples. The first column is the +typical length of the filehandle from a Linux server exporting the given +filesystem, the second is the length after that nfs export is reexported +by another Linux host: + ++--------+-------------------+----------------+ +| | filehandle length | after reexport | ++========+===================+================+ +| ext4: | 28 bytes | 52 bytes | ++--------+-------------------+----------------+ +| xfs: | 32 bytes | 56 bytes | ++--------+-------------------+----------------+ +| btrfs: | 40 bytes | 64 bytes | ++--------+-------------------+----------------+ + +All will therefore fit in an NFSv3 or NFSv4 filehandle after reexport, +but none are reexportable over NFSv2. + +Linux server filehandles are a bit more complicated than this, though; +for example: + + - The (non-default) "subtreecheck" export option generally + requires another 4 to 8 bytes in the filehandle. + - If you export a subdirectory of a filesystem (instead of + exporting the filesystem root), that also usually adds 4 to 8 + bytes. + - If you export over NFSv2, knfsd usually uses a shorter + filesystem identifier that saves 8 bytes. + - The root directory of an export uses a filehandle that is + shorter. + +As you can see, the 128-byte NFSv4 filehandle is large enough that +you're unlikely to have trouble using NFSv4 to reexport any filesystem +exported from a Linux server. In general, if the original server is +something that also supports NFSv3, you're *probably* OK. Re-exporting +over NFSv3 may be dicier, and reexporting over NFSv2 will probably +never work. + +For more details of Linux filehandle structure, the best reference is +the source code and comments; see in particular: + + - include/linux/exportfs.h:enum fid_type + - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new + - fs/nfsd/nfsfh.c:set_version_and_fsid_type + - fs/nfs/export.c:nfs_encode_fh + +Open DENY bits ignored +---------------------- + +NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which +allow you, for example, to open a file in a mode which forbids other +read opens or write opens. The Linux client doesn't use them, and the +server's support has always been incomplete: they are enforced only +against other NFS users, not against processes accessing the exported +filesystem locally. A reexport server will also not pass them along to +the original server, so they will not be enforced between clients of +different reexport servers. diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.rst index ebcaaee21616..339efd75016a 100644 --- a/Documentation/filesystems/nfs/rpc-cache.txt +++ b/Documentation/filesystems/nfs/rpc-cache.rst @@ -1,9 +1,14 @@ - This document gives a brief introduction to the caching +========= +RPC Cache +========= + +This document gives a brief introduction to the caching mechanisms in the sunrpc layer that is used, in particular, for NFS authentication. 
-CACHES +Caches ====== + The caching replaces the old exports table and allows for a wide variety of values to be caches. @@ -12,6 +17,7 @@ quite possibly very different in content and use. There is a corpus of common code for managing these caches. Examples of caches that are likely to be needed are: + - mapping from IP address to client name - mapping from client name and filesystem to export options - mapping from UID to list of GIDs, to work around NFS's limitation @@ -21,6 +27,7 @@ Examples of caches that are likely to be needed are: - mapping from network identify to public key for crypto authentication. The common code handles such things as: + - general cache lookup with correct locking - supporting 'NEGATIVE' as well as positive entries - allowing an EXPIRED time on cache items, and removing @@ -35,67 +42,73 @@ The common code handles such things as: Creating a Cache ---------------- -1/ A cache needs a datum to store. This is in the form of a - structure definition that must contain a - struct cache_head +- A cache needs a datum to store. This is in the form of a + structure definition that must contain a struct cache_head as an element, usually the first. It will also contain a key and some content. Each cache element is reference counted and contains expiry and update times for use in cache management. -2/ A cache needs a "cache_detail" structure that +- A cache needs a "cache_detail" structure that describes the cache. This stores the hash table, some parameters for cache management, and some operations detailing how to work with particular cache items. - The operations requires are: - struct cache_head *alloc(void) - This simply allocates appropriate memory and returns - a pointer to the cache_detail embedded within the - structure - void cache_put(struct kref *) - This is called when the last reference to an item is - dropped. The pointer passed is to the 'ref' field - in the cache_head. cache_put should release any - references create by 'cache_init' and, if CACHE_VALID - is set, any references created by cache_update. - It should then release the memory allocated by - 'alloc'. - int match(struct cache_head *orig, struct cache_head *new) - test if the keys in the two structures match. Return - 1 if they do, 0 if they don't. - void init(struct cache_head *orig, struct cache_head *new) - Set the 'key' fields in 'new' from 'orig'. This may - include taking references to shared objects. - void update(struct cache_head *orig, struct cache_head *new) - Set the 'content' fileds in 'new' from 'orig'. - int cache_show(struct seq_file *m, struct cache_detail *cd, - struct cache_head *h) - Optional. Used to provide a /proc file that lists the - contents of a cache. This should show one item, - usually on just one line. - int cache_request(struct cache_detail *cd, struct cache_head *h, - char **bpp, int *blen) - Format a request to be send to user-space for an item - to be instantiated. *bpp is a buffer of size *blen. - bpp should be moved forward over the encoded message, - and *blen should be reduced to show how much free - space remains. Return 0 on success or <0 if not - enough room or other problem. - int cache_parse(struct cache_detail *cd, char *buf, int len) - A message from user space has arrived to fill out a - cache entry. It is in 'buf' of length 'len'. - cache_parse should parse this, find the item in the - cache with sunrpc_cache_lookup, and update the item - with sunrpc_cache_update. - - -3/ A cache needs to be registered using cache_register(). 
This + + The operations are: + + struct cache_head \*alloc(void) + This simply allocates appropriate memory and returns + a pointer to the cache_detail embedded within the + structure + + void cache_put(struct kref \*) + This is called when the last reference to an item is + dropped. The pointer passed is to the 'ref' field + in the cache_head. cache_put should release any + references create by 'cache_init' and, if CACHE_VALID + is set, any references created by cache_update. + It should then release the memory allocated by + 'alloc'. + + int match(struct cache_head \*orig, struct cache_head \*new) + test if the keys in the two structures match. Return + 1 if they do, 0 if they don't. + + void init(struct cache_head \*orig, struct cache_head \*new) + Set the 'key' fields in 'new' from 'orig'. This may + include taking references to shared objects. + + void update(struct cache_head \*orig, struct cache_head \*new) + Set the 'content' fields in 'new' from 'orig'. + + int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h) + Optional. Used to provide a /proc file that lists the + contents of a cache. This should show one item, + usually on just one line. + + int cache_request(struct cache_detail \*cd, struct cache_head \*h, char \*\*bpp, int \*blen) + Format a request to be send to user-space for an item + to be instantiated. \*bpp is a buffer of size \*blen. + bpp should be moved forward over the encoded message, + and \*blen should be reduced to show how much free + space remains. Return 0 on success or <0 if not + enough room or other problem. + + int cache_parse(struct cache_detail \*cd, char \*buf, int len) + A message from user space has arrived to fill out a + cache entry. It is in 'buf' of length 'len'. + cache_parse should parse this, find the item in the + cache with sunrpc_cache_lookup_rcu, and update the item + with sunrpc_cache_update. + + +- A cache needs to be registered using cache_register(). This includes it on a list of caches that will be regularly cleaned to discard old data. Using a cache ------------- -To find a value in a cache, call sunrpc_cache_lookup passing a pointer +To find a value in a cache, call sunrpc_cache_lookup_rcu passing a pointer to the cache_head in a sample item with the 'key' fields filled in. This will be passed to ->match to identify the target entry. If no entry is found, a new entry will be create, added to the cache, and @@ -107,7 +120,7 @@ cache_check will return -ENOENT in the entry is negative or if an up call is needed but not possible, -EAGAIN if an upcall is pending, or 0 if the data is valid; -cache_check can be passed a "struct cache_req *". This structure is +cache_check can be passed a "struct cache_req\*". This structure is typically embedded in the actual request and can be used to create a deferred copy of the request (struct cache_deferred_req). This is done when the found cache item is not uptodate, but the is reason to @@ -116,7 +129,7 @@ item does become valid, the deferred copy of the request will be revisited (->revisit). It is expected that this method will reschedule the request for processing. -The value returned by sunrpc_cache_lookup can also be passed to +The value returned by sunrpc_cache_lookup_rcu can also be passed to sunrpc_cache_update to set the content for the item. A second item is passed which should hold the content. If the item found by _lookup has valid data, then it is discarded and a new item is created. 
This @@ -139,9 +152,11 @@ The 'channel' works a bit like a datagram socket. Each 'write' is passed as a whole to the cache for parsing and interpretation. Each cache can treat the write requests differently, but it is expected that a message written will contain: + - a key - an expiry time - a content. + with the intention that an item in the cache with the give key should be create or updated to have the given content, and the expiry time should be set on that item. @@ -156,7 +171,8 @@ If there are no more requests to return, read will return EOF, but a select or poll for read will block waiting for another request to be added. -Thus a user-space helper is likely to: +Thus a user-space helper is likely to:: + open the channel. select for readable read a request @@ -175,12 +191,13 @@ Each cache should also define a "cache_request" method which takes a cache item and encodes a request into the buffer provided. -Note: If a cache has no active readers on the channel, and has had not -active readers for more than 60 seconds, further requests will not be -added to the channel but instead all lookups that do not find a valid -entry will fail. This is partly for backward compatibility: The -previous nfs exports table was deemed to be authoritative and a -failed lookup meant a definite 'no'. +.. note:: + If a cache has no active readers on the channel, and has had not + active readers for more than 60 seconds, further requests will not be + added to the channel but instead all lookups that do not find a valid + entry will fail. This is partly for backward compatibility: The + previous nfs exports table was deemed to be authoritative and a + failed lookup meant a definite 'no'. request/response format ----------------------- @@ -193,10 +210,11 @@ with precisely one newline character which should be at the end. Fields within the record should be separated by spaces, normally one. If spaces, newlines, or nul characters are needed in a field they much be quoted. two mechanisms are available: -1/ If a field begins '\x' then it must contain an even number of + +- If a field begins '\x' then it must contain an even number of hex digits, and pairs of these digits provide the bytes in the field. -2/ otherwise a \ in the field must be followed by 3 octal digits +- otherwise a \ in the field must be followed by 3 octal digits which give the code for a byte. Other characters are treated as them selves. At the very least, space, newline, nul, and '\' must be quoted in this way. diff --git a/Documentation/filesystems/nfs/rpc-server-gss.txt b/Documentation/filesystems/nfs/rpc-server-gss.rst index 716f4be8e8b3..5c1a1c58fc27 100644 --- a/Documentation/filesystems/nfs/rpc-server-gss.txt +++ b/Documentation/filesystems/nfs/rpc-server-gss.rst @@ -1,4 +1,4 @@ - +========================================= rpcsec_gss support for kernel RPC servers ========================================= @@ -9,14 +9,16 @@ NFSv4.1 and higher don't require the client to act as a server for the purposes of authentication.) RPCGSS is specified in a few IETF documents: - - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt - - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt -and there is a 3rd version being proposed: - - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt - (At draft n. 
02 at the time of writing) + + - RFC2203 v1: https://tools.ietf.org/rfc/rfc2203.txt + - RFC5403 v2: https://tools.ietf.org/rfc/rfc5403.txt + +There is a third version that we don't currently implement: + + - RFC7861 v3: https://tools.ietf.org/rfc/rfc7861.txt Background ----------- +========== The RPCGSS Authentication method describes a way to perform GSSAPI Authentication for NFS. Although GSSAPI is itself completely mechanism @@ -27,8 +29,9 @@ The Linux kernel, at the moment, supports only the KRB5 mechanism, and depends on GSSAPI extensions that are KRB5 specific. GSSAPI is a complex library, and implementing it completely in kernel is -unwarranted. However GSSAPI operations are fundementally separable in 2 +unwarranted. However GSSAPI operations are fundamentally separable in 2 parts: + - initial context establishment - integrity/privacy protection (signing and encrypting of individual packets) @@ -41,7 +44,7 @@ kernel, but leave the initial context establishment to userspace. We need upcalls to request userspace to perform context establishment. NFS Server Legacy Upcall Mechanism ----------------------------------- +================================== The classic upcall mechanism uses a custom text based upcall mechanism to talk to a custom daemon called rpc.svcgssd that is provide by the @@ -57,26 +60,25 @@ the Kerberos tickets, that needs to be sent through the GSS layer in order to perform context establishment. B) It does not properly handle creds where the user is member of more -than a few housand groups (the current hard limit in the kernel is 65K +than a few thousand groups (the current hard limit in the kernel is 65K groups) due to limitation on the size of the buffer that can be send back to the kernel (4KiB). NFS Server New RPC Upcall Mechanism ------------------------------------ +=================================== The newer upcall mechanism uses RPC over a unix socket to a daemon called gss-proxy, implemented by a userspace program called Gssproxy. -The gss_proxy RPC protocol is currently documented here: - - https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation +The gss_proxy RPC protocol is currently documented `here +<https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation>`_. This upcall mechanism uses the kernel rpc client and connects to the gssproxy userspace program over a regular unix socket. The gssproxy protocol does not suffer from the size limitations of the legacy protocol. Negotiating Upcall Mechanisms ------------------------------ +============================= To provide backward compatibility, the kernel defaults to using the legacy mechanism. To switch to the new mechanism, gss-proxy must bind |
