Diffstat (limited to 'Documentation/filesystems')
 Documentation/filesystems/9p.rst | 68
 Documentation/filesystems/afs.rst | 2
 Documentation/filesystems/api-summary.rst | 3
 Documentation/filesystems/autofs.rst | 6
 Documentation/filesystems/befs.rst | 4
 Documentation/filesystems/btrfs.rst | 1
 Documentation/filesystems/buffer.rst | 12
 Documentation/filesystems/caching/cachefiles.rst | 4
 Documentation/filesystems/caching/fscache.rst | 8
 Documentation/filesystems/caching/netfs-api.rst | 6
 Documentation/filesystems/ceph.rst | 25
 Documentation/filesystems/coda.rst | 2
 Documentation/filesystems/configfs.rst | 2
 Documentation/filesystems/dax.rst | 4
 Documentation/filesystems/debugfs.rst | 31
 Documentation/filesystems/devpts.rst | 4
 Documentation/filesystems/directory-locking.rst | 349
 Documentation/filesystems/dlmfs.rst | 2
 Documentation/filesystems/efivarfs.rst | 2
 Documentation/filesystems/erofs.rst | 51
 Documentation/filesystems/ext4/atomic_writes.rst | 225
 Documentation/filesystems/ext4/bitmaps.rst | 7
 Documentation/filesystems/ext4/blockgroup.rst | 11
 Documentation/filesystems/ext4/directory.rst | 63
 Documentation/filesystems/ext4/dynamic.rst | 10
 Documentation/filesystems/ext4/globals.rst | 15
 Documentation/filesystems/ext4/index.rst | 2
 Documentation/filesystems/ext4/inode_table.rst | 9
 Documentation/filesystems/ext4/inodes.rst | 2
 Documentation/filesystems/ext4/overview.rst | 21
 Documentation/filesystems/ext4/super.rst | 26
 Documentation/filesystems/f2fs.rst | 253
 Documentation/filesystems/fiemap.rst | 49
 Documentation/filesystems/files.rst | 53
 Documentation/filesystems/fscrypt.rst | 527
 Documentation/filesystems/fsverity.rst | 50
 Documentation/filesystems/fuse/fuse-io-uring.rst | 99
 Documentation/filesystems/{fuse-io.rst => fuse/fuse-io.rst} | 5
 Documentation/filesystems/fuse/fuse-passthrough.rst | 133
 Documentation/filesystems/{fuse.rst => fuse/fuse.rst} | 20
 Documentation/filesystems/fuse/index.rst | 14
 Documentation/filesystems/{gfs2-glocks.rst => gfs2/glocks.rst} | 62
 Documentation/filesystems/{gfs2.rst => gfs2/index.rst} | 12
 Documentation/filesystems/{gfs2-uevents.rst => gfs2/uevents.rst} | 0
 Documentation/filesystems/hpfs.rst | 2
 Documentation/filesystems/idmappings.rst | 40
 Documentation/filesystems/index.rst | 17
 Documentation/filesystems/iomap/design.rst | 459
 Documentation/filesystems/iomap/index.rst | 13
 Documentation/filesystems/iomap/operations.rst | 785
 Documentation/filesystems/iomap/porting.rst | 120
 Documentation/filesystems/journalling.rst | 10
 Documentation/filesystems/locking.rst | 157
 Documentation/filesystems/mount_api.rst | 38
 Documentation/filesystems/multigrain-ts.rst | 125
 Documentation/filesystems/netfs_library.rst | 1031
 Documentation/filesystems/nfs/client-identifier.rst | 2
 Documentation/filesystems/nfs/exporting.rst | 33
 Documentation/filesystems/nfs/index.rst | 2
 Documentation/filesystems/nfs/localio.rst | 357
 Documentation/filesystems/nfs/nfsd-io-modes.rst | 153
 Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst | 547
 Documentation/filesystems/nfs/reexport.rst | 10
 Documentation/filesystems/nfs/rpc-cache.rst | 2
 Documentation/filesystems/nfs/rpc-server-gss.rst | 2
 Documentation/filesystems/nilfs2.rst | 2
 Documentation/filesystems/ntfs.rst | 466
 Documentation/filesystems/ntfs3.rst | 2
 Documentation/filesystems/ocfs2-online-filecheck.rst | 20
 Documentation/filesystems/orangefs.rst | 2
 Documentation/filesystems/overlayfs.rst | 305
 Documentation/filesystems/path-lookup.rst | 2
 Documentation/filesystems/path-lookup.txt | 2
 Documentation/filesystems/porting.rst | 437
 Documentation/filesystems/proc.rst | 296
 Documentation/filesystems/propagate_umount.txt | 484
 Documentation/filesystems/qnx6.rst | 2
 Documentation/filesystems/ramfs-rootfs-initramfs.rst | 14
 Documentation/filesystems/relay.rst | 36
 Documentation/filesystems/resctrl.rst | 1932
 Documentation/filesystems/seq_file.rst | 4
 Documentation/filesystems/sharedsubtree.rst | 1347
 Documentation/filesystems/smb/index.rst | 1
 Documentation/filesystems/smb/ksmbd.rst | 35
 Documentation/filesystems/smb/smbdirect.rst | 103
 Documentation/filesystems/squashfs.rst | 72
 Documentation/filesystems/sysfs.rst | 27
 Documentation/filesystems/sysv-fs.rst | 264
 Documentation/filesystems/tmpfs.rst | 62
 Documentation/filesystems/ubifs-authentication.rst | 4
 Documentation/filesystems/vfat.rst | 2
 Documentation/filesystems/vfs.rst | 222
 Documentation/filesystems/xfs/index.rst | 14
 Documentation/filesystems/{xfs-delayed-logging-design.rst => xfs/xfs-delayed-logging-design.rst} | 2
 Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst | 194
 Documentation/filesystems/{xfs-online-fsck-design.rst => xfs/xfs-online-fsck-design.rst} | 912
 Documentation/filesystems/{xfs-self-describing-metadata.rst => xfs/xfs-self-describing-metadata.rst} | 0
 98 files changed, 10191 insertions(+), 3237 deletions(-)
diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst
index 1b5f0cc3e4ca..be3504ca034a 100644
--- a/Documentation/filesystems/9p.rst
+++ b/Documentation/filesystems/9p.rst
@@ -31,7 +31,7 @@ Other applications are described in the following papers:
* PROSE I/O: Using 9p to enable Application Partitions
http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
* VirtFS: A Virtualization Aware File System pass-through
- http://goo.gl/3WPDg
+ https://kernel.org/doc/ols/2010/ols2010-pages-109-120.pdf
Usage
=====
@@ -40,7 +40,7 @@ For remote file server::
mount -t 9p 10.10.1.2 /mnt/9
-For Plan 9 From User Space applications (http://swtch.com/plan9)::
+For Plan 9 From User Space applications (https://9fans.github.io/plan9port/)::
mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
@@ -48,11 +48,66 @@ For server running on QEMU host with virtio transport::
mount -t 9p -o trans=virtio <mount_tag> /mnt/9
-where mount_tag is the tag associated by the server to each of the exported
+where mount_tag is the tag generated by the server for each of the exported
mount points. Each 9P export is seen by the client as a virtio device with an
associated "mount_tag" property. Available mount tags can be
seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
+USBG Usage
+==========
+
+To mount a 9p FS on a USB Host accessible via the gadget at runtime::
+
+ mount -t 9p -o trans=usbg,aname=/path/to/fs <device> /mnt/9
+
+To mount a 9p FS on a USB Host accessible via the gadget as root filesystem::
+
+ root=<device> rootfstype=9p rootflags=trans=usbg,cache=loose,uname=root,access=0,dfltuid=0,dfltgid=0,aname=/path/to/rootfs
+
+where <device> is the tag associated with the usb gadget transport.
+It is defined by the configfs instance name.
+
+USBG Example
+============
+
+The USB host exports a filesystem, while the gadget on the USB device
+side makes it mountable.
+
+Diod (9pfs server) and the forwarder are on the development host, where
+the root filesystem is actually stored. The gadget is initialized during
+boot (or later) on the embedded board. Then the forwarder will find it
+on the USB bus and start forwarding requests.
+
+In this case the 9p requests come from the device and are handled by the
+host. The reason is that USB device ports are normally not available on
+PCs, so a connection in the other direction would not work.
+
+When using the usbg transport, there is for now no native usb host
+service capable of handling the requests from the gadget driver. For
+this we have to use the extra python tool p9_fwd.py from tools/usb.
+
+Just start a 9pfs-capable network server such as diod or nfs-ganesha, e.g.::
+
+ $ diod -f -n -d 0 -S -l 0.0.0.0:9999 -e $PWD
+
+Optionally scan your bus if there is more than one usbg gadget, to find their paths::
+
+ $ python $kernel_dir/tools/usb/p9_fwd.py list
+
+ Bus | Addr | Manufacturer | Product | ID | Path
+ --- | ---- | ---------------- | ---------------- | --------- | ----
+ 2 | 67 | unknown | unknown | 1d6b:0109 | 2-1.1.2
+ 2 | 68 | unknown | unknown | 1d6b:0109 | 2-1.1.3
+
+Then start the python transport::
+
+ $ python $kernel_dir/tools/usb/p9_fwd.py --path 2-1.1.2 connect -p 9999
+
+After that the gadget driver can be used as described above.
+
+One use-case is to use it as an alternative to NFS root booting during
+the development of embedded Linux devices.
+
Options
=======
@@ -68,6 +123,7 @@ Options
virtio connect to the next virtio channel available
(from QEMU with trans_virtio module)
rdma connect to a specified RDMA channel
+ usbg connect to a specified usb gadget channel
======== ============================================
uname=name user name to attempt mount as on the remote server. The
@@ -79,7 +135,7 @@ Options
cache=mode specifies a caching policy. By default, no caches are used.
The mode can be specified as a bitmask or by using one of the
- prexisting common 'shortcuts'.
+ preexisting common 'shortcuts'.
The bitmask is described below: (unspecified bits are reserved)
========== ====================================================
@@ -109,8 +165,8 @@ Options
do not necessarily validate cached values on the server. In other
words changes on the server are not guaranteed to be reflected
on the client system. Only use this mode of operation if you
- have an exclusive mount and the server will modify the filesystem
- underneath you.
+ have an exclusive mount and the server will not modify the
+ filesystem underneath you.
debug=n specifies debug level. The debug level is a bitmask.
diff --git a/Documentation/filesystems/afs.rst b/Documentation/filesystems/afs.rst
index ca062a7f8ee2..f15ba388bbde 100644
--- a/Documentation/filesystems/afs.rst
+++ b/Documentation/filesystems/afs.rst
@@ -44,7 +44,7 @@ options::
CONFIG_AF_RXRPC - The RxRPC protocol transport
CONFIG_RXKAD - The RxRPC Kerberos security handler
- CONFIG_AFS - The AFS filesystem
+ CONFIG_AFS_FS - The AFS filesystem
Additionally, the following can be turned on to aid debugging::
diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst
index 98db2ea5fa12..cc5cc7f3fbd8 100644
--- a/Documentation/filesystems/api-summary.rst
+++ b/Documentation/filesystems/api-summary.rst
@@ -56,9 +56,6 @@ Other Functions
.. kernel-doc:: fs/namei.c
:export:
-.. kernel-doc:: fs/buffer.c
- :export:
-
.. kernel-doc:: block/bio.c
:export:
diff --git a/Documentation/filesystems/autofs.rst b/Documentation/filesystems/autofs.rst
index 3b6e38e646cd..5eb02394fcc3 100644
--- a/Documentation/filesystems/autofs.rst
+++ b/Documentation/filesystems/autofs.rst
@@ -18,7 +18,7 @@ key advantages:
2. The names and locations of filesystems can be stored in
a remote database and can change at any time. The content
- in that data base at the time of access will be used to provide
+ in that database at the time of access will be used to provide
a target for the access. The interpretation of names in the
filesystem can even be programmatic rather than database-backed,
allowing wildcards for example, and can vary based on the user who
@@ -423,7 +423,7 @@ The available ioctl commands are:
and objects are expired if they are not in use.
**AUTOFS_EXP_FORCED** causes the in use status to be ignored
- and objects are expired ieven if they are in use. This assumes
+ and objects are expired even if they are in use. This assumes
that the daemon has requested this because it is capable of
performing the umount.
@@ -442,7 +442,7 @@ which can be used to communicate directly with the autofs filesystem.
It requires CAP_SYS_ADMIN for access.
The 'ioctl's that can be used on this device are described in a separate
-document `autofs-mount-control.txt`, and are summarised briefly here.
+document `autofs-mount-control.rst`, and are summarised briefly here.
Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure::
struct autofs_dev_ioctl {
diff --git a/Documentation/filesystems/befs.rst b/Documentation/filesystems/befs.rst
index 79f9740d76ff..a22f603b2938 100644
--- a/Documentation/filesystems/befs.rst
+++ b/Documentation/filesystems/befs.rst
@@ -106,8 +106,8 @@ iocharset=xxx Use xxx as the name of the NLS translation table.
debug The driver will output debugging information to the syslog.
============= ===========================================================
-How to Get Lastest Version
-==========================
+How to Get Latest Version
+=========================
The latest version is currently available at:
<http://befs-driver.sourceforge.net/>
diff --git a/Documentation/filesystems/btrfs.rst b/Documentation/filesystems/btrfs.rst
index 992eddb0e11b..a81db8f54d68 100644
--- a/Documentation/filesystems/btrfs.rst
+++ b/Documentation/filesystems/btrfs.rst
@@ -37,7 +37,6 @@ For more information please refer to the documentation site or wiki
https://btrfs.readthedocs.io
- https://btrfs.wiki.kernel.org
that maintains information about administration tasks, frequently asked
questions, use cases, mount options, comprehensible changelogs, features,
diff --git a/Documentation/filesystems/buffer.rst b/Documentation/filesystems/buffer.rst
new file mode 100644
index 000000000000..ae24faf68efa
--- /dev/null
+++ b/Documentation/filesystems/buffer.rst
@@ -0,0 +1,12 @@
+Buffer Heads
+============
+
+Linux uses buffer heads to maintain state about individual filesystem blocks.
+Buffer heads are deprecated and new filesystems should use iomap instead.
+
+Functions
+---------
+
+.. kernel-doc:: include/linux/buffer_head.h
+.. kernel-doc:: fs/buffer.c
+ :export:
diff --git a/Documentation/filesystems/caching/cachefiles.rst b/Documentation/filesystems/caching/cachefiles.rst
index fc7abf712315..b3ccc782cb3b 100644
--- a/Documentation/filesystems/caching/cachefiles.rst
+++ b/Documentation/filesystems/caching/cachefiles.rst
@@ -115,7 +115,7 @@ set up cache ready for use. The following script commands are available:
This mask can also be set through sysfs, eg::
- echo 5 >/sys/modules/cachefiles/parameters/debug
+ echo 5 > /sys/module/cachefiles/parameters/debug
Starting the Cache
@@ -416,7 +416,7 @@ process is the target of an operation by some other process (SIGKILL for
example).
The subjective security holds the active security properties of a process, and
-may be overridden. This is not seen externally, and is used whan a process
+may be overridden. This is not seen externally, and is used when a process
acts upon another object, for example SIGKILLing another process or opening a
file.
diff --git a/Documentation/filesystems/caching/fscache.rst b/Documentation/filesystems/caching/fscache.rst
index a74d7b052dc1..de1f32526cc1 100644
--- a/Documentation/filesystems/caching/fscache.rst
+++ b/Documentation/filesystems/caching/fscache.rst
@@ -318,10 +318,10 @@ where the columns are:
Debugging
=========
-If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
-debugging enabled by adjusting the value in::
+If CONFIG_NETFS_DEBUG is enabled, the FS-Cache facility and NETFS support can
+have runtime debugging enabled by adjusting the value in::
- /sys/module/fscache/parameters/debug
+ /sys/module/netfs/parameters/debug
This is a bitmask of debugging streams to enable:
@@ -343,6 +343,6 @@ This is a bitmask of debugging streams to enable:
The appropriate set of values should be OR'd together and the result written to
the control file. For example::
- echo $((1|8|512)) >/sys/module/fscache/parameters/debug
+ echo $((1|8|512)) >/sys/module/netfs/parameters/debug
will turn on all function entry debugging.
diff --git a/Documentation/filesystems/caching/netfs-api.rst b/Documentation/filesystems/caching/netfs-api.rst
index 1d18e9def183..665b27f1556e 100644
--- a/Documentation/filesystems/caching/netfs-api.rst
+++ b/Documentation/filesystems/caching/netfs-api.rst
@@ -59,7 +59,7 @@ A filesystem would typically have a volume cookie for each superblock.
The filesystem then acquires a cookie for each file within that volume using an
object key. Object keys are binary blobs and only need to be unique within
-their parent volume. The cache backend is reponsible for rendering the binary
+their parent volume. The cache backend is responsible for rendering the binary
blob into something it can use and may employ hash tables, trees or whatever to
improve its ability to find an object. This is transparent to the network
filesystem.
@@ -91,7 +91,7 @@ actually required and it can use the fscache I/O API directly.
Volume Registration
===================
-The first step for a network filsystem is to acquire a volume cookie for the
+The first step for a network filesystem is to acquire a volume cookie for the
volume it wants to access::
struct fscache_volume *
@@ -119,7 +119,7 @@ is provided. If the coherency data doesn't match, the entire cache volume will
be invalidated.
This function can return errors such as EBUSY if the volume key is already in
-use by an acquired volume or ENOMEM if an allocation failure occured. It may
+use by an acquired volume or ENOMEM if an allocation failure occurred. It may
also return a NULL volume cookie if fscache is not enabled. It is safe to
pass a NULL cookie to any function that takes a volume cookie. This will
cause that function to do nothing.
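
For illustration, here is a hedged sketch of such a call at mount time
(the volume key format and error handling are ours, not from this
document)::

	struct fscache_volume *volume;

	/* Key such as "<fs-type>,<server>,<volume>" - illustrative only. */
	volume = fscache_acquire_volume("myfs,example.org,vol0",
					NULL,		/* default cache */
					NULL, 0);	/* no coherency data */
	if (IS_ERR(volume))
		return PTR_ERR(volume);	/* e.g. -EBUSY: key already in use */
	/* A NULL cookie (fscache disabled) is fine to pass around. */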
diff --git a/Documentation/filesystems/ceph.rst b/Documentation/filesystems/ceph.rst
index 76ce938e7024..6d2276a87a5a 100644
--- a/Documentation/filesystems/ceph.rst
+++ b/Documentation/filesystems/ceph.rst
@@ -57,12 +57,25 @@ a snapshot on any subdirectory (and its nested contents) in the
system. Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.
-Ceph also provides some recursive accounting on directories for nested
-files and bytes. That is, a 'getfattr -d foo' on any directory in the
-system will reveal the total number of nested regular files and
-subdirectories, and a summation of all nested file sizes. This makes
-the identification of large disk space consumers relatively quick, as
-no 'du' or similar recursive scan of the file system is required.
+Snapshot names have two limitations:
+
+* They can not start with an underscore ('_'), as these names are reserved
+ for internal usage by the MDS.
+* They can not exceed 240 characters in size. This is because the MDS makes
+ use of long snapshot names internally, which follow the format:
+ `_<SNAPSHOT-NAME>_<INODE-NUMBER>`. Since filenames in general can't have
+ more than 255 characters, and `<node-id>` takes 13 characters, the long
+ snapshot names can take as much as 255 - 1 - 1 - 13 = 240.
+
+Ceph also provides some recursive accounting on directories for nested files
+and bytes. You can run the commands::
+
+ getfattr -n ceph.dir.rfiles /some/dir
+ getfattr -n ceph.dir.rbytes /some/dir
+
+to get the total number of nested files and their combined size, respectively.
+This makes the identification of large disk space consumers relatively quick,
+as no 'du' or similar recursive scan of the file system is required.
Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
diff --git a/Documentation/filesystems/coda.rst b/Documentation/filesystems/coda.rst
index bdde7e4e010b..0db3c83a50e5 100644
--- a/Documentation/filesystems/coda.rst
+++ b/Documentation/filesystems/coda.rst
@@ -141,7 +141,7 @@ kernel support.
a process P which accessing a Coda file. It makes a system call which
traps to the OS kernel. Examples of such calls trapping to the kernel
are ``read``, ``write``, ``open``, ``close``, ``create``, ``mkdir``,
- ``rmdir``, ``chmod`` in a Unix ontext. Similar calls exist in the Win32
+ ``rmdir``, ``chmod`` in a Unix context. Similar calls exist in the Win32
environment, and are named ``CreateFile``.
Generally the operating system handles the request in a virtual
diff --git a/Documentation/filesystems/configfs.rst b/Documentation/filesystems/configfs.rst
index 8c9342ed6d25..ac22138de6a4 100644
--- a/Documentation/filesystems/configfs.rst
+++ b/Documentation/filesystems/configfs.rst
@@ -253,7 +253,7 @@ to be used.
If binary attribute is readable and the config_item provides a
ct_item_ops->read_bin_attribute() method, that method will be called
whenever userspace asks for a read(2) on the attribute. The converse
-will happen for write(2). The reads/writes are bufferred so only a
+will happen for write(2). The reads/writes are buffered so only a
single read/write will occur; the attribute need not concern itself
with it.
diff --git a/Documentation/filesystems/dax.rst b/Documentation/filesystems/dax.rst
index c04609d8ee24..5b283f3d1eb1 100644
--- a/Documentation/filesystems/dax.rst
+++ b/Documentation/filesystems/dax.rst
@@ -206,8 +206,6 @@ stall the CPU for an extended period, you should also not attempt to
implement direct_access.
These block devices may be used for inspiration:
-- brd: RAM backed block device driver
-- dcssblk: s390 dcss block device driver
- pmem: NVDIMM persistent memory driver
@@ -291,7 +289,7 @@ The DAX code does not work correctly on architectures which have virtually
mapped caches such as ARM, MIPS and SPARC.
Calling :c:func:`get_user_pages()` on a range of user memory that has been
-mmaped from a `DAX` file will fail when there are no 'struct page' to describe
+mmapped from a `DAX` file will fail when there are no 'struct page' to describe
those pages. This problem has been addressed in some device drivers
by adding optional struct page support for pages under the control of
the driver (see `CONFIG_NVDIMM_PFN` in ``drivers/nvdimm`` for an example of
diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst
index dc35da8b8792..55f807293924 100644
--- a/Documentation/filesystems/debugfs.rst
+++ b/Documentation/filesystems/debugfs.rst
@@ -211,18 +211,16 @@ seq_file content.
There are a couple of other directory-oriented helper functions::
- struct dentry *debugfs_rename(struct dentry *old_dir,
- struct dentry *old_dentry,
- struct dentry *new_dir,
- const char *new_name);
+ int debugfs_change_name(struct dentry *dentry,
+ const char *fmt, ...);
struct dentry *debugfs_create_symlink(const char *name,
struct dentry *parent,
const char *target);
-A call to debugfs_rename() will give a new name to an existing debugfs
-file, possibly in a different directory. The new_name must not exist prior
-to the call; the return value is old_dentry with updated information.
+A call to debugfs_change_name() will give a new name to an existing debugfs
+file, always in the same directory. The new_name must not exist prior
+to the call; the return value is 0 on success and -E... on failure.
Symbolic links can be created with debugfs_create_symlink().
There is one important thing that all debugfs users must take into account:
@@ -231,22 +229,15 @@ module is unloaded without explicitly removing debugfs entries, the result
will be a lot of stale pointers and no end of highly antisocial behavior.
So all debugfs users - at least those which can be built as modules - must
be prepared to remove all files and directories they create there. A file
-can be removed with::
+or directory can be removed with::
void debugfs_remove(struct dentry *dentry);
The dentry value can be NULL or an error value, in which case nothing will
-be removed.
-
-Once upon a time, debugfs users were required to remember the dentry
-pointer for every debugfs file they created so that all files could be
-cleaned up. We live in more civilized times now, though, and debugfs users
-can call::
-
- void debugfs_remove_recursive(struct dentry *dentry);
-
-If this function is passed a pointer for the dentry corresponding to the
-top-level directory, the entire hierarchy below that directory will be
-removed.
+be removed. Note that this function will recursively remove all files and
+directories underneath it. Previously, debugfs_remove_recursive() was used
+to perform that task, but this function is now just an alias to
+debugfs_remove(). debugfs_remove_recursive() should be considered
+deprecated.
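
As a hedged illustration (the names here are ours, not from this document),
a module might tie creation, renaming and removal together like so::

	#include <linux/debugfs.h>
	#include <linux/module.h>

	static struct dentry *demo_dir;
	static u32 demo_counter;

	static int __init demo_init(void)
	{
		demo_dir = debugfs_create_dir("demo", NULL);
		debugfs_create_u32("counter", 0644, demo_dir, &demo_counter);

		/* Rename in place; returns 0 on success, -E... on failure. */
		if (debugfs_change_name(demo_dir, "demo-%d", 2))
			pr_warn("debugfs rename failed\n");
		return 0;
	}

	static void __exit demo_exit(void)
	{
		debugfs_remove(demo_dir);	/* removes "counter" too */
	}

	module_init(demo_init);
	module_exit(demo_exit);
	MODULE_LICENSE("GPL");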
.. [1] http://lwn.net/Articles/309298/
diff --git a/Documentation/filesystems/devpts.rst b/Documentation/filesystems/devpts.rst
index a03248ddfb4c..b6324ab1960d 100644
--- a/Documentation/filesystems/devpts.rst
+++ b/Documentation/filesystems/devpts.rst
@@ -5,8 +5,8 @@ The Devpts Filesystem
=====================
Each mount of the devpts filesystem is now distinct such that ptys
-and their indicies allocated in one mount are independent from ptys
-and their indicies in all other mounts.
+and their indices allocated in one mount are independent from ptys
+and their indices in all other mounts.
All mounts of the devpts filesystem now create a ``/dev/pts/ptmx`` node
with permissions ``0000``.
diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst
index dccd61c7c5c3..6fdf0b02df43 100644
--- a/Documentation/filesystems/directory-locking.rst
+++ b/Documentation/filesystems/directory-locking.rst
@@ -11,129 +11,268 @@ When taking the i_rwsem on multiple non-directory objects, we
always acquire the locks in order by increasing address. We'll call
that "inode pointer" order in the following.
-For our purposes all operations fall in 5 classes:
-1) read access. Locking rules: caller locks directory we are accessing.
-The lock is taken shared.
+Primitives
+==========
-2) object creation. Locking rules: same as above, but the lock is taken
-exclusive.
+For our purposes all operations fall in 6 classes:
-3) object removal. Locking rules: caller locks parent, finds victim,
-locks victim and calls the method. Locks are exclusive.
+1. read access. Locking rules:
-4) rename() that is _not_ cross-directory. Locking rules: caller locks the
-parent and finds source and target. We lock both (provided they exist). If we
-need to lock two inodes of different type (dir vs non-dir), we lock directory
-first. If we need to lock two inodes of the same type, lock them in inode
-pointer order. Then call the method. All locks are exclusive.
-NB: we might get away with locking the source (and target in exchange
-case) shared.
+ * lock the directory we are accessing (shared)
-5) link creation. Locking rules:
+2. object creation. Locking rules:
- * lock parent
- * check that source is not a directory
- * lock source
- * call the method.
+ * lock the directory we are accessing (exclusive)
-All locks are exclusive.
+3. object removal. Locking rules:
-6) cross-directory rename. The trickiest in the whole bunch. Locking
-rules:
+ * lock the parent (exclusive)
+ * find the victim
+ * lock the victim (exclusive)
- * lock the filesystem
- * lock parents in "ancestors first" order. If one is not ancestor of
- the other, lock them in inode pointer order.
- * find source and target.
- * if old parent is equal to or is a descendent of target
- fail with -ENOTEMPTY
- * if new parent is equal to or is a descendent of source
- fail with -ELOOP
- * Lock both the source and the target provided they exist. If we
- need to lock two inodes of different type (dir vs non-dir), we lock
- the directory first. If we need to lock two inodes of the same type,
- lock them in inode pointer order.
- * call the method.
-
-All ->i_rwsem are taken exclusive. Again, we might get away with locking
-the source (and target in exchange case) shared.
-
-The rules above obviously guarantee that all directories that are going to be
-read, modified or removed by method will be locked by caller.
+4. link creation. Locking rules:
+
+ * lock the parent (exclusive)
+ * check that the source is not a directory
+ * lock the source (exclusive; probably could be weakened to shared)
+
+5. rename that is _not_ cross-directory. Locking rules:
+
+ * lock the parent (exclusive)
+ * find the source and target
+ * decide which of the source and target need to be locked.
+ The source needs to be locked if it's a non-directory; the target, if it's
+ a non-directory or about to be removed.
+ * take the locks that need to be taken (exclusive), in inode pointer order
+ if we need to take both (that can happen only when both source and target
+ are non-directories - the source because it wouldn't need to be locked
+ otherwise, and the target because mixing directory and non-directory is
+ allowed only with RENAME_EXCHANGE, and that won't be removing the target).
+
+6. cross-directory rename. The trickiest in the whole bunch. Locking rules:
+
+ * lock the filesystem
+ * if the parents don't have a common ancestor, fail the operation.
+ * lock the parents in "ancestors first" order (exclusive). If neither is an
+ ancestor of the other, lock the parent of source first.
+ * find the source and target.
+ * verify that the source is not a descendent of the target and
+ target is not a descendent of source; fail the operation otherwise.
+ * lock the subdirectories involved (exclusive), source before target.
+ * lock the non-directories involved (exclusive), in inode pointer order.
+
+The rules above obviously guarantee that all directories that are going
+to be read, modified or removed by method will be locked by the caller.
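
As a hedged sketch (in the spirit of the kernel's lock_two_nondirectories()
helper, not a copy of it), the "inode pointer order" rule for two
non-directories looks like::

	#include <linux/fs.h>

	static void lock_two_nondirs_sketch(struct inode *a, struct inode *b)
	{
		if (a > b)
			swap(a, b);		/* enforce pointer order */
		inode_lock(a);
		if (b != a)
			inode_lock_nested(b, I_MUTEX_NONDIR2);
	}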
+
+
+Splicing
+========
+
+There is one more thing to consider - splicing. It's not an operation
+in its own right; it may happen as part of lookup. We speak of the
+operations on directory trees, but we obviously do not have the full
+picture of those - especially for network filesystems. What we have
+is a bunch of subtrees visible in dcache and locking happens on those.
+Trees grow as we do operations; memory pressure prunes them. Normally
+that's not a problem, but there is a nasty twist - what should we do
+when one growing tree reaches the root of another? That can happen in
+several scenarios, starting from "somebody mounted two nested subtrees
+from the same NFS4 server and doing lookups in one of them has reached
+the root of another"; there's also open-by-fhandle stuff, and there's a
+possibility that a directory we see in one place gets moved by the server
+to another, and we run into it when we do a lookup.
+
+For a lot of reasons we want to have the same directory present in dcache
+only once. Multiple aliases are not allowed. So when lookup runs into
+a subdirectory that already has an alias, something needs to be done with
+dcache trees. Lookup is already holding the parent locked. If the alias is
+the root of a separate tree, it gets attached to the directory we are doing a
+lookup in, under the name we'd been looking for. If the alias is already
+a child of the directory we are looking in, it changes name to the one
+we'd been looking for. No extra locking is involved in these two cases.
+However, if it's a child of some other directory, things get trickier.
+First of all, we verify that it is *not* an ancestor of our directory
+and fail the lookup if it is. Then we try to lock the filesystem and the
+current parent of the alias. If either trylock fails, we fail the lookup.
+If trylocks succeed, we detach the alias from its current parent and
+attach to our directory, under the name we are looking for.
+
+Note that splicing does *not* involve any modification of the filesystem;
+all we change is the view in dcache. Moreover, holding a directory locked
+exclusive prevents such changes involving its children and holding the
+filesystem lock prevents any changes of tree topology, other than having a
+root of one tree becoming a child of directory in another. In particular,
+if two dentries have been found to have a common ancestor after taking
+the filesystem lock, their relationship will remain unchanged until
+the lock is dropped. So from the directory operations' point of view
+splicing is almost irrelevant - the only place where it matters is one
+step in cross-directory renames; we need to be careful when checking if
+parents have a common ancestor.
+
+
+Multiple-filesystem stuff
+=========================
+
+For some filesystems a method can involve a directory operation on
+another filesystem; it may be ecryptfs doing an operation in the underlying
+filesystem, overlayfs doing something to the layers, network filesystem
+using a local one as a cache, etc. In all such cases the operations
+on other filesystems must follow the same locking rules. Moreover, "a
+directory operation on this filesystem might involve directory operations
+on that filesystem" should be an asymmetric relation (or, if you will,
+it should be possible to rank the filesystems so that directory operation
+on a filesystem could trigger directory operations only on higher-ranked
+ones - in these terms overlayfs ranks lower than its layers, network
+filesystem ranks lower than whatever it caches on, etc.)
+
+
+Deadlock avoidance
+==================
If no directory is its own ancestor, the scheme above is deadlock-free.
Proof:
- First of all, at any moment we have a linear ordering of the
- objects - A < B iff (A is an ancestor of B) or (B is not an ancestor
- of A and ptr(A) < ptr(B)).
-
- That ordering can change. However, the following is true:
-
-(1) if object removal or non-cross-directory rename holds lock on A and
- attempts to acquire lock on B, A will remain the parent of B until we
- acquire the lock on B. (Proof: only cross-directory rename can change
- the parent of object and it would have to lock the parent).
-
-(2) if cross-directory rename holds the lock on filesystem, order will not
- change until rename acquires all locks. (Proof: other cross-directory
- renames will be blocked on filesystem lock and we don't start changing
- the order until we had acquired all locks).
-
-(3) locks on non-directory objects are acquired only after locks on
- directory objects, and are acquired in inode pointer order.
- (Proof: all operations but renames take lock on at most one
- non-directory object, except renames, which take locks on source and
- target in inode pointer order in the case they are not directories.)
-
-Now consider the minimal deadlock. Each process is blocked on
-attempt to acquire some lock and already holds at least one lock. Let's
-consider the set of contended locks. First of all, filesystem lock is
-not contended, since any process blocked on it is not holding any locks.
-Thus all processes are blocked on ->i_rwsem.
-
-By (3), any process holding a non-directory lock can only be
-waiting on another non-directory lock with a larger address. Therefore
-the process holding the "largest" such lock can always make progress, and
-non-directory objects are not included in the set of contended locks.
-
-Thus link creation can't be a part of deadlock - it can't be
-blocked on source and it means that it doesn't hold any locks.
-
-Any contended object is either held by cross-directory rename or
-has a child that is also contended. Indeed, suppose that it is held by
-operation other than cross-directory rename. Then the lock this operation
-is blocked on belongs to child of that object due to (1).
-
-It means that one of the operations is cross-directory rename.
-Otherwise the set of contended objects would be infinite - each of them
-would have a contended child and we had assumed that no object is its
-own descendent. Moreover, there is exactly one cross-directory rename
-(see above).
-
-Consider the object blocking the cross-directory rename. One
-of its descendents is locked by cross-directory rename (otherwise we
-would again have an infinite set of contended objects). But that
-means that cross-directory rename is taking locks out of order. Due
-to (2) the order hadn't changed since we had acquired filesystem lock.
-But locking rules for cross-directory rename guarantee that we do not
-try to acquire lock on descendent before the lock on ancestor.
-Contradiction. I.e. deadlock is impossible. Q.E.D.
-
+There is a ranking on the locks, such that all primitives take
+them in order of non-decreasing rank. Namely,
+
+ * rank ->i_rwsem of non-directories on given filesystem in inode pointer
+ order.
+ * put ->i_rwsem of all directories on a filesystem at the same rank,
+ lower than ->i_rwsem of any non-directory on the same filesystem.
+ * put ->s_vfs_rename_mutex at rank lower than that of any ->i_rwsem
+ on the same filesystem.
+ * among the locks on different filesystems use the relative
+ rank of those filesystems.
+
+For example, if we have NFS filesystem caching on a local one, we have
+
+ 1. ->s_vfs_rename_mutex of NFS filesystem
+ 2. ->i_rwsem of directories on that NFS filesystem, same rank for all
+ 3. ->i_rwsem of non-directories on that filesystem, in order of
+ increasing address of inode
+ 4. ->s_vfs_rename_mutex of local filesystem
+ 5. ->i_rwsem of directories on the local filesystem, same rank for all
+ 6. ->i_rwsem of non-directories on local filesystem, in order of
+ increasing address of inode.
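
For illustration only, the ranking above can be encoded as a comparator;
the types and names here are ours, not the kernel's::

	enum lock_kind { RENAME_MUTEX, DIR_RWSEM, NONDIR_RWSEM };

	struct lock_id {
		int fs_rank;		/* lower-ranked filesystem first */
		enum lock_kind kind;	/* mutex < directory < non-directory */
		unsigned long ptr;	/* inode address, non-directories only */
	};

	/* Returns <0 if a must be taken before b, >0 for the converse. */
	static int lock_rank_cmp(const struct lock_id *a,
				 const struct lock_id *b)
	{
		if (a->fs_rank != b->fs_rank)
			return a->fs_rank < b->fs_rank ? -1 : 1;
		if (a->kind != b->kind)
			return a->kind < b->kind ? -1 : 1;
		if (a->kind == NONDIR_RWSEM && a->ptr != b->ptr)
			return a->ptr < b->ptr ? -1 : 1;
		return 0;	/* equal rank, e.g. two directories */
	}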
+
+It's easy to verify that operations never take a lock with rank
+lower than that of an already held lock.
+
+Suppose deadlocks are possible. Consider the minimal deadlocked
+set of threads. It is a cycle of several threads, each blocked on a lock
+held by the next thread in the cycle.
+
+Since the locking order is consistent with the ranking, all
+contended locks in the minimal deadlock will be of the same rank,
+i.e. they all will be ->i_rwsem of directories on the same filesystem.
+Moreover, without loss of generality we can assume that all operations
+are done directly to that filesystem and none of them has actually
+reached the method call.
+
+In other words, we have a cycle of threads, T1,..., Tn,
+and the same number of directories (D1,...,Dn) such that
+
+ T1 is blocked on D1 which is held by T2
+
+ T2 is blocked on D2 which is held by T3
+
+ ...
+
+ Tn is blocked on Dn which is held by T1.
+
+Each operation in the minimal cycle must have locked at least
+one directory and blocked on attempt to lock another. That leaves
+only 3 possible operations: directory removal (locks parent, then
+child), same-directory rename killing a subdirectory (ditto) and
+cross-directory rename of some sort.
+
+There must be a cross-directory rename in the set; indeed,
+if all operations had been of the "lock parent, then child" sort
+we would have Dn a parent of D1, which is a parent of D2, which is
+a parent of D3, ..., which is a parent of Dn. Relationships couldn't
+have changed since the moment directory locks had been acquired,
+so they would all hold simultaneously at the deadlock time and
+we would have a loop.
+
+Since all operations are on the same filesystem, there can't be
+more than one cross-directory rename among them. Without loss of
+generality we can assume that T1 is the one doing a cross-directory
+rename and everything else is of the "lock parent, then child" sort.
+
+In other words, we have a cross-directory rename that locked
+Dn and blocked on attempt to lock D1, which is a parent of D2, which is
+a parent of D3, ..., which is a parent of Dn. Relationships between
+D1,...,Dn all hold simultaneously at the deadlock time. Moreover,
+cross-directory rename does not get to locking any directories until it
+has acquired filesystem lock and verified that directories involved have
+a common ancestor, which guarantees that ancestry relationships between
+all of them had been stable.
+
+Consider the order in which directories are locked by the
+cross-directory rename; parents first, then possibly their children.
+Dn and D1 would have to be among those, with Dn locked before D1.
+Which pair could it be?
+
+It can't be the parents - indeed, since D1 is an ancestor of Dn,
+it would be the first parent to be locked. Therefore at least one of the
+children must be involved and thus neither of them could be a descendent
+of another - otherwise the operation would not have progressed past
+locking the parents.
+
+It can't be a parent and its child; otherwise we would've had
+a loop, since the parents are locked before the children, so the parent
+would have to be a descendent of its child.
+
+It can't be a parent and a child of another parent either.
+Otherwise the child of the parent in question would've been a descendent
+of another child.
+
+That leaves only one possibility - namely, both Dn and D1 are
+among the children, in some order. But that is also impossible, since
+neither of the children is a descendent of another.
+
+That concludes the proof, since the set of operations with the
+properties required for a minimal deadlock can not exist.
+
+Note that the check for having a common ancestor in cross-directory
+rename is crucial - without it a deadlock would be possible. Indeed,
+suppose the parents are initially in different trees; we would lock the
+parent of source, then try to lock the parent of target, only to have
+an unrelated lookup splice a distant ancestor of source to some distant
+descendent of the parent of target. At that point we have cross-directory
+rename holding the lock on parent of source and trying to lock its
+distant ancestor. Add a bunch of rmdir() attempts on all directories
+in between (all of those would fail with -ENOTEMPTY, had they ever gotten
+the locks) and voila - we have a deadlock.
+
+Loop avoidance
+==============
These operations are guaranteed to avoid loop creation. Indeed,
the only operation that could introduce loops is cross-directory rename.
-Since the only new (parent, child) pair added by rename() is (new parent,
-source), such loop would have to contain these objects and the rest of it
-would have to exist before rename(). I.e. at the moment of loop creation
-rename() responsible for that would be holding filesystem lock and new parent
-would have to be equal to or a descendent of source. But that means that
-new parent had been equal to or a descendent of source since the moment when
-we had acquired filesystem lock and rename() would fail with -ELOOP in that
-case.
+Suppose after the operation there is a loop; since there hadn't been such
+loops before the operation, at least one of the nodes in that loop must've
+had its parent changed. In other words, the loop must be passing through
+the source or, in case of exchange, possibly the target.
+
+Since the operation has succeeded, neither source nor target could have
+been ancestors of each other. Therefore the chain of ancestors starting
+in the parent of source could not have passed through the target and
+vice versa. On the other hand, the chain of ancestors of any node could
+not have passed through the node itself, or we would've had a loop before
+the operation. But everything other than source and target has kept
+the parent after the operation, so the operation does not change the
+chains of ancestors of (ex-)parents of source and target. In particular,
+those chains must end after a finite number of steps.
+
+Now consider the loop created by the operation. It passes through either
+source or target; the next node in the loop would be the ex-parent of
+target or source resp. After that the loop would follow the chain of
+ancestors of that parent. But as we have just shown, that chain must
+end after a finite number of steps, which means that it can't be a part
+of any loop. Q.E.D.
While this locking scheme works for arbitrary DAGs, it relies on
ability to check that directory is a descendent of another object. Current
diff --git a/Documentation/filesystems/dlmfs.rst b/Documentation/filesystems/dlmfs.rst
index 7e2b1fd471d7..70d4e48242c3 100644
--- a/Documentation/filesystems/dlmfs.rst
+++ b/Documentation/filesystems/dlmfs.rst
@@ -36,7 +36,7 @@ None
Usage
=====
-If you're just interested in OCFS2, then please see ocfs2.txt. The
+If you're just interested in OCFS2, then please see ocfs2.rst. The
rest of this document will be geared towards those who want to use
dlmfs for easy to setup and easy to use clustered locking in
userspace.
diff --git a/Documentation/filesystems/efivarfs.rst b/Documentation/filesystems/efivarfs.rst
index 0551985821b8..f646c3f0980f 100644
--- a/Documentation/filesystems/efivarfs.rst
+++ b/Documentation/filesystems/efivarfs.rst
@@ -40,4 +40,4 @@ accidentally.
*See also:*
- Documentation/admin-guide/acpi/ssdt-overlays.rst
-- Documentation/ABI/stable/sysfs-firmware-efi-vars
+- Documentation/ABI/removed/sysfs-firmware-efi-vars
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index 4654ee57c1d5..08194f194b94 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -58,12 +58,14 @@ Here are the main features of EROFS:
- Support extended attributes as an option;
+ - Support a bloom filter that speeds up negative extended attribute lookups;
+
- Support POSIX.1e ACLs by using extended attributes;
- Support transparent data compression as an option:
- LZ4 and MicroLZMA algorithms can be used on a per-file basis; In addition,
- inplace decompression is also supported to avoid bounce compressed buffers
- and page cache thrashing.
+ LZ4, MicroLZMA and DEFLATE algorithms can be used on a per-file basis; In
+ addition, inplace decompression is also supported to avoid bounce compressed
+ buffers and unnecessary page cache thrashing.
- Support chunk-based data deduplication and rolling-hash compressed data
deduplication;
@@ -73,7 +75,7 @@ Here are the main features of EROFS:
- Support merging tail-end data into a special inode as fragments.
- - Support large folios for uncompressed files.
+ - Support large folios to make use of THPs (Transparent Hugepages);
- Support direct I/O on uncompressed files to avoid double caching for loop
devices;
@@ -89,6 +91,10 @@ compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):
- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
+For more information, please also refer to the documentation site:
+
+- https://erofs.docs.kernel.org
+
Bugs and patches are welcome, please kindly help us and send to the following
linux-erofs mailing list:
@@ -110,7 +116,7 @@ cache_strategy=%s Select a strategy for cached decompression from now on:
cluster for further reading. It still does
in-place I/O decompression for the rest
compressed physical clusters;
- readaround Cache the both ends of incomplete compressed
+ readaround Cache both ends of incomplete compressed
physical clusters for further reading.
It still does in-place I/O decompression
for the rest compressed physical clusters.
@@ -122,6 +128,7 @@ device=%s Specify a path to an extra device to be used together.
fsid=%s Specify a filesystem image ID for Fscache back-end.
domain_id=%s Specify a domain ID in fscache mode so that different images
with the same blobs under a given domain ID can share storage.
+fsoffset=%llu Specify block-aligned filesystem offset for the primary device.
=================== =========================================================
Sysfs Entries
@@ -197,7 +204,7 @@ may not. All metadatas can be now observed in two different spaces (views):
| |
|__________________| 64 bytes
- Xattrs, extents, data inline are followed by the corresponding inode with
+ Xattrs, extents, data inline are placed after the corresponding inode with
proper alignment, and they could be optional for different data mappings.
_currently_ total 5 data layouts are supported:
@@ -268,6 +275,38 @@ details.)
By the way, chunk-based files are all uncompressed for now.
+Long extended attribute name prefixes
+-------------------------------------
+There are use cases where extended attributes with different values can have
+only a few common prefixes (such as overlayfs xattrs). The predefined prefixes
+work inefficiently in both image size and runtime performance in such cases.
+
+The long xattr name prefixes feature is introduced to address this issue. The
+overall idea is that, apart from the existing predefined prefixes, the xattr
+entry could also refer to user-specified long xattr name prefixes, e.g.
+"trusted.overlay.".
+
+When referring to a long xattr name prefix, the highest bit (bit 7) of
+erofs_xattr_entry.e_name_index is set, while the lower bits (bit 0-6) as a whole
+represent the index of the referred long name prefix among all long name
+prefixes. Therefore, only the trailing part of the name apart from the long
+xattr name prefix is stored in erofs_xattr_entry.e_name, which could be empty if
+the full xattr name matches exactly as its long xattr name prefix.
+
+All long xattr prefixes are stored one by one in the packed inode as long as
+the packed inode is valid, or in the meta inode otherwise. The
+xattr_prefix_count (of the on-disk superblock) indicates the total number of
+long xattr name prefixes, while (xattr_prefix_start * 4) indicates the start
+offset of long name prefixes in the packed/meta inode. Note that, long extended
+attribute name prefixes are disabled if xattr_prefix_count is 0.
+
+Each long name prefix is stored in the format: ALIGN({__le16 len, data}, 4),
+where len represents the total size of the data part. The data part is actually
+represented by 'struct erofs_xattr_long_prefix', where base_index represents the
+index of the predefined xattr name prefix, e.g. EROFS_XATTR_INDEX_TRUSTED for
+"trusted.overlay." long name prefix, while the infix string keeps the string
+after stripping the short prefix, e.g. "overlay." for the example above.
+
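
As a hedged sketch of the decoding rule just described (the masks are
written out literally rather than via the erofs_fs.h constants)::

	/* e_name_index is the name-index byte of an on-disk xattr entry. */
	static bool long_prefix_index(__u8 e_name_index, unsigned int *idx)
	{
		if (!(e_name_index & 0x80))	/* bit 7: long prefix used? */
			return false;
		*idx = e_name_index & 0x7f;	/* bits 0-6: prefix number */
		return true;
	}
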
Data compression
----------------
EROFS implements fixed-sized output compression which generates fixed-sized
diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
new file mode 100644
index 000000000000..ae8995740aa8
--- /dev/null
+++ b/Documentation/filesystems/ext4/atomic_writes.rst
@@ -0,0 +1,225 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _atomic_writes:
+
+Atomic Block Writes
+-------------------------
+
+Introduction
+~~~~~~~~~~~~
+
+Atomic (untorn) block writes ensure that either the entire write is committed
+to disk or none of it is. This prevents "torn writes" during power loss or
+system crashes. The ext4 filesystem supports atomic writes (only with Direct
+I/O) on regular files with extents, provided the underlying storage device
+supports hardware atomic writes. This is supported in the following two ways:
+
+1. **Single-fsblock Atomic Writes**:
+ EXT4 supports atomic write operations with a single filesystem block since
+ v6.13. In this the atomic write unit minimum and maximum sizes are both set
+ to filesystem blocksize.
+ e.g. doing an atomic write of 16KB with a 16KB filesystem blocksize on a
+ 64KB pagesize system is possible.
+
+2. **Multi-fsblock Atomic Writes with Bigalloc**:
+ EXT4 now also supports atomic writes spanning multiple filesystem blocks
+ using a feature known as bigalloc. The atomic write unit's minimum and
+ maximum sizes are determined by the filesystem block size and cluster size,
+ based on the underlying device’s supported atomic write unit limits.
+
+Requirements
+~~~~~~~~~~~~
+
+Basic requirements for atomic writes in ext4:
+
+ 1. The extents feature must be enabled (default for ext4)
+ 2. The underlying block device must support atomic writes
+ 3. For single-fsblock atomic writes:
+
+ 1. A filesystem with appropriate block size (up to the page size)
+ 4. For multi-fsblock atomic writes:
+
+ 1. The bigalloc feature must be enabled
+ 2. The cluster size must be appropriately configured
+
+NOTE: EXT4 does not support software or COW based atomic write, which means
+atomic writes on ext4 are only supported if underlying storage device supports
+it.
+
+Multi-fsblock Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bigalloc feature changes ext4 to allocate in units of multiple filesystem
+blocks, also known as clusters. With bigalloc each bit within block bitmap
+represents a cluster (power of 2 number of blocks) rather than individual
+filesystem blocks.
+EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
+following constraints. The minimum atomic write size is the larger of the fs
+block size and the minimum hardware atomic write unit, and the maximum atomic
+write size is the smaller of the bigalloc cluster size and the maximum hardware
+atomic write unit. Bigalloc ensures that all allocations are aligned to the
+cluster size, which satisfies the LBA alignment requirements of the hardware
+device if the start of the partition/logical volume is itself aligned correctly.
+
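
For illustration, a minimal sketch of that derivation (the names are
ours, not ext4's internals):

.. code-block:: c

    /* Larger of the filesystem block size and the device minimum. */
    static unsigned int awu_min_sketch(unsigned int fsblock, unsigned int hw_min)
    {
            return fsblock > hw_min ? fsblock : hw_min;
    }

    /* Smaller of the bigalloc cluster size and the device maximum. */
    static unsigned int awu_max_sketch(unsigned int cluster, unsigned int hw_max)
    {
            return cluster < hw_max ? cluster : hw_max;
    }
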
+Here is the block allocation strategy in bigalloc for atomic writes:
+
+ * For regions with fully mapped extents, no additional work is needed
+ * For append writes, a new mapped extent is allocated
+ * For regions that are entirely holes, unwritten extent is created
+ * For large unwritten extents, the extent gets split into two unwritten
+ extents of appropriate requested size
+ * For mixed mapping regions (combinations of holes, unwritten extents, or
+ mapped extents), ext4_map_blocks() is called in a loop with
+ EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
+ mapped extent by writing zeroes to it and converting any unwritten extents to
+ written, if found within the range.
+
+Note: Writing on a single contiguous underlying extent, whether mapped or
+unwritten, is not inherently problematic. However, writing to a mixed mapping
+region (i.e. one containing a combination of mapped and unwritten extents)
+must be avoided when performing atomic writes.
+
+The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
+flag, require that either all data is written or none at all. In the event of
+a system crash or unexpected power loss during the write operation, the affected
+region (when later read) must reflect either the complete old data or the
+complete new data, but never a mix of both.
+
+To enforce this guarantee, we ensure that the write target is backed by
+a single, contiguous extent before any data is written. This is critical because
+ext4 defers the conversion of unwritten extents to written extents until the I/O
+completion path (typically in ->end_io()). If a write is allowed to proceed over
+a mixed mapping region (with mapped and unwritten extents) and a failure occurs
+mid-write, the system could observe partially updated regions after reboot, i.e.
+new data over mapped areas, and stale (old) data over unwritten extents that
+were never marked written. This violates the atomicity and/or torn write
+prevention guarantee.
+
+To prevent such torn writes, ext4 proactively allocates a single contiguous
+extent for the entire requested region in ``ext4_iomap_alloc`` via
+``ext4_map_blocks_atomic()``. EXT4 also force commits the current journalling
+transaction in case the allocation is done over a mixed mapping. This ensures
+any pending metadata updates (like unwritten to written extent conversion) in
+this range are in a consistent state with the file data blocks, before
+performing the actual write I/O. If the commit fails, the whole I/O must be
+aborted to prevent any possible torn writes.
+Only after this step, the actual data write operation is performed by the iomap.
+
+Handling Split Extents Across Leaf Blocks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There can be a special edge case where we have logically and physically
+contiguous extents stored in separate leaf nodes of the on-disk extent tree.
+This occurs because on-disk extent tree merges only happen within leaf
+blocks, except for the case where a 2-level tree can get merged and
+collapsed entirely into the inode.
+If such a layout exists and, in the worst case, the extent status cache entries
+are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
+a single contiguous extent for these split leaf extents.
+
+To address this edge case, a new get block flag,
+``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
+``ext4_map_query_blocks()`` lookup behavior.
+
+This new get block flag allows ``ext4_map_blocks()`` to first check if there is
+an entry in the extent status cache for the full range.
+If not present, it consults the on-disk extent tree using
+``ext4_map_query_blocks()``.
+If the located extent is at the end of a leaf node, it probes the next logical
+block (lblk) to detect a contiguous extent in the adjacent leaf.
+
+For now, only one additional leaf block is queried to maintain efficiency, as
+atomic writes are typically constrained to small sizes
+(e.g. [blocksize, clustersize]).
+
+
+Handling Journal Transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To support multi-fsblock atomic writes, we ensure enough journal credits are
+reserved during:
+
+ 1. Block allocation time in ``ext4_iomap_alloc()``. We first query whether
+ there could be a mixed mapping for the underlying requested range. If so,
+ we reserve credits for up to ``m_len`` blocks, assuming every alternate
+ block can be an unwritten extent followed by a hole.
+
+ 2. During the ``->end_io()`` call, we make sure a single transaction is
+ started for the unwritten-to-written conversion. The conversion loop is
+ mainly required to handle a split extent across leaf blocks.
+
+How to
+~~~~~~
+
+Creating Filesystems with Atomic Write Support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+First check the atomic write units supported by the block device.
+See :ref:`atomic_write_bdev_support` for more details.
+
+For single-fsblock atomic writes with a larger block size
+(on systems with block size < page size):
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with a 16KB block size
+ # (requires page size >= 16KB)
+ mkfs.ext4 -b 16384 /dev/device
+
+For multi-fsblock atomic writes with bigalloc:
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with bigalloc and 64KB cluster size
+ mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
+
+Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
+and ``-O bigalloc`` enables the bigalloc feature.
+
+Application Interface
+^^^^^^^^^^^^^^^^^^^^^
+
+Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
+to perform atomic writes:
+
+.. code-block:: c
+
+ pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
+
+The write must be aligned to the filesystem's block size and not exceed the
+filesystem's maximum atomic write unit size.
+See ``generic_atomic_write_valid()`` for more details.
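+
+A fuller sketch follows. This is illustrative only: the path and the 16 KiB
+write size are assumptions, ``RWF_ATOMIC`` needs Linux 6.11+ UAPI headers,
+and the example opens the file with ``O_DIRECT`` and a naturally aligned
+buffer, since atomic writes are typically issued via direct I/O:
+
+.. code-block:: c
+
+    #define _GNU_SOURCE
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <string.h>
+    #include <sys/uio.h>
+
+    int main(void)
+    {
+        size_t len = 16384;   /* must fit within the advertised unit range */
+        void *buf;
+        int fd = open("/mnt/ext4/testfile",
+                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
+
+        if (fd < 0 || posix_memalign(&buf, len, len) != 0)
+            return 1;
+        memset(buf, 0xab, len);
+
+        struct iovec iov = { .iov_base = buf, .iov_len = len };
+        /* Offset 0 keeps the write naturally aligned for its size. */
+        if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) != (ssize_t)len)
+            perror("pwritev2(RWF_ATOMIC)");
+        return 0;
+    }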
+
+The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag can provide
+the following details:
+
+ * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
+ * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
+ * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
+ separate memory buffers that can be gathered into a write operation
+ (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
+
+The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
+writes are supported.
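+
+A minimal sketch of such a query (assuming kernel and glibc headers new
+enough, i.e. Linux 6.11+, to provide ``STATX_WRITE_ATOMIC`` and the
+``stx_atomic_write_*`` fields):
+
+.. code-block:: c
+
+    #define _GNU_SOURCE
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <sys/stat.h>
+
+    int main(int argc, char **argv)
+    {
+        struct statx stx;
+
+        if (argc < 2 ||
+            statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx) != 0)
+            return 1;
+        if (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)
+            printf("atomic unit min %u, max %u, max segments %u\n",
+                   stx.stx_atomic_write_unit_min,
+                   stx.stx_atomic_write_unit_max,
+                   stx.stx_atomic_write_segments_max);
+        return 0;
+    }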
+
+.. _atomic_write_bdev_support:
+
+Hardware Support
+~~~~~~~~~~~~~~~~
+
+The underlying storage device must support atomic write operations.
+Modern NVMe and SCSI devices often provide this capability.
+The Linux kernel exposes this information through sysfs:
+
+* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
+* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
+
+Nonzero values for these attributes indicate that the device supports
+atomic writes.
+
+See Also
+~~~~~~~~
+
+* :doc:`bigalloc` - Documentation on the bigalloc feature
+* :doc:`allocators` - Documentation on block allocation in ext4
+* Support for atomic block writes in 6.13:
+ https://lwn.net/Articles/1009298/
diff --git a/Documentation/filesystems/ext4/bitmaps.rst b/Documentation/filesystems/ext4/bitmaps.rst
index 91c45d86e9bb..9d7d7b083a25 100644
--- a/Documentation/filesystems/ext4/bitmaps.rst
+++ b/Documentation/filesystems/ext4/bitmaps.rst
@@ -19,10 +19,3 @@ necessarily the case that no blocks are in use -- if ``meta_bg`` is set,
the bitmaps and group descriptor live inside the group. Unfortunately,
ext2fs_test_block_bitmap2() will return '0' for those locations,
which produces confusing debugfs output.
-
-Inode Table
------------
-Inode tables are statically allocated at mkfs time. Each block group
-descriptor points to the start of the table, and the superblock records
-the number of inodes per group. See the section on inodes for more
-information.
diff --git a/Documentation/filesystems/ext4/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst
index ed5a5cac6d40..7cbf0b2b778e 100644
--- a/Documentation/filesystems/ext4/blockgroup.rst
+++ b/Documentation/filesystems/ext4/blockgroup.rst
@@ -1,7 +1,10 @@
.. SPDX-License-Identifier: GPL-2.0
+Block Groups
+------------
+
Layout
-------
+~~~~~~
The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):
@@ -60,7 +63,7 @@ groups (flex_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.
Flexible Block Groups
----------------------
+~~~~~~~~~~~~~~~~~~~~~
Starting in ext4, there is a new feature called flexible block groups
(flex_bg). In a flex_bg, several block groups are tied together as one
@@ -78,7 +81,7 @@ if flex_bg is enabled. The number of block groups that make up a
flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
Meta Block Groups
------------------
+~~~~~~~~~~~~~~~~~
Without the option META_BG, for safety concerns, all block group
descriptors copies are kept in the first block group. Given the default
@@ -117,7 +120,7 @@ Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.
Lazy Block Group Initialization
--------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A new feature for ext4 are three block group descriptor flags that
enable mkfs to skip initializing other parts of the block group
diff --git a/Documentation/filesystems/ext4/directory.rst b/Documentation/filesystems/ext4/directory.rst
index 6eece8e31df8..9b003a4d453f 100644
--- a/Documentation/filesystems/ext4/directory.rst
+++ b/Documentation/filesystems/ext4/directory.rst
@@ -183,10 +183,10 @@ in the place where the name normally goes. The structure is
- det_checksum
- Directory leaf block checksum.
-The leaf directory block checksum is calculated against the FS UUID, the
-directory's inode number, the directory's inode generation number, and
-the entire directory entry block up to (but not including) the fake
-directory entry.
+The leaf directory block checksum is calculated against the FS UUID (or
+the checksum seed, if that feature is enabled for the fs), the directory's
+inode number, the directory's inode generation number, and the entire
+directory entry block up to (but not including) the fake directory entry.
Hash Tree Directories
~~~~~~~~~~~~~~~~~~~~~
@@ -196,12 +196,12 @@ new feature was added to ext3 to provide a faster (but peculiar)
balanced tree keyed off a hash of the directory entry name. If the
EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a
hashed btree (htree) to organize and find directory entries. For
-backwards read-only compatibility with ext2, this tree is actually
-hidden inside the directory file, masquerading as “empty” directory data
-blocks! It was stated previously that the end of the linear directory
-entry table was signified with an entry pointing to inode 0; this is
-(ab)used to fool the old linear-scan algorithm into thinking that the
-rest of the directory block is empty so that it moves on.
+backwards read-only compatibility with ext2, interior tree nodes are actually
+hidden inside the directory file, masquerading as “empty” directory entries
+spanning the whole block. It was stated previously that directory entries
+with the inode set to 0 are treated as unused entries; this is (ab)used to
+fool the old linear-scan algorithm into skipping over those blocks containing
+the interior tree node data.
The root of the tree always lives in the first data block of the
directory. By ext2 custom, the '.' and '..' entries must appear at the
@@ -209,24 +209,24 @@ beginning of this first block, so they are put here as two
``struct ext4_dir_entry_2`` s and not stored in the tree. The rest of
the root node contains metadata about the tree and finally a hash->block
map to find nodes that are lower in the htree. If
-``dx_root.info.indirect_levels`` is non-zero then the htree has two
-levels; the data block pointed to by the root node's map is an interior
-node, which is indexed by a minor hash. Interior nodes in this tree
-contains a zeroed out ``struct ext4_dir_entry_2`` followed by a
-minor_hash->block map to find leafe nodes. Leaf nodes contain a linear
-array of all ``struct ext4_dir_entry_2``; all of these entries
-(presumably) hash to the same value. If there is an overflow, the
-entries simply overflow into the next leaf node, and the
-least-significant bit of the hash (in the interior node map) that gets
-us to this next leaf node is set.
-
-To traverse the directory as a htree, the code calculates the hash of
-the desired file name and uses it to find the corresponding block
-number. If the tree is flat, the block is a linear array of directory
-entries that can be searched; otherwise, the minor hash of the file name
-is computed and used against this second block to find the corresponding
-third block number. That third block number will be a linear array of
-directory entries.
+``dx_root.info.indirect_levels`` is non-zero then the htree has that many
+levels and the blocks pointed to by the root node's map are interior nodes.
+These interior nodes have a zeroed out ``struct ext4_dir_entry_2`` followed by
+a hash->block map to find nodes of the next level. Leaf nodes look like
+classic linear directory blocks, but all of its entries have a hash value
+equal or greater than the indicated hash of the parent node.
+
+The actual hash value for an entry name is only 31 bits, the least-significant
+bit is set to 0. However, if there is a hash collision between directory
+entries, the least-significant bit may get set to 1 on interior nodes in the
+case where these two (or more) hash-colliding entries do not fit into one leaf
+node and must be split across multiple nodes.
+
+To look up a name in such a htree, the code calculates the hash of the desired
+file name and uses it to find the leaf node with the range of hash values the
+calculated hash falls into (in other words, a lookup works basically the same
+as it would in a B-Tree keyed by the hash value), and possibly also scanning
+the leaf nodes that follow (in tree order) in case of hash collisions.
To traverse the directory as a linear array (such as the old code does),
the code simply reads every data block in the directory. The blocks used
@@ -319,7 +319,8 @@ of a data block:
* - 0x24
- __le32
- block
- - The block number (within the directory file) that goes with hash=0.
+ - The block number (within the directory file) that lead to the left-most
+ leaf node, i.e. the leaf containing entries with the lowest hash values.
* - 0x28
- struct dx_entry
- entries[0]
@@ -442,7 +443,7 @@ The dx_tail structure is 8 bytes long and looks like this:
* - 0x0
- u32
- dt_reserved
- - Zero.
+ - Unused (but still part of the checksum curiously).
* - 0x4
- __le32
- dt_checksum
@@ -450,4 +451,4 @@ The dx_tail structure is 8 bytes long and looks like this:
The checksum is calculated against the FS UUID, the htree index header
(dx_root or dx_node), all of the htree indices (dx_entry) that are in
-use, and the tail block (dx_tail).
+use, and the tail block (dx_tail) with the dt_checksum initially set to 0.
diff --git a/Documentation/filesystems/ext4/dynamic.rst b/Documentation/filesystems/ext4/dynamic.rst
index bb0c84333341..bbad439aada2 100644
--- a/Documentation/filesystems/ext4/dynamic.rst
+++ b/Documentation/filesystems/ext4/dynamic.rst
@@ -6,7 +6,9 @@ Dynamic Structures
Dynamic metadata are created on the fly when files and blocks are
allocated to files.
-.. include:: inodes.rst
-.. include:: ifork.rst
-.. include:: directory.rst
-.. include:: attributes.rst
+.. toctree::
+
+ inodes
+ ifork
+ directory
+ attributes
diff --git a/Documentation/filesystems/ext4/globals.rst b/Documentation/filesystems/ext4/globals.rst
index b17418974fd3..c6a6abce818a 100644
--- a/Documentation/filesystems/ext4/globals.rst
+++ b/Documentation/filesystems/ext4/globals.rst
@@ -6,9 +6,12 @@ Global Structures
The filesystem is sharded into a number of block groups, each of which
have static metadata at fixed locations.
-.. include:: super.rst
-.. include:: group_descr.rst
-.. include:: bitmaps.rst
-.. include:: mmp.rst
-.. include:: journal.rst
-.. include:: orphan.rst
+.. toctree::
+
+ super
+ group_descr
+ bitmaps
+ inode_table
+ mmp
+ journal
+ orphan
diff --git a/Documentation/filesystems/ext4/index.rst b/Documentation/filesystems/ext4/index.rst
index 705d813d558f..1ff8150c50e9 100644
--- a/Documentation/filesystems/ext4/index.rst
+++ b/Documentation/filesystems/ext4/index.rst
@@ -5,7 +5,7 @@ ext4 Data Structures and Algorithms
===================================
.. toctree::
- :maxdepth: 6
+ :maxdepth: 2
:numbered:
about
diff --git a/Documentation/filesystems/ext4/inode_table.rst b/Documentation/filesystems/ext4/inode_table.rst
new file mode 100644
index 000000000000..f7900a52c0d5
--- /dev/null
+++ b/Documentation/filesystems/ext4/inode_table.rst
@@ -0,0 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Inode Table
+-----------
+
+Inode tables are statically allocated at mkfs time. Each block group
+descriptor points to the start of the table, and the superblock records
+the number of inodes per group. See :doc:`inode documentation <inodes>`
+for more information on inode table layout.
diff --git a/Documentation/filesystems/ext4/inodes.rst b/Documentation/filesystems/ext4/inodes.rst
index cfc6c1659931..55cd5c380e92 100644
--- a/Documentation/filesystems/ext4/inodes.rst
+++ b/Documentation/filesystems/ext4/inodes.rst
@@ -297,6 +297,8 @@ The ``i_flags`` field is a combination of these values:
- Inode has inline data (EXT4_INLINE_DATA_FL).
* - 0x20000000
- Create children with the same project ID (EXT4_PROJINHERIT_FL).
+ * - 0x40000000
+ - Use case-insensitive lookups for directory contents (EXT4_CASEFOLD_FL).
* - 0x80000000
- Reserved for ext4 library (EXT4_RESERVED_FL).
* -
diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
index 0fad6eda6e15..171c3963d7f6 100644
--- a/Documentation/filesystems/ext4/overview.rst
+++ b/Documentation/filesystems/ext4/overview.rst
@@ -16,12 +16,15 @@ All fields in ext4 are written to disk in little-endian order. HOWEVER,
all fields in jbd2 (the journal) are written to disk in big-endian
order.
-.. include:: blocks.rst
-.. include:: blockgroup.rst
-.. include:: special_inodes.rst
-.. include:: allocators.rst
-.. include:: checksums.rst
-.. include:: bigalloc.rst
-.. include:: inlinedata.rst
-.. include:: eainode.rst
-.. include:: verity.rst
+.. toctree::
+
+ blocks
+ blockgroup
+ special_inodes
+ allocators
+ checksums
+ bigalloc
+ inlinedata
+ eainode
+ verity
+ atomic_writes
diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst
index 0152888cac29..9a59cded9bd7 100644
--- a/Documentation/filesystems/ext4/super.rst
+++ b/Documentation/filesystems/ext4/super.rst
@@ -328,9 +328,13 @@ The ext4 superblock is laid out as follows in
- s_checksum_type
- Metadata checksum algorithm type. The only valid value is 1 (crc32c).
* - 0x176
- - __le16
- - s_reserved_pad
- -
+ - \_\_u8
+ - s\_encryption\_level
+ - Versioning level for encryption.
+ * - 0x177
+ - \_\_u8
+ - s\_reserved\_pad
+ - Padding to the next 32 bits.
* - 0x178
- __le64
- s_kbytes_written
@@ -466,9 +470,13 @@ The ext4 superblock is laid out as follows in
- s_last_error_time_hi
- Upper 8 bits of the s_last_error_time field.
* - 0x27A
- - __u8
- - s_pad[2]
- - Zero padding.
+ - \_\_u8
+ - s\_first\_error\_errcode
+ -
+ * - 0x27B
+ - \_\_u8
+ - s\_last\_error\_errcode
+ -
* - 0x27C
- __le16
- s_encoding
@@ -663,7 +671,9 @@ following:
* - 0x8000
- Data in inode (INCOMPAT_INLINE_DATA).
* - 0x10000
- - Encrypted inodes are present on the filesystem. (INCOMPAT_ENCRYPT).
+ - Encrypted inodes can be present. (INCOMPAT_ENCRYPT).
+ * - 0x20000
+ - Directories can be marked case-insensitive. (INCOMPAT_CASEFOLD).
.. _super_rocompat:
@@ -772,7 +782,7 @@ The ``s_default_mount_opts`` field is any combination of the following:
* - 0x0010
- Do not support 32-bit UIDs. (EXT4_DEFM_UID16)
* - 0x0020
- - All data and metadata are commited to the journal.
+ - All data and metadata are committed to the journal.
(EXT4_DEFM_JMODE_DATA)
* - 0x0040
- All data are flushed to the disk before metadata are committed to the
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index 9359978a5af2..cb90d1ae82d0 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -1,8 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0
-==========================================
-WHAT IS Flash-Friendly File System (F2FS)?
-==========================================
+=================================
+Flash-Friendly File System (F2FS)
+=================================
+
+Overview
+========
NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
been equipped on a variety systems ranging from mobile to server systems. Since
@@ -126,9 +129,7 @@ norecovery Disable the roll-forward recovery routine, mounted read-
discard/nodiscard Enable/disable real-time discard in f2fs, if discard is
enabled, f2fs will issue discard/TRIM commands when a
segment is cleaned.
-no_heap Disable heap-style segment allocation which finds free
- segments for data from the beginning of main area, while
- for node from the end of main area.
+heap/no_heap Deprecated.
nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
by default if CONFIG_F2FS_FS_XATTR is selected.
noacl Disable POSIX Access Control List. Note: acl is enabled
@@ -175,38 +176,48 @@ data_flush Enable data flushing before checkpoint in order to
persist data of regular and symlink.
reserve_root=%d Support configuring reserved space which is used for
allocation from a privileged user with specified uid or
- gid, unit: 4KB, the default limit is 0.2% of user blocks.
-resuid=%d The user ID which may use the reserved blocks.
-resgid=%d The group ID which may use the reserved blocks.
+ gid, unit: 4KB, the default limit is 12.5% of user blocks.
+reserve_node=%d Support configuring reserved nodes which are used for
+ allocation from a privileged user with specified uid or
+ gid, the default limit is 12.5% of all nodes.
+resuid=%d The user ID which may use the reserved blocks and nodes.
+resgid=%d The group ID which may use the reserved blocks and nodes.
fault_injection=%d Enable fault injection in all supported types with
specified injection rate.
fault_type=%d Support configuring fault injection type, should be
enabled with fault_injection option, fault type value
is shown below, it supports single or combined type.
- =================== ===========
- Type_Name Type_Value
- =================== ===========
- FAULT_KMALLOC 0x000000001
- FAULT_KVMALLOC 0x000000002
- FAULT_PAGE_ALLOC 0x000000004
- FAULT_PAGE_GET 0x000000008
- FAULT_ALLOC_BIO 0x000000010 (obsolete)
- FAULT_ALLOC_NID 0x000000020
- FAULT_ORPHAN 0x000000040
- FAULT_BLOCK 0x000000080
- FAULT_DIR_DEPTH 0x000000100
- FAULT_EVICT_INODE 0x000000200
- FAULT_TRUNCATE 0x000000400
- FAULT_READ_IO 0x000000800
- FAULT_CHECKPOINT 0x000001000
- FAULT_DISCARD 0x000002000
- FAULT_WRITE_IO 0x000004000
- FAULT_SLAB_ALLOC 0x000008000
- FAULT_DQUOT_INIT 0x000010000
- FAULT_LOCK_OP 0x000020000
- FAULT_BLKADDR 0x000040000
- =================== ===========
+ .. code-block:: none
+
+   =========================== ==========
+   Type_Name                   Type_Value
+   =========================== ==========
+   FAULT_KMALLOC               0x00000001
+   FAULT_KVMALLOC              0x00000002
+   FAULT_PAGE_ALLOC            0x00000004
+   FAULT_PAGE_GET              0x00000008
+   FAULT_ALLOC_BIO             0x00000010 (obsolete)
+   FAULT_ALLOC_NID             0x00000020
+   FAULT_ORPHAN                0x00000040
+   FAULT_BLOCK                 0x00000080
+   FAULT_DIR_DEPTH             0x00000100
+   FAULT_EVICT_INODE           0x00000200
+   FAULT_TRUNCATE              0x00000400
+   FAULT_READ_IO               0x00000800
+   FAULT_CHECKPOINT            0x00001000
+   FAULT_DISCARD               0x00002000
+   FAULT_WRITE_IO              0x00004000
+   FAULT_SLAB_ALLOC            0x00008000
+   FAULT_DQUOT_INIT            0x00010000
+   FAULT_LOCK_OP               0x00020000
+   FAULT_BLKADDR_VALIDITY      0x00040000
+   FAULT_BLKADDR_CONSISTENCE   0x00080000
+   FAULT_NO_SEGMENT            0x00100000
+   FAULT_INCONSISTENT_FOOTER   0x00200000
+   FAULT_TIMEOUT               0x00400000 (1000ms)
+   FAULT_VMALLOC               0x00800000
+   =========================== ==========
mode=%s Control block allocation mode which supports "adaptive"
and "lfs". In "lfs" mode, there should be no random
writes towards main area.
@@ -215,7 +226,7 @@ mode=%s Control block allocation mode which supports "adaptive"
fragmentation/after-GC situation itself. The developers use these
modes to understand filesystem fragmentation/after-GC condition well,
and eventually get some insights to handle them better.
- In "fragment:segment", f2fs allocates a new segment in ramdom
+ In "fragment:segment", f2fs allocates a new segment in random
position. With this, we can simulate the after-GC condition.
In "fragment:block", we can scatter block allocation with
"max_fragment_chunk" and "max_fragment_hole" sysfs nodes.
@@ -228,8 +239,6 @@ mode=%s Control block allocation mode which supports "adaptive"
option for more randomness.
Please, use these options for your experiments and we strongly
recommend to re-format the filesystem after using these options.
-io_bits=%u Set the bit size of write IO requests. It should be set
- with "mode=lfs".
usrquota Enable plain user disk quota accounting.
grpquota Enable plain group disk quota accounting.
prjquota Enable plain project quota accounting.
@@ -237,9 +246,9 @@ usrjquota=<file> Appoint specified file and type during mount, so that quota
grpjquota=<file> information can be properly updated during recovery flow,
prjjquota=<file> <quota file>: must be in root directory;
jqfmt=<quota type> <quota type>: [vfsold,vfsv0,vfsv1].
-offusrjquota Turn off user journalled quota.
-offgrpjquota Turn off group journalled quota.
-offprjjquota Turn off project journalled quota.
+usrjquota= Turn off user journalled quota.
+grpjquota= Turn off group journalled quota.
+prjjquota= Turn off project journalled quota.
quota Enable plain user disk quota accounting.
noquota Disable all plain disk quota option.
alloc_mode=%s Adjust block allocation policy, which supports "reuse"
@@ -260,7 +269,7 @@ test_dummy_encryption=%s
The argument may be either "v1" or "v2", in order to
select the corresponding fscrypt policy version.
checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable"
- to reenable checkpointing. Is enabled by default. While
+ to re-enable checkpointing. Is enabled by default. While
disabled, any unmounting or unexpected shutdowns will cause
the filesystem contents to appear as they did when the
filesystem was mounted with that option.
@@ -289,10 +298,15 @@ nocheckpoint_merge Disable checkpoint merge feature.
compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo",
"lz4", "zstd" and "lzo-rle" algorithm.
compress_algorithm=%s:%d Control compress algorithm and its compress level, now, only
- "lz4" and "zstd" support compress level config.
- algorithm level range
- lz4 3 - 16
- zstd 1 - 22
+ "lz4" and "zstd" support compress level config::
+
+ ========= ===========
+ algorithm level range
+ ========= ===========
+ lz4       3 - 16
+ zstd      1 - 22
+ ========= ===========
+
compress_log_size=%u Support configuring compress cluster size. The size will
be 4KB * (1 << %u). The default and minimum sizes are 16KB.
compress_extension=%s Support adding specified extension, so that f2fs can enable
@@ -356,17 +370,43 @@ errors=%s Specify f2fs behavior on critical errors. This supports modes:
panic immediately, continue without doing anything, and remount
the partition in read-only mode. By default it uses "continue"
mode.
- ====================== =============== =============== ========
- mode continue remount-ro panic
- ====================== =============== =============== ========
- access ops normal noraml N/A
- syscall errors -EIO -EROFS N/A
- mount option rw ro N/A
- pending dir write keep keep N/A
- pending non-dir write drop keep N/A
- pending node write drop keep N/A
- pending meta write keep keep N/A
- ====================== =============== =============== ========
+
+ .. code-block:: none
+
+   ====================== =============== =============== ========
+   mode                   continue        remount-ro      panic
+   ====================== =============== =============== ========
+   access ops             normal          normal          N/A
+   syscall errors         -EIO            -EROFS          N/A
+   mount option           rw              ro              N/A
+   pending dir write      keep            keep            N/A
+   pending non-dir write  drop            keep            N/A
+   pending node write     drop            keep            N/A
+   pending meta write     keep            keep            N/A
+   ====================== =============== =============== ========
+nat_bits Enable nat_bits feature to enhance full/empty nat blocks access,
+ by default it's disabled.
+lookup_mode=%s Control the directory lookup behavior for casefolded
+ directories. This option has no effect on directories
+ that do not have the casefold feature enabled.
+
+ .. code-block:: none
+
+   ================== ========================================
+   Value              Description
+   ================== ========================================
+   perf               (Default) Enforces a hash-only lookup.
+                      The linear search fallback is always
+                      disabled, ignoring the on-disk flag.
+   compat             Enables the linear search fallback for
+                      compatibility with directory entries
+                      created by older kernels that used a
+                      different case-folding algorithm.
+                      This mode ignores the on-disk flag.
+   auto               F2FS determines the mode based on the
+                      on-disk `SB_ENC_NO_COMPAT_FALLBACK_FL`
+                      flag.
+   ================== ========================================
======================== ============================================================
Debugfs Entries
@@ -480,7 +520,7 @@ Note: please refer to the manpage of dump.f2fs(8) to get full option list.
sload.f2fs
----------
-The sload.f2fs gives a way to insert files and directories in the exisiting disk
+The sload.f2fs gives a way to insert files and directories in the existing disk
image. This tool is useful when building f2fs images given compiled files.
Note: please refer to the manpage of sload.f2fs(8) to get full option list.
@@ -776,6 +816,37 @@ In order to identify whether the data in the victim segment are valid or not,
F2FS manages a bitmap. Each bit represents the validity of a block, and the
bitmap is composed of a bit stream covering whole blocks in main area.
+Write-hint Policy
+-----------------
+
+F2FS always sets the write hint according to the policy below.
+
+===================== ======================== ===================
+User                  F2FS                     Block
+===================== ======================== ===================
+N/A                   META                     WRITE_LIFE_NONE|REQ_META
+N/A                   HOT_NODE                 WRITE_LIFE_NONE
+N/A                   WARM_NODE                WRITE_LIFE_MEDIUM
+N/A                   COLD_NODE                WRITE_LIFE_LONG
+ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
+extension list        "                        "
+
+-- buffered io
+------------------------------------------------------------------
+N/A                   COLD_DATA                WRITE_LIFE_EXTREME
+N/A                   HOT_DATA                 WRITE_LIFE_SHORT
+N/A                   WARM_DATA                WRITE_LIFE_NOT_SET
+
+-- direct io
+------------------------------------------------------------------
+WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
+WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
+WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
+WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG
+===================== ======================== ===================
+
Fallocate(2) Policy
-------------------
@@ -792,7 +863,7 @@ Allocating disk space
as a method of optimally implementing that function.
However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to
-fallocate(fd, DEFAULT_MODE), it allocates on-disk block addressess having
+fallocate(fd, DEFAULT_MODE), it allocates on-disk block addresses having
zero or random data, which is useful to the below scenario where:
1. create(fd)
@@ -883,24 +954,26 @@ compression enabled files (refer to "Compression implementation" section for how
enable compression on a regular inode).
1) compress_mode=fs
-This is the default option. f2fs does automatic compression in the writeback of the
-compression enabled files.
+
+ This is the default option. f2fs does automatic compression in the writeback of the
+ compression enabled files.
2) compress_mode=user
-This disables the automatic compression and gives the user discretion of choosing the
-target file and the timing. The user can do manual compression/decompression on the
-compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
-ioctls like the below.
-To decompress a file,
+ This disables the automatic compression and gives the user discretion of choosing the
+ target file and the timing. The user can do manual compression/decompression on the
+ compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
+ ioctls like the below.
-fd = open(filename, O_WRONLY, 0);
-ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE);
+To decompress a file::
-To compress a file,
+ fd = open(filename, O_WRONLY, 0);
+ ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE);
-fd = open(filename, O_WRONLY, 0);
-ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE);
+To compress a file::
+
+ fd = open(filename, O_WRONLY, 0);
+ ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE);
NVMe Zoned Namespace devices
----------------------------
@@ -916,3 +989,47 @@ NVMe Zoned Namespace devices
can start before the zone-capacity and span across zone-capacity boundary.
Such spanning segments are also considered as usable segments. All blocks
past the zone-capacity are considered unusable in these segments.
+
+Device aliasing feature
+-----------------------
+
+f2fs can utilize a special file called a "device aliasing file". This file
+allows the entire storage device to be mapped with a single, large extent,
+not using the usual f2fs node structures. This mapped area is pinned and is
+primarily intended for holding reserved space.
+
+Essentially, this mechanism allows a portion of the f2fs area to be temporarily
+reserved and used by another filesystem or for different purposes. Once that
+external usage is complete, the device aliasing file can be deleted, releasing
+the reserved space back to F2FS for its own use.
+
+.. code-block::
+
+ # ls /dev/vd*
+ /dev/vdb (32GB) /dev/vdc (32GB)
+ # mkfs.ext4 /dev/vdc
+ # mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb
+ # mount /dev/vdb /mnt/f2fs
+ # ls -l /mnt/f2fs
+ vdc.file
+ # df -h
+ /dev/vdb 64G 33G 32G 52% /mnt/f2fs
+
+ # mount -o loop /dev/vdc /mnt/ext4
+ # df -h
+ /dev/vdb 64G 33G 32G 52% /mnt/f2fs
+ /dev/loop7 32G 24K 30G 1% /mnt/ext4
+ # umount /mnt/ext4
+
+ # f2fs_io getflags /mnt/f2fs/vdc.file
+ get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable
+ # f2fs_io setflags noimmutable /mnt/f2fs/vdc.file
+ get a flag on noimmutable ret=0, flags=800010
+ set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable
+ # rm /mnt/f2fs/vdc.file
+ # df -h
+ /dev/vdb 64G 753M 64G 2% /mnt/f2fs
+
+So, the key idea is that the user can do any file operations on /dev/vdc and
+reclaim the space after use, while the space is counted as /data.
+This does not require modifying the partition size or the filesystem format.
diff --git a/Documentation/filesystems/fiemap.rst b/Documentation/filesystems/fiemap.rst
index 93fc96f760aa..23b3ed229e49 100644
--- a/Documentation/filesystems/fiemap.rst
+++ b/Documentation/filesystems/fiemap.rst
@@ -12,21 +12,10 @@ returns a list of extents.
Request Basics
--------------
-A fiemap request is encoded within struct fiemap::
-
- struct fiemap {
- __u64 fm_start; /* logical offset (inclusive) at
- * which to start mapping (in) */
- __u64 fm_length; /* logical length of mapping which
- * userspace cares about (in) */
- __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
- __u32 fm_mapped_extents; /* number of extents that were
- * mapped (out) */
- __u32 fm_extent_count; /* size of fm_extents array (in) */
- __u32 fm_reserved;
- struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
- };
+A fiemap request is encoded within struct fiemap:
+
+.. kernel-doc:: include/uapi/linux/fiemap.h
+   :identifiers: fiemap
fm_start, and fm_length specify the logical range within the file
which the process would like mappings for. Extents returned mirror
@@ -60,6 +49,8 @@ FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inodes
extended attribute lookup tree, instead of its data tree.
+FIEMAP_FLAG_CACHE
+ This flag requests caching of the extents.
Extent Mapping
--------------
@@ -77,18 +68,10 @@ complete the requested range and will not have the FIEMAP_EXTENT_LAST
flag set (see the next section on extent flags).
Each extent is described by a single fiemap_extent structure as
-returned in fm_extents::
-
- struct fiemap_extent {
- __u64 fe_logical; /* logical offset in bytes for the start of
- * the extent */
- __u64 fe_physical; /* physical offset in bytes for the start
- * of the extent */
- __u64 fe_length; /* length in bytes for the extent */
- __u64 fe_reserved64[2];
- __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
- __u32 fe_reserved[3];
- };
+returned in fm_extents:
+
+.. kernel-doc:: include/uapi/linux/fiemap.h
+   :identifiers: fiemap_extent
All offsets and lengths are in bytes and mirror those on disk. It is valid
for an extents logical offset to start before the request or its logical
@@ -175,6 +158,8 @@ FIEMAP_EXTENT_MERGED
userspace would be highly inefficient, the kernel will try to merge most
adjacent blocks into 'extents'.
+FIEMAP_EXTENT_SHARED
+ This flag is set to indicate that the extent's space is shared with other
+ files.
VFS -> File System Implementation
---------------------------------
@@ -191,14 +176,10 @@ each discovered extent::
u64 len);
->fiemap is passed struct fiemap_extent_info which describes the
-fiemap request::
-
- struct fiemap_extent_info {
- unsigned int fi_flags; /* Flags as passed from user */
- unsigned int fi_extents_mapped; /* Number of mapped extents */
- unsigned int fi_extents_max; /* Size of fiemap_extent array */
- struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
- };
+fiemap request:
+
+.. kernel-doc:: include/linux/fiemap.h
+   :identifiers: fiemap_extent_info
It is intended that the file system should not need to access any of this
structure directly. Filesystem handlers should be tolerant to signals and return
diff --git a/Documentation/filesystems/files.rst b/Documentation/filesystems/files.rst
index bcf84459917f..eb770f891b27 100644
--- a/Documentation/filesystems/files.rst
+++ b/Documentation/filesystems/files.rst
@@ -62,7 +62,7 @@ the fdtable structure -
be held.
4. To look up the file structure given an fd, a reader
- must use either lookup_fd_rcu() or files_lookup_fd_rcu() APIs. These
+ must use either lookup_fdget_rcu() or files_lookup_fdget_rcu() APIs. These
take care of barrier requirements due to lock-free lookup.
An example::
@@ -70,43 +70,22 @@ the fdtable structure -
struct file *file;
rcu_read_lock();
- file = lookup_fd_rcu(fd);
- if (file) {
- ...
- }
- ....
+ file = lookup_fdget_rcu(fd);
rcu_read_unlock();
-
-5. Handling of the file structures is special. Since the look-up
- of the fd (fget()/fget_light()) are lock-free, it is possible
- that look-up may race with the last put() operation on the
- file structure. This is avoided using atomic_long_inc_not_zero()
- on ->f_count::
-
- rcu_read_lock();
- file = files_lookup_fd_rcu(files, fd);
if (file) {
- if (atomic_long_inc_not_zero(&file->f_count))
- *fput_needed = 1;
- else
- /* Didn't get the reference, someone's freed */
- file = NULL;
+ ...
+ fput(file);
}
- rcu_read_unlock();
....
- return file;
-
- atomic_long_inc_not_zero() detects if refcounts is already zero or
- goes to zero during increment. If it does, we fail
- fget()/fget_light().
-6. Since both fdtable and file structures can be looked up
+5. Since both fdtable and file structures can be looked up
lock-free, they must be installed using rcu_assign_pointer()
API. If they are looked up lock-free, rcu_dereference()
must be used. However it is advisable to use files_fdtable()
- and lookup_fd_rcu()/files_lookup_fd_rcu() which take care of these issues.
+ and lookup_fdget_rcu()/files_lookup_fdget_rcu() which take care of these
+ issues.
-7. While updating, the fdtable pointer must be looked up while
+6. While updating, the fdtable pointer must be looked up while
holding files->file_lock. If ->file_lock is dropped, then
another thread expand the files thereby creating a new
fdtable and making the earlier fdtable pointer stale.
@@ -126,3 +105,19 @@ the fdtable structure -
Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
the fdtable pointer (fdt) must be loaded after locate_fd().
+On newer kernels, rcu-based file lookup has been switched to rely on
+SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn't sufficient anymore
+to just acquire a reference to the file in question under rcu using
+atomic_long_inc_not_zero() since the file might have already been
+recycled and someone else might have bumped the reference. In other
+words, callers might see reference count bumps from newer users. For
+this reason it is necessary to verify that the pointer is the same
+before and after the reference count increment. This pattern can be seen
+in get_file_rcu() and __files_get_rcu().
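+
+As a rough illustration only (simplified pseudo-code for this document,
+not the kernel's actual helpers), the re-check pattern looks like::
+
+    rcu_read_lock();
+    for (;;) {
+            struct file *file = rcu_dereference(fdt->fd[fd]);
+
+            if (!file)
+                    break;
+            /* The object may be mid-free or recycled; try to get a ref. */
+            if (!atomic_long_inc_not_zero(&file->f_count))
+                    continue;       /* raced with the last put; retry */
+            /*
+             * Verify that the fd table entry still points at the same
+             * file. If not, the reference was taken on a recycled
+             * object: drop it and retry the lookup.
+             */
+            if (file != rcu_dereference(fdt->fd[fd])) {
+                    fput(file);
+                    continue;
+            }
+            break;  /* success: a valid reference to the right file */
+    }
+    rcu_read_unlock();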
+
+In addition, it isn't possible to access or check fields in struct file
+without first acquiring a reference on it under rcu lookup. Not doing
+that was always very dodgy and it was only usable for non-pointer data
+in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers
+either first acquire a reference or they must hold the files_lock of the
+fdtable.
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index eccd327e6df5..70af896822e1 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -31,15 +31,15 @@ However, except for filenames, fscrypt does not encrypt filesystem
metadata.
Unlike eCryptfs, which is a stacked filesystem, fscrypt is integrated
-directly into supported filesystems --- currently ext4, F2FS, and
-UBIFS. This allows encrypted files to be read and written without
-caching both the decrypted and encrypted pages in the pagecache,
-thereby nearly halving the memory used and bringing it in line with
-unencrypted files. Similarly, half as many dentries and inodes are
-needed. eCryptfs also limits encrypted filenames to 143 bytes,
-causing application compatibility issues; fscrypt allows the full 255
-bytes (NAME_MAX). Finally, unlike eCryptfs, the fscrypt API can be
-used by unprivileged users, with no need to mount anything.
+directly into supported filesystems --- currently ext4, F2FS, UBIFS,
+and CephFS. This allows encrypted files to be read and written
+without caching both the decrypted and encrypted pages in the
+pagecache, thereby nearly halving the memory used and bringing it in
+line with unencrypted files. Similarly, half as many dentries and
+inodes are needed. eCryptfs also limits encrypted filenames to 143
+bytes, causing application compatibility issues; fscrypt allows the
+full 255 bytes (NAME_MAX). Finally, unlike eCryptfs, the fscrypt API
+can be used by unprivileged users, with no need to mount anything.
fscrypt does not support encrypting files in-place. Instead, it
supports marking an empty directory as encrypted. Then, after
@@ -70,7 +70,7 @@ Online attacks
--------------
fscrypt (and storage encryption in general) can only provide limited
-protection, if any at all, against online attacks. In detail:
+protection against online attacks. In detail:
Side-channel attacks
~~~~~~~~~~~~~~~~~~~~
@@ -99,16 +99,23 @@ Therefore, any encryption-specific access control checks would merely
be enforced by kernel *code* and therefore would be largely redundant
with the wide variety of access control mechanisms already available.)
-Kernel memory compromise
-~~~~~~~~~~~~~~~~~~~~~~~~
+Read-only kernel memory compromise
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unless `hardware-wrapped keys`_ are used, an attacker who gains the
+ability to read from arbitrary kernel memory, e.g. by mounting a
+physical attack or by exploiting a kernel security vulnerability, can
+compromise all fscrypt keys that are currently in-use. This also
+extends to cold boot attacks; if the system is suddenly powered off,
+keys the system was using may remain in memory for a short time.
-An attacker who compromises the system enough to read from arbitrary
-memory, e.g. by mounting a physical attack or by exploiting a kernel
-security vulnerability, can compromise all encryption keys that are
-currently in use.
+However, if hardware-wrapped keys are used, then the fscrypt master
+keys and file contents encryption keys (but not other types of fscrypt
+subkeys such as filenames encryption keys) are protected from
+compromises of arbitrary kernel memory.
-However, fscrypt allows encryption keys to be removed from the kernel,
-which may protect them from later compromise.
+In addition, fscrypt allows encryption keys to be removed from the
+kernel, which may protect them from later compromise.
In more detail, the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (or the
FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl) can wipe a master
@@ -137,13 +144,29 @@ However, these ioctls have some limitations:
- In general, decrypted contents and filenames in the kernel VFS
caches are freed but not wiped. Therefore, portions thereof may be
recoverable from freed memory, even after the corresponding key(s)
- were wiped. To partially solve this, you can set
- CONFIG_PAGE_POISONING=y in your kernel config and add page_poison=1
- to your kernel command line. However, this has a performance cost.
-
-- Secret keys might still exist in CPU registers, in crypto
- accelerator hardware (if used by the crypto API to implement any of
- the algorithms), or in other places not explicitly considered here.
+ were wiped. To partially solve this, you can add init_on_free=1 to
+ your kernel command line. However, this has a performance cost.
+
+- Secret keys might still exist in CPU registers or in other places
+ not explicitly considered here.
+
+Full system compromise
+~~~~~~~~~~~~~~~~~~~~~~
+
+An attacker who gains "root" access and/or the ability to execute
+arbitrary kernel code can freely exfiltrate data that is protected by
+any in-use fscrypt keys. Thus, usually fscrypt provides no meaningful
+protection in this scenario. (Data that is protected by a key that is
+absent throughout the entire attack remains protected, modulo the
+limitations of key removal mentioned above in the case where the key
+was removed prior to the attack.)
+
+However, if `hardware-wrapped keys`_ are used, such attackers will be
+unable to exfiltrate the master keys or file contents keys in a form
+that will be usable after the system is powered off. This may be
+useful if the attacker is significantly time-limited and/or
+bandwidth-limited, so they can only exfiltrate some data and need to
+rely on a later offline attack to exfiltrate the rest of it.
Limitations of v1 policies
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -171,6 +194,10 @@ policies on all new encrypted directories.
Key hierarchy
=============
+Note: this section assumes the use of raw keys rather than
+hardware-wrapped keys. The use of hardware-wrapped keys modifies the
+key hierarchy slightly. For details, see `Hardware-wrapped keys`_.
+
Master Keys
-----------
@@ -261,9 +288,9 @@ DIRECT_KEY policies
The Adiantum encryption mode (see `Encryption modes and usage`_) is
suitable for both contents and filenames encryption, and it accepts
-long IVs --- long enough to hold both an 8-byte logical block number
-and a 16-byte per-file nonce. Also, the overhead of each Adiantum key
-is greater than that of an AES-256-XTS key.
+long IVs --- long enough to hold both an 8-byte data unit index and a
+16-byte per-file nonce. Also, the overhead of each Adiantum key is
+greater than that of an AES-256-XTS key.
Therefore, to improve performance and save memory, for Adiantum a
"direct key" configuration is supported. When the user has enabled
@@ -300,8 +327,8 @@ IV_INO_LBLK_32 policies
IV_INO_LBLK_32 policies work like IV_INO_LBLK_64, except that for
IV_INO_LBLK_32, the inode number is hashed with SipHash-2-4 (where the
-SipHash key is derived from the master key) and added to the file
-logical block number mod 2^32 to produce a 32-bit IV.
+SipHash key is derived from the master key) and added to the file data
+unit index mod 2^32 to produce a 32-bit IV.
This format is optimized for use with inline encryption hardware
compliant with the eMMC v5.2 standard, which supports only 32 IV bits
@@ -332,83 +359,174 @@ Encryption modes and usage
fscrypt allows one encryption mode to be specified for file contents
and one encryption mode to be specified for filenames. Different
directory trees are permitted to use different encryption modes.
+
+Supported modes
+---------------
+
Currently, the following pairs of encryption modes are supported:
-- AES-256-XTS for contents and AES-256-CTS-CBC for filenames
-- AES-128-CBC for contents and AES-128-CTS-CBC for filenames
+- AES-256-XTS for contents and AES-256-CBC-CTS for filenames
+- AES-256-XTS for contents and AES-256-HCTR2 for filenames
- Adiantum for both contents and filenames
-- AES-256-XTS for contents and AES-256-HCTR2 for filenames (v2 policies only)
-- SM4-XTS for contents and SM4-CTS-CBC for filenames (v2 policies only)
-
-If unsure, you should use the (AES-256-XTS, AES-256-CTS-CBC) pair.
-
-AES-128-CBC was added only for low-powered embedded devices with
-crypto accelerators such as CAAM or CESA that do not support XTS. To
-use AES-128-CBC, CONFIG_CRYPTO_ESSIV and CONFIG_CRYPTO_SHA256 (or
-another SHA-256 implementation) must be enabled so that ESSIV can be
-used.
-
-Adiantum is a (primarily) stream cipher-based mode that is fast even
-on CPUs without dedicated crypto instructions. It's also a true
-wide-block mode, unlike XTS. It can also eliminate the need to derive
-per-file encryption keys. However, it depends on the security of two
-primitives, XChaCha12 and AES-256, rather than just one. See the
-paper "Adiantum: length-preserving encryption for entry-level
-processors" (https://eprint.iacr.org/2018/720.pdf) for more details.
-To use Adiantum, CONFIG_CRYPTO_ADIANTUM must be enabled. Also, fast
-implementations of ChaCha and NHPoly1305 should be enabled, e.g.
-CONFIG_CRYPTO_CHACHA20_NEON and CONFIG_CRYPTO_NHPOLY1305_NEON for ARM.
-
-AES-256-HCTR2 is another true wide-block encryption mode that is intended for
-use on CPUs with dedicated crypto instructions. AES-256-HCTR2 has the property
-that a bitflip in the plaintext changes the entire ciphertext. This property
-makes it desirable for filename encryption since initialization vectors are
-reused within a directory. For more details on AES-256-HCTR2, see the paper
-"Length-preserving encryption with HCTR2"
-(https://eprint.iacr.org/2021/1441.pdf). To use AES-256-HCTR2,
-CONFIG_CRYPTO_HCTR2 must be enabled. Also, fast implementations of XCTR and
-POLYVAL should be enabled, e.g. CRYPTO_POLYVAL_ARM64_CE and
-CRYPTO_AES_ARM64_CE_BLK for ARM64.
-
-SM4 is a Chinese block cipher that is an alternative to AES. It has
-not seen as much security review as AES, and it only has a 128-bit key
-size. It may be useful in cases where its use is mandated.
-Otherwise, it should not be used. For SM4 support to be available, it
-also needs to be enabled in the kernel crypto API.
-
-New encryption modes can be added relatively easily, without changes
-to individual filesystems. However, authenticated encryption (AE)
-modes are not currently supported because of the difficulty of dealing
-with ciphertext expansion.
+- AES-128-CBC-ESSIV for contents and AES-128-CBC-CTS for filenames
+- SM4-XTS for contents and SM4-CBC-CTS for filenames
+
+Note: in the API, "CBC" means CBC-ESSIV, and "CTS" means CBC-CTS.
+So, for example, FSCRYPT_MODE_AES_256_CTS means AES-256-CBC-CTS.
+
+Authenticated encryption modes are not currently supported because of
+the difficulty of dealing with ciphertext expansion. Therefore,
+contents encryption uses a block cipher in `XTS mode
+<https://en.wikipedia.org/wiki/Disk_encryption_theory#XTS>`_ or
+`CBC-ESSIV mode
+<https://en.wikipedia.org/wiki/Disk_encryption_theory#Encrypted_salt-sector_initialization_vector_(ESSIV)>`_,
+or a wide-block cipher. Filenames encryption uses a
+block cipher in `CBC-CTS mode
+<https://en.wikipedia.org/wiki/Ciphertext_stealing>`_ or a wide-block
+cipher.
+
+The (AES-256-XTS, AES-256-CBC-CTS) pair is the recommended default.
+It is also the only option that is *guaranteed* to always be supported
+if the kernel supports fscrypt at all; see `Kernel config options`_.
+
+The (AES-256-XTS, AES-256-HCTR2) pair is also a good choice that
+upgrades the filenames encryption to use a wide-block cipher. (A
+*wide-block cipher*, also called a tweakable super-pseudorandom
+permutation, has the property that changing one bit scrambles the
+entire result.) As described in `Filenames encryption`_, a wide-block
+cipher is the ideal mode for the problem domain, though CBC-CTS is the
+"least bad" choice among the alternatives. For more information about
+HCTR2, see `the HCTR2 paper <https://eprint.iacr.org/2021/1441.pdf>`_.
+
+Adiantum is recommended on systems where AES is too slow due to lack
+of hardware acceleration for AES. Adiantum is a wide-block cipher
+that uses XChaCha12 and AES-256 as its underlying components. Most of
+the work is done by XChaCha12, which is much faster than AES when AES
+acceleration is unavailable. For more information about Adiantum, see
+`the Adiantum paper <https://eprint.iacr.org/2018/720.pdf>`_.
+
+The (AES-128-CBC-ESSIV, AES-128-CBC-CTS) pair was added to try to
+provide a more efficient option for systems that lack AES instructions
+in the CPU but do have a non-inline crypto engine such as CAAM or CESA
+that supports AES-CBC (and not AES-XTS). This is deprecated. It has
+been shown that just doing AES on the CPU is actually faster.
+Moreover, Adiantum is faster still and is recommended on such systems.
+
+The remaining mode pairs are the "national pride ciphers":
+
+- (SM4-XTS, SM4-CBC-CTS)
+
+Generally speaking, these ciphers aren't "bad" per se, but they
+receive limited security review compared to the usual choices such as
+AES and ChaCha. They also don't bring much new to the table. It is
+suggested to only use these ciphers where their use is mandated.
+
+Kernel config options
+---------------------
+
+Enabling fscrypt support (CONFIG_FS_ENCRYPTION) automatically pulls in
+only the basic support from the crypto API needed to use AES-256-XTS
+and AES-256-CBC-CTS encryption. For optimal performance, it is
+strongly recommended to also enable any available platform-specific
+kconfig options that provide acceleration for the algorithm(s) you
+wish to use. Support for any "non-default" encryption modes typically
+requires extra kconfig options as well.
+
+Below, some relevant options are listed by encryption mode. Note,
+acceleration options not listed below may be available for your
+platform; refer to the kconfig menus. File contents encryption can
+also be configured to use inline encryption hardware instead of the
+kernel crypto API (see `Inline encryption support`_); in that case,
+the file contents mode doesn't need to be supported in the kernel crypto
+API, but the filenames mode still does.
+
+- AES-256-XTS and AES-256-CBC-CTS
+ - Recommended:
+ - arm64: CONFIG_CRYPTO_AES_ARM64_CE_BLK
+ - x86: CONFIG_CRYPTO_AES_NI_INTEL
+
+- AES-256-HCTR2
+ - Mandatory:
+ - CONFIG_CRYPTO_HCTR2
+ - Recommended:
+ - arm64: CONFIG_CRYPTO_AES_ARM64_CE_BLK
+ - x86: CONFIG_CRYPTO_AES_NI_INTEL
+
+- Adiantum
+ - Mandatory:
+ - CONFIG_CRYPTO_ADIANTUM
+ - Recommended:
+ - arm32: CONFIG_CRYPTO_NHPOLY1305_NEON
+ - arm64: CONFIG_CRYPTO_NHPOLY1305_NEON
+ - x86: CONFIG_CRYPTO_NHPOLY1305_SSE2
+ - x86: CONFIG_CRYPTO_NHPOLY1305_AVX2
+
+- AES-128-CBC-ESSIV and AES-128-CBC-CTS:
+ - Mandatory:
+ - CONFIG_CRYPTO_ESSIV
+ - CONFIG_CRYPTO_SHA256 or another SHA-256 implementation
+ - Recommended:
+ - AES-CBC acceleration
Contents encryption
-------------------
-For file contents, each filesystem block is encrypted independently.
-Starting from Linux kernel 5.5, encryption of filesystems with block
-size less than system's page size is supported.
-
-Each block's IV is set to the logical block number within the file as
-a little endian number, except that:
-
-- With CBC mode encryption, ESSIV is also used. Specifically, each IV
- is encrypted with AES-256 where the AES-256 key is the SHA-256 hash
- of the file's data encryption key.
-
-- With `DIRECT_KEY policies`_, the file's nonce is appended to the IV.
- Currently this is only allowed with the Adiantum encryption mode.
-
-- With `IV_INO_LBLK_64 policies`_, the logical block number is limited
- to 32 bits and is placed in bits 0-31 of the IV. The inode number
- (which is also limited to 32 bits) is placed in bits 32-63.
-
-- With `IV_INO_LBLK_32 policies`_, the logical block number is limited
- to 32 bits and is placed in bits 0-31 of the IV. The inode number
- is then hashed and added mod 2^32.
-
-Note that because file logical block numbers are included in the IVs,
-filesystems must enforce that blocks are never shifted around within
-encrypted files, e.g. via "collapse range" or "insert range".
+For contents encryption, each file's contents is divided into "data
+units". Each data unit is encrypted independently. The IV for each
+data unit incorporates the zero-based index of the data unit within
+the file. This ensures that each data unit within a file is encrypted
+differently, which is essential to prevent leaking information.
+
+Note: because the encryption depends on the offset into the file,
+operations like "collapse range" and "insert range" that rearrange the
+extent mapping of files are not supported on encrypted files.
+
+There are two cases for the sizes of the data units:
+
+* Fixed-size data units. This is how all filesystems other than UBIFS
+ work. A file's data units are all the same size; the last data unit
+ is zero-padded if needed. By default, the data unit size is equal
+ to the filesystem block size. On some filesystems, users can select
+ a sub-block data unit size via the ``log2_data_unit_size`` field of
+ the encryption policy; see `FS_IOC_SET_ENCRYPTION_POLICY`_.
+
+* Variable-size data units. This is what UBIFS does. Each "UBIFS
+ data node" is treated as a crypto data unit. Each contains variable
+ length, possibly compressed data, zero-padded to the next 16-byte
+ boundary. Users cannot select a sub-block data unit size on UBIFS.
+
+In the case of compression + encryption, the compressed data is
+encrypted. UBIFS compression works as described above. f2fs
+compression works a bit differently; it compresses a number of
+filesystem blocks into a smaller number of filesystem blocks.
+Therefore an f2fs-compressed file still uses fixed-size data units, and
+it is encrypted in a similar way to a file containing holes.
+
+As mentioned in `Key hierarchy`_, the default encryption setting uses
+per-file keys. In this case, the IV for each data unit is simply the
+index of the data unit in the file. However, users can select an
+encryption setting that does not use per-file keys. For these, some
+kind of file identifier is incorporated into the IVs as follows:
+
+- With `DIRECT_KEY policies`_, the data unit index is placed in bits
+ 0-63 of the IV, and the file's nonce is placed in bits 64-191.
+
+- With `IV_INO_LBLK_64 policies`_, the data unit index is placed in
+ bits 0-31 of the IV, and the file's inode number is placed in bits
+ 32-63. This setting is only allowed when data unit indices and
+ inode numbers fit in 32 bits.
+
+- With `IV_INO_LBLK_32 policies`_, the file's inode number is hashed
+ and added to the data unit index. The resulting value is truncated
+ to 32 bits and placed in bits 0-31 of the IV. This setting is only
+ allowed when data unit indices and inode numbers fit in 32 bits.
+
+The byte order of the IV is always little endian.
+
+If the user selects FSCRYPT_MODE_AES_128_CBC for the contents mode, an
+ESSIV layer is automatically included. In this case, before the IV is
+passed to AES-128-CBC, it is encrypted with AES-256 where the AES-256
+key is the SHA-256 hash of the file's contents encryption key.
Filenames encryption
--------------------
@@ -423,7 +541,7 @@ alternatively has the file's nonce (for `DIRECT_KEY policies`_) or
inode number (for `IV_INO_LBLK_64 policies`_) included in the IVs.
Thus, IV reuse is limited to within a single directory.
-With CTS-CBC, the IV reuse means that when the plaintext filenames share a
+With CBC-CTS, the IV reuse means that when the plaintext filenames share a
common prefix at least as long as the cipher block size (16 bytes for AES), the
corresponding encrypted filenames will also share a common prefix. This is
undesirable. Adiantum and HCTR2 do not have this weakness, as they are
@@ -477,7 +595,8 @@ follows::
__u8 contents_encryption_mode;
__u8 filenames_encryption_mode;
__u8 flags;
- __u8 __reserved[4];
+ __u8 log2_data_unit_size;
+ __u8 __reserved[3];
__u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE];
};
@@ -493,7 +612,14 @@ This structure must be initialized as follows:
be set to constants from ``<linux/fscrypt.h>`` which identify the
encryption modes to use. If unsure, use FSCRYPT_MODE_AES_256_XTS
(1) for ``contents_encryption_mode`` and FSCRYPT_MODE_AES_256_CTS
- (4) for ``filenames_encryption_mode``.
+ (4) for ``filenames_encryption_mode``. For details, see `Encryption
+ modes and usage`_.
+
+ v1 encryption policies only support three combinations of modes:
+ (FSCRYPT_MODE_AES_256_XTS, FSCRYPT_MODE_AES_256_CTS),
+ (FSCRYPT_MODE_AES_128_CBC, FSCRYPT_MODE_AES_128_CTS), and
+ (FSCRYPT_MODE_ADIANTUM, FSCRYPT_MODE_ADIANTUM). v2 policies support
+ all combinations documented in `Supported modes`_.
- ``flags`` contains optional flags from ``<linux/fscrypt.h>``:
@@ -512,6 +638,29 @@ This structure must be initialized as follows:
The DIRECT_KEY, IV_INO_LBLK_64, and IV_INO_LBLK_32 flags are
mutually exclusive.
+- ``log2_data_unit_size`` is the log2 of the data unit size in bytes,
+ or 0 to select the default data unit size. The data unit size is
+ the granularity of file contents encryption. For example, setting
+ ``log2_data_unit_size`` to 12 causes file contents to be passed to the
+ underlying encryption algorithm (such as AES-256-XTS) in 4096-byte
+ data units, each with its own IV.
+
+ Not all filesystems support setting ``log2_data_unit_size``. ext4
+ and f2fs support it since Linux v6.7. On filesystems that support
+ it, the supported nonzero values are 9 through the log2 of the
+ filesystem block size, inclusive. The default value of 0 selects
+ the filesystem block size.
+
+ The main use case for ``log2_data_unit_size`` is for selecting a
+ data unit size smaller than the filesystem block size for
+ compatibility with inline encryption hardware that only supports
+ smaller data unit sizes. ``/sys/block/$disk/queue/crypto/`` may be
+ useful for checking which data unit sizes are supported by a
+ particular system's inline encryption hardware.
+
+ Leave this field zeroed unless you are certain you need it. Using
+ an unnecessarily small data unit size reduces performance.
+
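+  As a minimal sketch (illustrative; only the UAPI names are real),
+  selecting 512-byte data units when setting a v2 policy could look
+  like this::
+
+      #include <string.h>
+      #include <sys/ioctl.h>
+      #include <linux/fscrypt.h>
+
+      int set_policy_512(int dirfd,
+                         const __u8 ident[FSCRYPT_KEY_IDENTIFIER_SIZE])
+      {
+          struct fscrypt_policy_v2 policy = {
+              .version = FSCRYPT_POLICY_V2,
+              .contents_encryption_mode = FSCRYPT_MODE_AES_256_XTS,
+              .filenames_encryption_mode = FSCRYPT_MODE_AES_256_CTS,
+              .flags = FSCRYPT_POLICY_FLAGS_PAD_32,
+              .log2_data_unit_size = 9,  /* 2^9 = 512-byte data units */
+          };
+
+          memcpy(policy.master_key_identifier, ident,
+                 FSCRYPT_KEY_IDENTIFIER_SIZE);
+          return ioctl(dirfd, FS_IOC_SET_ENCRYPTION_POLICY, &policy);
+      }
+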
- For v2 encryption policies, ``__reserved`` must be zeroed.
- For v1 encryption policies, ``master_key_descriptor`` specifies how
@@ -704,7 +853,9 @@ a pointer to struct fscrypt_add_key_arg, defined as follows::
struct fscrypt_key_specifier key_spec;
__u32 raw_size;
__u32 key_id;
- __u32 __reserved[8];
+ #define FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED 0x00000001
+ __u32 flags;
+ __u32 __reserved[7];
__u8 raw[];
};
@@ -723,7 +874,7 @@ a pointer to struct fscrypt_add_key_arg, defined as follows::
struct fscrypt_provisioning_key_payload {
__u32 type;
- __u32 __reserved;
+ __u32 flags;
__u8 raw[];
};
@@ -751,24 +902,32 @@ as follows:
Alternatively, if ``key_id`` is nonzero, this field must be 0, since
in that case the size is implied by the specified Linux keyring key.
-- ``key_id`` is 0 if the raw key is given directly in the ``raw``
- field. Otherwise ``key_id`` is the ID of a Linux keyring key of
- type "fscrypt-provisioning" whose payload is
- struct fscrypt_provisioning_key_payload whose ``raw`` field contains
- the raw key and whose ``type`` field matches ``key_spec.type``.
- Since ``raw`` is variable-length, the total size of this key's
- payload must be ``sizeof(struct fscrypt_provisioning_key_payload)``
- plus the raw key size. The process must have Search permission on
- this key.
-
- Most users should leave this 0 and specify the raw key directly.
- The support for specifying a Linux keyring key is intended mainly to
+- ``key_id`` is 0 if the key is given directly in the ``raw`` field.
+ Otherwise ``key_id`` is the ID of a Linux keyring key of type
+ "fscrypt-provisioning" whose payload is struct
+ fscrypt_provisioning_key_payload whose ``raw`` field contains the
+ key, whose ``type`` field matches ``key_spec.type``, and whose
+ ``flags`` field matches ``flags``. Since ``raw`` is
+ variable-length, the total size of this key's payload must be
+ ``sizeof(struct fscrypt_provisioning_key_payload)`` plus the number
+ of key bytes. The process must have Search permission on this key.
+
+ Most users should leave this 0 and specify the key directly. The
+ support for specifying a Linux keyring key is intended mainly to
allow re-adding keys after a filesystem is unmounted and re-mounted,
- without having to store the raw keys in userspace memory.
+ without having to store the keys in userspace memory.
+
+- ``flags`` contains optional flags from ``<linux/fscrypt.h>``:
+
+ - FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED: This denotes that the key is a
+ hardware-wrapped key. See `Hardware-wrapped keys`_. This flag
+ can't be used if FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR is used.
- ``raw`` is a variable-length field which must contain the actual
key, ``raw_size`` bytes long. Alternatively, if ``key_id`` is
- nonzero, then this field is unused.
+ nonzero, then this field is unused. Note that despite being named
+ ``raw``, if FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED is specified then it
+ will contain a wrapped key, not a raw key.
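+
+A minimal sketch (illustrative; only the UAPI names are real) of
+adding a key directly via the ``raw`` field::
+
+    #include <stdlib.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <linux/fscrypt.h>
+
+    /* Add a key to the filesystem containing fd; on success the
+     * kernel fills in the computed key identifier. */
+    int add_key(int fd, const __u8 *key, __u32 key_size,
+                __u8 ident_out[FSCRYPT_KEY_IDENTIFIER_SIZE])
+    {
+        struct fscrypt_add_key_arg *arg;
+        int ret;
+
+        arg = calloc(1, sizeof(*arg) + key_size);
+        if (!arg)
+            return -1;
+        arg->key_spec.type = FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER;
+        arg->raw_size = key_size;
+        memcpy(arg->raw, key, key_size);
+
+        ret = ioctl(fd, FS_IOC_ADD_ENCRYPTION_KEY, arg);
+        if (ret == 0)
+            memcpy(ident_out, arg->key_spec.u.identifier,
+                   FSCRYPT_KEY_IDENTIFIER_SIZE);
+        free(arg);
+        return ret;
+    }
+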
For v2 policy keys, the kernel keeps track of which user (identified
by effective user ID) added the key, and only allows the key to be
@@ -780,8 +939,8 @@ prevent that other user from unexpectedly removing it. Therefore,
FS_IOC_ADD_ENCRYPTION_KEY may also be used to add a v2 policy key
*again*, even if it's already added by other user(s). In this case,
FS_IOC_ADD_ENCRYPTION_KEY will just install a claim to the key for the
-current user, rather than actually add the key again (but the raw key
-must still be provided, as a proof of knowledge).
+current user, rather than actually add the key again (but the key must
+still be provided, as a proof of knowledge).
FS_IOC_ADD_ENCRYPTION_KEY returns 0 if either the key or a claim to
the key was added or already exists.
@@ -790,20 +949,23 @@ FS_IOC_ADD_ENCRYPTION_KEY can fail with the following errors:
- ``EACCES``: FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR was specified, but the
caller does not have the CAP_SYS_ADMIN capability in the initial
- user namespace; or the raw key was specified by Linux key ID but the
+ user namespace; or the key was specified by Linux key ID but the
process lacks Search permission on the key.
+- ``EBADMSG``: invalid hardware-wrapped key
- ``EDQUOT``: the key quota for this user would be exceeded by adding
the key
- ``EINVAL``: invalid key size or key specifier type, or reserved bits
were set
-- ``EKEYREJECTED``: the raw key was specified by Linux key ID, but the
- key has the wrong type
-- ``ENOKEY``: the raw key was specified by Linux key ID, but no key
- exists with that ID
+- ``EKEYREJECTED``: the key was specified by Linux key ID, but the key
+ has the wrong type
+- ``ENOKEY``: the key was specified by Linux key ID, but no key exists
+ with that ID
- ``ENOTTY``: this type of filesystem does not implement encryption
- ``EOPNOTSUPP``: the kernel was not configured with encryption
support for this filesystem, or the filesystem superblock has not
- had encryption enabled on it
+ had encryption enabled on it; or a hardware-wrapped key was
+ specified but the filesystem does not support inline encryption or
+ the hardware does not support hardware-wrapped keys
Legacy method
~~~~~~~~~~~~~
@@ -866,9 +1028,8 @@ or removed by non-root users.
These ioctls don't work on keys that were added via the legacy
process-subscribed keyrings mechanism.
-Before using these ioctls, read the `Kernel memory compromise`_
-section for a discussion of the security goals and limitations of
-these ioctls.
+Before using these ioctls, read the `Online attacks`_ section for a
+discussion of the security goals and limitations of these ioctls.
FS_IOC_REMOVE_ENCRYPTION_KEY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1005,8 +1166,8 @@ The caller must zero all input fields, then fill in ``key_spec``:
On success, 0 is returned and the kernel fills in the output fields:
- ``status`` indicates whether the key is absent, present, or
- incompletely removed. Incompletely removed means that the master
- secret has been removed, but some files are still in use; i.e.,
+ incompletely removed. Incompletely removed means that removal has
+ been initiated, but some files are still in use; i.e.,
`FS_IOC_REMOVE_ENCRYPTION_KEY`_ returned 0 but set the informational
status flag FSCRYPT_KEY_REMOVAL_STATUS_FLAG_FILES_BUSY.
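+
+A minimal sketch (illustrative; only the UAPI names are real) of
+querying a key's status::
+
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <linux/fscrypt.h>
+
+    /* Returns FSCRYPT_KEY_STATUS_ABSENT, _PRESENT, or
+     * _INCOMPLETELY_REMOVED; -1 on error. */
+    int key_status(int fd, const __u8 ident[FSCRYPT_KEY_IDENTIFIER_SIZE])
+    {
+        struct fscrypt_get_key_status_arg arg;
+
+        memset(&arg, 0, sizeof(arg));  /* input fields must be zeroed */
+        arg.key_spec.type = FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER;
+        memcpy(arg.key_spec.u.identifier, ident,
+               FSCRYPT_KEY_IDENTIFIER_SIZE);
+        if (ioctl(fd, FS_IOC_GET_ENCRYPTION_KEY_STATUS, &arg) != 0)
+            return -1;
+        return arg.status;
+    }
+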
@@ -1157,22 +1318,13 @@ this by validating all top-level encryption policies prior to access.
Inline encryption support
=========================
-By default, fscrypt uses the kernel crypto API for all cryptographic
-operations (other than HKDF, which fscrypt partially implements
-itself). The kernel crypto API supports hardware crypto accelerators,
-but only ones that work in the traditional way where all inputs and
-outputs (e.g. plaintexts and ciphertexts) are in memory. fscrypt can
-take advantage of such hardware, but the traditional acceleration
-model isn't particularly efficient and fscrypt hasn't been optimized
-for it.
-
-Instead, many newer systems (especially mobile SoCs) have *inline
-encryption hardware* that can encrypt/decrypt data while it is on its
-way to/from the storage device. Linux supports inline encryption
-through a set of extensions to the block layer called *blk-crypto*.
-blk-crypto allows filesystems to attach encryption contexts to bios
-(I/O requests) to specify how the data will be encrypted or decrypted
-in-line. For more information about blk-crypto, see
+Many newer systems (especially mobile SoCs) have *inline encryption
+hardware* that can encrypt/decrypt data while it is on its way to/from
+the storage device. Linux supports inline encryption through a set of
+extensions to the block layer called *blk-crypto*. blk-crypto allows
+filesystems to attach encryption contexts to bios (I/O requests) to
+specify how the data will be encrypted or decrypted in-line. For more
+information about blk-crypto, see
:ref:`Documentation/block/inline-encryption.rst <inline_encryption>`.
On supported filesystems (currently ext4 and f2fs), fscrypt can use
@@ -1188,7 +1340,8 @@ inline encryption hardware doesn't have the needed crypto capabilities
(e.g. support for the needed encryption algorithm and data unit size)
and where blk-crypto-fallback is unusable. (For blk-crypto-fallback
to be usable, it must be enabled in the kernel configuration with
-CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y.)
+CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y, and the file must be
+protected by a raw key rather than a hardware-wrapped key.)
Currently fscrypt always uses the filesystem block size (which is
usually 4096 bytes) as the data unit size. Therefore, it can only use
@@ -1196,7 +1349,76 @@ inline encryption hardware that supports that data unit size.
Inline encryption doesn't affect the ciphertext or other aspects of
the on-disk format, so users may freely switch back and forth between
-using "inlinecrypt" and not using "inlinecrypt".
+using "inlinecrypt" and not using "inlinecrypt". An exception is that
+files that are protected by a hardware-wrapped key can only be
+encrypted/decrypted by the inline encryption hardware and therefore
+can only be accessed when the "inlinecrypt" mount option is used. For
+more information about hardware-wrapped keys, see below.
+
+Hardware-wrapped keys
+---------------------
+
+fscrypt supports using *hardware-wrapped keys* when the inline
+encryption hardware supports it. Such keys are only present in kernel
+memory in wrapped (encrypted) form; they can only be unwrapped
+(decrypted) by the inline encryption hardware and are temporally bound
+to the current boot. This prevents the keys from being compromised if
+kernel memory is leaked. This is done without limiting the number of
+keys that can be used and while still allowing the execution of
+cryptographic tasks that are tied to the same key but can't use inline
+encryption hardware, e.g. filenames encryption.
+
+Note that hardware-wrapped keys aren't specific to fscrypt; they are a
+block layer feature (part of *blk-crypto*). For more details about
+hardware-wrapped keys, see the block layer documentation at
+:ref:`Documentation/block/inline-encryption.rst
+<hardware_wrapped_keys>`. The rest of this section just focuses on
+the details of how fscrypt can use hardware-wrapped keys.
+
+fscrypt supports hardware-wrapped keys by allowing the fscrypt master
+keys to be hardware-wrapped keys as an alternative to raw keys. To
+add a hardware-wrapped key with `FS_IOC_ADD_ENCRYPTION_KEY`_,
+userspace must specify FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED in the
+``flags`` field of struct fscrypt_add_key_arg and also in the
+``flags`` field of struct fscrypt_provisioning_key_payload when
+applicable. The key must be in ephemerally-wrapped form, not
+long-term wrapped form.
+
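+A sketch (illustrative; only the UAPI names are real) of adding an
+ephemerally-wrapped key blob::
+
+    #include <stdlib.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <linux/fscrypt.h>
+
+    int add_hw_wrapped_key(int fd, const __u8 *blob, __u32 blob_size)
+    {
+        struct fscrypt_add_key_arg *arg;
+        int ret;
+
+        arg = calloc(1, sizeof(*arg) + blob_size);
+        if (!arg)
+            return -1;
+        arg->key_spec.type = FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER;
+        arg->raw_size = blob_size;          /* size of the wrapped blob */
+        arg->flags = FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED;
+        memcpy(arg->raw, blob, blob_size);  /* wrapped key, not raw */
+        ret = ioctl(fd, FS_IOC_ADD_ENCRYPTION_KEY, arg);
+        free(arg);
+        return ret;
+    }
+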
+Some limitations apply. First, files protected by a hardware-wrapped
+key are tied to the system's inline encryption hardware. Therefore
+they can only be accessed when the "inlinecrypt" mount option is used,
+and they can't be included in portable filesystem images. Second,
+currently the hardware-wrapped key support is only compatible with
+`IV_INO_LBLK_64 policies`_ and `IV_INO_LBLK_32 policies`_, as it
+assumes that there is just one file contents encryption key per
+fscrypt master key rather than one per file. Future work may address
+this limitation by passing per-file nonces down the storage stack to
+allow the hardware to derive per-file keys.
+
+Implementation-wise, to encrypt/decrypt the contents of files that are
+protected by a hardware-wrapped key, fscrypt uses blk-crypto,
+attaching the hardware-wrapped key to the bio crypt contexts. As is
+the case with raw keys, the block layer will program the key into a
+keyslot when it isn't already in one. However, when programming a
+hardware-wrapped key, the hardware doesn't program the given key
+directly into a keyslot but rather unwraps it (using the hardware's
+ephemeral wrapping key) and derives the inline encryption key from it.
+The inline encryption key is the key that actually gets programmed
+into a keyslot, and it is never exposed to software.
+
+However, fscrypt doesn't just do file contents encryption; it also
+uses its master keys to derive filenames encryption keys, key
+identifiers, and sometimes some more obscure types of subkeys such as
+dirhash keys. So even with file contents encryption out of the
+picture, fscrypt still needs a raw key to work with. To get such a
+key from a hardware-wrapped key, fscrypt asks the inline encryption
+hardware to derive a cryptographically isolated "software secret" from
+the hardware-wrapped key. fscrypt uses this "software secret" to key
+its KDF to derive all subkeys other than file contents keys.
+
+Note that this implies that the hardware-wrapped key feature only
+protects the file contents encryption keys. It doesn't protect other
+fscrypt subkeys such as filenames encryption keys.
Direct I/O support
==================
@@ -1253,7 +1475,8 @@ directory.) These structs are defined as follows::
u8 contents_encryption_mode;
u8 filenames_encryption_mode;
u8 flags;
- u8 __reserved[4];
+ u8 log2_data_unit_size;
+ u8 __reserved[3];
u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE];
u8 nonce[FSCRYPT_FILE_NONCE_SIZE];
};
@@ -1280,7 +1503,7 @@ read the ciphertext into the page cache and decrypt it in-place. The
folio lock must be held until decryption has finished, to prevent the
folio from becoming visible to userspace prematurely.
-For the write path (->writepage()) of regular files, filesystems
+For the write path (->writepages()) of regular files, filesystems
cannot encrypt data in-place in the page cache, since the cached
plaintext must be preserved. Instead, filesystems must encrypt into a
temporary buffer or "bounce page", then write out the temporary
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index cb845e8e5435..412cf11e3298 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -16,7 +16,7 @@ btrfs filesystems. Like fscrypt, not too much filesystem-specific
code is needed to support fs-verity.
fs-verity is similar to `dm-verity
-<https://www.kernel.org/doc/Documentation/device-mapper/verity.txt>`_
+<https://www.kernel.org/doc/Documentation/admin-guide/device-mapper/verity.rst>`_
but works on files rather than block devices. On regular files on
filesystems supporting fs-verity, userspace can execute an ioctl that
causes the filesystem to build a Merkle tree for the file and persist
@@ -86,6 +86,16 @@ authenticating fs-verity file hashes include:
signature in their "security.ima" extended attribute, as controlled
by the IMA policy. For more information, see the IMA documentation.
+- Integrity Policy Enforcement (IPE). IPE supports enforcing access
+ control decisions based on immutable security properties of files,
+ including those protected by fs-verity's built-in signatures.
+ "IPE policy" specifically allows for the authorization of fs-verity
+ files using properties ``fsverity_digest`` for identifying
+ files by their verity digest, and ``fsverity_signature`` to authorize
+ files with a verified fs-verity built-in signature. For
+ details on configuring IPE policies and understanding its operational
+ modes, please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>`.
+
- Trusted userspace code in combination with `Built-in signature
verification`_. This approach should be used only with great care.
@@ -175,8 +185,7 @@ FS_IOC_ENABLE_VERITY can fail with the following errors:
- ``ENOKEY``: the ".fs-verity" keyring doesn't contain the certificate
needed to verify the builtin signature
- ``ENOPKG``: fs-verity recognizes the hash algorithm, but it's not
- available in the kernel's crypto API as currently configured (e.g.
- for SHA-512, missing CONFIG_CRYPTO_SHA512).
+ available in the kernel as currently configured
- ``ENOTTY``: this type of filesystem does not implement fs-verity
- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
support; or the filesystem superblock has not had the 'verity'
@@ -238,11 +247,17 @@ FS_IOC_READ_VERITY_METADATA
The FS_IOC_READ_VERITY_METADATA ioctl reads verity metadata from a
verity file. This ioctl is available since Linux v5.12.
-This ioctl allows writing a server program that takes a verity file
-and serves it to a client program, such that the client can do its own
-fs-verity compatible verification of the file. This only makes sense
-if the client doesn't trust the server and if the server needs to
-provide the storage for the client.
+This ioctl is useful for cases where the verity verification should be
+performed somewhere other than the currently running kernel.
+
+One example is a server program that takes a verity file and serves it
+to a client program, such that the client can do its own fs-verity
+compatible verification of the file. This only makes sense if the
+client doesn't trust the server and if the server needs to provide the
+storage for the client.
+
+Another example is copying verity metadata when creating filesystem
+images in userspace (such as with ``mkfs.ext4 -d``).
This is a fairly specialized use case, and most fs-verity users won't
need this ioctl.
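+
+A minimal sketch (illustrative; only the UAPI names are real) of
+reading part of a file's Merkle tree::
+
+    #include <stdint.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <linux/fsverity.h>
+
+    /* Returns the number of bytes read, or -1 on error. */
+    int read_merkle_tree(int fd, void *buf, uint64_t offset, uint64_t length)
+    {
+        struct fsverity_read_metadata_arg arg;
+
+        memset(&arg, 0, sizeof(arg));
+        arg.metadata_type = FS_VERITY_METADATA_TYPE_MERKLE_TREE;
+        arg.offset = offset;
+        arg.length = length;
+        arg.buf_ptr = (uintptr_t)buf;
+        return ioctl(fd, FS_IOC_READ_VERITY_METADATA, &arg);
+    }
+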
@@ -326,6 +341,8 @@ the file has fs-verity enabled. This can perform better than
FS_IOC_GETFLAGS and FS_IOC_MEASURE_VERITY because it doesn't require
opening the file, and opening verity files can be expensive.
+.. _accessing_verity_files:
+
Accessing verity files
======================
@@ -455,7 +472,11 @@ Enabling this option adds the following:
On success, the ioctl persists the signature alongside the Merkle
tree. Then, any time the file is opened, the kernel verifies the
file's actual digest against this signature, using the certificates
- in the ".fs-verity" keyring.
+ in the ".fs-verity" keyring. This verification happens as long as the
+ file's signature exists, regardless of the state of the sysctl variable
+ "fs.verity.require_signatures" described in the next item. The IPE LSM
+ relies on this behavior to recognize and label fsverity files
+ that contain a verified built-in fsverity signature.
3. A new sysctl "fs.verity.require_signatures" is made available.
When set to 1, the kernel requires that all verity files have a
@@ -479,7 +500,7 @@ be carefully considered before using them:
- Builtin signature verification does *not* make the kernel enforce
that any files actually have fs-verity enabled. Thus, it is not a
- complete authentication policy. Currently, if it is used, the only
+ complete authentication policy. Currently, if it is used, one
way to complete the authentication policy is for trusted userspace
code to explicitly check whether files have fs-verity enabled with a
signature before they are accessed. (With
@@ -488,6 +509,15 @@ be carefully considered before using them:
could just store the signature alongside the file and verify it
itself using a cryptographic library, instead of using this feature.
+- Another approach is to utilize fs-verity builtin signature
+ verification in conjunction with the IPE LSM, which supports defining
+ a kernel-enforced, system-wide authentication policy that allows only
+ files with a verified fs-verity builtin signature to perform certain
+ operations, such as execution. Note that IPE doesn't require
+ fs.verity.require_signatures=1.
+ Please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>` for
+ more details.
+
- A file's builtin signature can only be set at the same time that
fs-verity is being enabled on the file. Changing or deleting the
builtin signature later requires re-creating the file.
diff --git a/Documentation/filesystems/fuse/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst
new file mode 100644
index 000000000000..d73dd0dbd238
--- /dev/null
+++ b/Documentation/filesystems/fuse/fuse-io-uring.rst
@@ -0,0 +1,99 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+FUSE-over-io-uring design documentation
+=======================================
+
+This documentation covers the basic details of how fuse
+kernel/userspace communication through io-uring is configured
+and works. For generic details about FUSE see fuse.rst.
+
+This document also covers the current interface, which is
+still in development and might change.
+
+Limitations
+===========
+As of now, not all request types are supported through io-uring;
+userspace is also required to handle requests through /dev/fuse after
+io-uring setup is complete. Specifically, notifications (initiated
+from the daemon side) and interrupts are not supported through
+io-uring.
+
+Fuse io-uring configuration
+===========================
+
+Fuse kernel requests are queued through the classical /dev/fuse
+read/write interface until io-uring setup is complete.
+
+In order to set up fuse-over-io-uring, the fuse server (userspace)
+needs to submit SQEs (opcode = IORING_OP_URING_CMD) to the /dev/fuse
+connection file descriptor. The initial submission uses the sub-command
+FUSE_URING_CMD_REGISTER, which just registers entries to be
+available in the kernel.
+
+Once at least one entry per queue is submitted, the kernel starts
+to enqueue requests to the ring queues.
+Note that every CPU core has its own fuse-io-uring queue.
+Userspace handles the CQE/fuse-request and submits the result with the
+sub-command FUSE_URING_CMD_COMMIT_AND_FETCH - the kernel completes
+the request and also marks the entry as available again. If there are
+pending requests waiting, the next request is immediately submitted
+to the daemon.
+
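+The registration submission could look roughly like the sketch below
+(illustrative only: the interface is still in development, the
+sub-command name follows this document, and the payload layout in the
+extended SQE area is omitted)::
+
+    #include <liburing.h>
+    #include <string.h>
+
+    static int queue_register_sqe(struct io_uring *ring, int fuse_dev_fd)
+    {
+        /* the ring must be set up with 128-byte SQEs for uring_cmd */
+        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
+
+        if (!sqe)
+            return -1;
+        memset(sqe, 0, sizeof(*sqe));
+        sqe->opcode = IORING_OP_URING_CMD;
+        sqe->fd = fuse_dev_fd;                  /* /dev/fuse connection */
+        sqe->cmd_op = FUSE_URING_CMD_REGISTER;  /* register one entry */
+        /* per-queue registration payload goes into sqe->cmd (omitted) */
+        return io_uring_submit(ring);
+    }
+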
+Initial SQE
+-----------::
+
+ | | FUSE filesystem daemon
+ | |
+ | | >io_uring_submit()
+ | | IORING_OP_URING_CMD /
+ | | FUSE_URING_CMD_REGISTER
+ | | [wait cqe]
+ | | >io_uring_wait_cqe() or
+ | | >io_uring_submit_and_wait()
+ | |
+ | >fuse_uring_cmd() |
+ | >fuse_uring_register() |
+
+
+Sending requests with CQEs
+--------------------------::
+
+ | | FUSE filesystem daemon
+ | | [waiting for CQEs]
+ | "rm /mnt/fuse/file" |
+ | |
+ | >sys_unlink() |
+ | >fuse_unlink() |
+ | [allocate request] |
+ | >fuse_send_one() |
+ | ... |
+ | >fuse_uring_queue_fuse_req |
+ | [queue request on fg queue] |
+ | >fuse_uring_add_req_to_ring_ent() |
+ | ... |
+ | >fuse_uring_copy_to_ring() |
+ | >io_uring_cmd_done() |
+ | >request_wait_answer() |
+ | [sleep on req->waitq] |
+ | | [receives and handles CQE]
+ | | [submit result and fetch next]
+ | | >io_uring_submit()
+ | | IORING_OP_URING_CMD/
+ | | FUSE_URING_CMD_COMMIT_AND_FETCH
+ | >fuse_uring_cmd() |
+ | >fuse_uring_commit_fetch() |
+ | >fuse_uring_commit() |
+ | >fuse_uring_copy_from_ring() |
+ | [ copy the result to the fuse req] |
+ | >fuse_uring_req_end() |
+ | >fuse_request_end() |
+ | [wake up req->waitq] |
+ | >fuse_uring_next_fuse_req |
+ | [wait or handle next req] |
+ | |
+ | [req->waitq woken up] |
+ | <fuse_unlink() |
+ | <sys_unlink() |
+
+
+
diff --git a/Documentation/filesystems/fuse-io.rst b/Documentation/filesystems/fuse/fuse-io.rst
index 255a368fe534..d736ac4cb483 100644
--- a/Documentation/filesystems/fuse-io.rst
+++ b/Documentation/filesystems/fuse/fuse-io.rst
@@ -1,7 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
==============
-Fuse I/O Modes
+FUSE I/O Modes
==============
Fuse supports the following I/O modes:
@@ -15,7 +15,8 @@ The direct-io mode can be selected with the FOPEN_DIRECT_IO flag in the
FUSE_OPEN reply.
In direct-io mode the page cache is completely bypassed for reads and writes.
-No read-ahead takes place. Shared mmap is disabled.
+No read-ahead takes place. Shared mmap is disabled by default. To allow shared
+mmap, the FUSE_DIRECT_IO_ALLOW_MMAP flag may be enabled in the FUSE_INIT reply.
In cached mode reads may be satisfied from the page cache, and data may be
read-ahead by the kernel to fill the cache. The cache is always kept consistent
diff --git a/Documentation/filesystems/fuse/fuse-passthrough.rst b/Documentation/filesystems/fuse/fuse-passthrough.rst
new file mode 100644
index 000000000000..2b0e7c2da54a
--- /dev/null
+++ b/Documentation/filesystems/fuse/fuse-passthrough.rst
@@ -0,0 +1,133 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+FUSE Passthrough
+================
+
+Introduction
+============
+
+FUSE (Filesystem in Userspace) passthrough is a feature designed to improve the
+performance of FUSE filesystems for I/O operations. Typically, FUSE operations
+involve communication between the kernel and a userspace FUSE daemon, which can
+incur overhead. Passthrough allows certain operations on a FUSE file to bypass
+the userspace daemon and be executed directly by the kernel on an underlying
+"backing file".
+
+This is achieved by the FUSE daemon registering a file descriptor (pointing to
+the backing file on a lower filesystem) with the FUSE kernel module. The kernel
+then receives an identifier (``backing_id``) for this registered backing file.
+When a FUSE file is subsequently opened, the FUSE daemon can, in its response to
+the ``OPEN`` request, include this ``backing_id`` and set the
+``FOPEN_PASSTHROUGH`` flag. This establishes a direct link for specific
+operations.
+
+Currently, passthrough is supported for operations like ``read(2)``/``write(2)``
+(via ``read_iter``/``write_iter``), ``splice(2)``, and ``mmap(2)``.
+
+Enabling Passthrough
+====================
+
+To use FUSE passthrough:
+
+ 1. The FUSE filesystem must be compiled with ``CONFIG_FUSE_PASSTHROUGH``
+ enabled.
+ 2. The FUSE daemon, during the ``FUSE_INIT`` handshake, must negotiate the
+ ``FUSE_PASSTHROUGH`` capability and specify its desired
+ ``max_stack_depth``.
+ 3. The (privileged) FUSE daemon uses the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl
+ on its connection file descriptor (e.g., ``/dev/fuse``) to register a
+ backing file descriptor and obtain a ``backing_id``.
+ 4. When handling an ``OPEN`` or ``CREATE`` request for a FUSE file, the daemon
+ replies with the ``FOPEN_PASSTHROUGH`` flag set in
+ ``fuse_open_out::open_flags`` and provides the corresponding ``backing_id``
+ in ``fuse_open_out::backing_id``.
+ 5. The FUSE daemon should eventually call ``FUSE_DEV_IOC_BACKING_CLOSE`` with
+ the ``backing_id`` to release the kernel's reference to the backing file
+ when it's no longer needed for passthrough setups.
+
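+As a minimal sketch (illustrative; the ioctl and struct are from the
+FUSE UAPI, error handling omitted), step 3 above could look like::
+
+    #include <sys/ioctl.h>
+    #include <linux/fuse.h>
+
+    /* Returns the new backing_id on success, -1 on error. */
+    int register_backing(int fuse_dev_fd, int backing_fd)
+    {
+        struct fuse_backing_map map = { .fd = backing_fd };
+
+        return ioctl(fuse_dev_fd, FUSE_DEV_IOC_BACKING_OPEN, &map);
+    }
+
+The returned id is what the daemon later places in
+``fuse_open_out::backing_id`` together with ``FOPEN_PASSTHROUGH``.
+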
+Privilege Requirements
+======================
+
+Setting up passthrough functionality currently requires the FUSE daemon to
+possess the ``CAP_SYS_ADMIN`` capability. This requirement stems from several
+security and resource management considerations that are actively being
+discussed and worked on. The primary reasons for this restriction are detailed
+below.
+
+Resource Accounting and Visibility
+----------------------------------
+
+The core mechanism for passthrough involves the FUSE daemon opening a file
+descriptor to a backing file and registering it with the FUSE kernel module via
+the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl. This ioctl returns a ``backing_id``
+associated with a kernel-internal ``struct fuse_backing`` object, which holds a
+reference to the backing ``struct file``.
+
+A significant concern arises because the FUSE daemon can close its own file
+descriptor to the backing file after registration. The kernel, however, will
+still hold a reference to the ``struct file`` via the ``struct fuse_backing``
+object as long as it's associated with a ``backing_id`` (or subsequently, with
+an open FUSE file in passthrough mode).
+
+This behavior leads to two main issues for unprivileged FUSE daemons:
+
+ 1. **Invisibility to lsof and other inspection tools**: Once the FUSE
+ daemon closes its file descriptor, the open backing file held by the kernel
+ becomes "hidden." Standard tools like ``lsof``, which typically inspect
+ process file descriptor tables, would not be able to identify that this
+ file is still open by the system on behalf of the FUSE filesystem. This
+ makes it difficult for system administrators to track resource usage or
+ debug issues related to open files (e.g., preventing unmounts).
+
+ 2. **Bypassing RLIMIT_NOFILE**: The FUSE daemon process is subject to
+ resource limits, including the maximum number of open file descriptors
+ (``RLIMIT_NOFILE``). If an unprivileged daemon could register backing files
+ and then close its own FDs, it could potentially cause the kernel to hold
+ an unlimited number of open ``struct file`` references without these being
+ accounted against the daemon's ``RLIMIT_NOFILE``. This could lead to a
+ denial-of-service (DoS) by exhausting system-wide file resources.
+
+The ``CAP_SYS_ADMIN`` requirement acts as a safeguard against these issues,
+restricting this powerful capability to trusted processes.
+
+**NOTE**: ``io_uring`` solves a similar issue by exposing its "fixed files",
+which are visible via ``fdinfo`` and accounted under the registering user's
+``RLIMIT_NOFILE``.
+
+Filesystem Stacking and Shutdown Loops
+--------------------------------------
+
+Another concern relates to the potential for creating complex and problematic
+filesystem stacking scenarios if unprivileged users could set up passthrough.
+A FUSE passthrough filesystem might use a backing file that resides:
+
+ * On the *same* FUSE filesystem.
+ * On another filesystem (like OverlayFS) which itself might have an upper or
+ lower layer that is a FUSE filesystem.
+
+These configurations could create dependency loops, particularly during
+filesystem shutdown or unmount sequences, leading to deadlocks or system
+instability. This is conceptually similar to the risks associated with the
+``LOOP_SET_FD`` ioctl, which also requires ``CAP_SYS_ADMIN``.
+
+To mitigate this, FUSE passthrough already incorporates checks based on
+filesystem stacking depth (``sb->s_stack_depth`` and ``fc->max_stack_depth``).
+For example, during the ``FUSE_INIT`` handshake, the FUSE daemon can negotiate
+the ``max_stack_depth`` it supports. When a backing file is registered via
+``FUSE_DEV_IOC_BACKING_OPEN``, the kernel checks if the backing file's
+filesystem stack depth is within the allowed limit.
+
+The ``CAP_SYS_ADMIN`` requirement provides an additional layer of security,
+ensuring that only privileged users can create these potentially complex
+stacking arrangements.
+
+General Security Posture
+------------------------
+
+As a general principle for new kernel features that allow userspace to instruct
+the kernel to perform direct operations on its behalf based on user-provided
+file descriptors, starting with a higher privilege requirement (like
+``CAP_SYS_ADMIN``) is a conservative and common security practice. This allows
+the feature to be used and tested while further security implications are
+evaluated and addressed.
diff --git a/Documentation/filesystems/fuse.rst b/Documentation/filesystems/fuse/fuse.rst
index 1e31e87aee68..0fbd5a03fdc9 100644
--- a/Documentation/filesystems/fuse.rst
+++ b/Documentation/filesystems/fuse/fuse.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-====
-FUSE
-====
+=============
+FUSE Overview
+=============
Definitions
===========
@@ -129,6 +129,20 @@ For each connection the following files exist within this directory:
connection. This means that all waiting requests will be aborted and
an error returned for all aborted and new requests.
+ max_background
+ The maximum number of background requests that can be outstanding
+ at a time. When the number of background requests reaches this limit,
+ further requests will be blocked until some are completed, potentially
+ causing I/O operations to stall.
+
+  congestion_threshold
+    The threshold of background requests at which the kernel considers
+    the filesystem to be congested. When the number of background requests
+    exceeds this value, the kernel will skip asynchronous readahead,
+    reducing read-ahead optimization while preserving essential I/O. It
+    will also suspend non-synchronous writeback (WB_SYNC_NONE), delaying
+    page cache flushing to the filesystem.
+
Only the owner of the mount may read or write these files.
Interrupting filesystem operations
diff --git a/Documentation/filesystems/fuse/index.rst b/Documentation/filesystems/fuse/index.rst
new file mode 100644
index 000000000000..393a845214da
--- /dev/null
+++ b/Documentation/filesystems/fuse/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+FUSE (Filesystem in Userspace) Technical Documentation
+======================================================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ fuse
+ fuse-io
+ fuse-io-uring
+ fuse-passthrough
diff --git a/Documentation/filesystems/gfs2-glocks.rst b/Documentation/filesystems/gfs2/glocks.rst
index d14f230f0b12..ce5ff08cbd59 100644
--- a/Documentation/filesystems/gfs2-glocks.rst
+++ b/Documentation/filesystems/gfs2/glocks.rst
@@ -20,8 +20,7 @@ The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
-are queued, except for those marked LM_FLAG_PRIORITY which are
-used only during recovery, and even then only for journal locks.
+are queued.
There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
@@ -41,14 +40,14 @@ shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:
-========== ========== ============== ========== ==============
-Glock mode Cache data Cache Metadata Dirty Data Dirty Metadata
-========== ========== ============== ========== ==============
- UN No No No No
- SH Yes Yes No No
- DF No Yes No No
- EX Yes Yes Yes Yes
-========== ========== ============== ========== ==============
+========== ============== ========== ========== ==============
+Glock mode Cache Metadata Cache data Dirty Data Dirty Metadata
+========== ============== ========== ========== ==============
+ UN No No No No
+ DF Yes No No No
+ SH Yes Yes No No
+ EX Yes Yes Yes Yes
+========== ============== ========== ========== ==============
These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
@@ -56,60 +55,57 @@ all the modes. Only inode glocks use the DF mode for example.
Table of glock operations and per type constants:
-============= =============================================================
+============== =============================================================
Field Purpose
-============= =============================================================
-go_xmote_th Called before remote state change (e.g. to sync dirty data)
+============== =============================================================
+go_sync Called before remote state change (e.g. to sync dirty data)
go_xmote_bh Called after remote state change (e.g. to refill cache)
go_inval Called if remote state change requires invalidating the cache
-go_demote_ok Returns boolean value of whether its ok to demote a glock
- (e.g. checks timeout, and that there is no cached data)
-go_lock Called for the first local holder of a lock
-go_unlock Called on the final local unlock of a lock
+go_instantiate Called when a glock has been acquired
+go_held Called every time a glock holder is acquired
go_dump Called to print content of object for debugfs file, or on
error to dump glock to the log.
-go_type The type of the glock, ``LM_TYPE_*``
go_callback Called if the DLM sends a callback to drop this lock
+go_unlocked Called when a glock is unlocked (dlm_unlock())
+go_type The type of the glock, ``LM_TYPE_*``
go_flags GLOF_ASPACE is set, if the glock has an address space
associated with it
-============= =============================================================
+============== =============================================================
The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
-tends to show up most with shared mmaped files which are being written
+tends to show up most with shared mmapped files which are being written
to by multiple nodes. By delaying the demotion in response to a
remote callback, that gives the userspace program time to make
some progress before the pages are unmapped.
-There is a plan to try and remove the go_lock and go_unlock callbacks
-if possible, in order to try and speed up the fast path though the locking.
-Also, eventually we hope to make the glock "EX" mode locally shared
-such that any local locking will be done with the i_mutex as required
-rather than via the glock.
+Eventually, we hope to make the glock "EX" mode locally shared such that any
+local locking will be done with the i_mutex as required rather than via the
+glock.
Locking rules for glock operations:
-============= ====================== =============================
+============== ====================== =============================
Operation GLF_LOCK bit lock held gl_lockref.lock spinlock held
-============= ====================== =============================
-go_xmote_th Yes No
+============== ====================== =============================
+go_sync Yes No
go_xmote_bh Yes No
go_inval Yes No
-go_demote_ok Sometimes Yes
-go_lock Yes No
-go_unlock Yes No
+go_instantiate No No
+go_held No No
go_dump Sometimes Yes
go_callback Sometimes (N/A) Yes
-============= ====================== =============================
+go_unlocked Yes No
+============== ====================== =============================
.. Note::
Operations must not drop either the bit lock or the spinlock
if it is held on entry. go_dump must never block.
Note that go_dump will only be called if the glock's state
- indicates that it is caching uptodate data.
+ indicates that it is caching up-to-date data.
Glock locking order within GFS2:
diff --git a/Documentation/filesystems/gfs2.rst b/Documentation/filesystems/gfs2/index.rst
index 1bc48a13430c..e5e195403561 100644
--- a/Documentation/filesystems/gfs2.rst
+++ b/Documentation/filesystems/gfs2/index.rst
@@ -4,6 +4,9 @@
Global File System 2
====================
+Overview
+========
+
GFS2 is a cluster file system. It allows a cluster of computers to
simultaneously use a block device that is shared between them (with FC,
iSCSI, NBD, etc). GFS2 reads and writes to the block device like a local
@@ -50,3 +53,12 @@ The following man pages are available from gfs2-utils:
gfs2_convert to convert a gfs filesystem to GFS2 in-place
mkfs.gfs2 to make a filesystem
============ =============================================
+
+Implementation Notes
+====================
+
+.. toctree::
+ :maxdepth: 1
+
+ glocks
+ uevents
diff --git a/Documentation/filesystems/gfs2-uevents.rst b/Documentation/filesystems/gfs2/uevents.rst
index f162a2c76c69..f162a2c76c69 100644
--- a/Documentation/filesystems/gfs2-uevents.rst
+++ b/Documentation/filesystems/gfs2/uevents.rst
diff --git a/Documentation/filesystems/hpfs.rst b/Documentation/filesystems/hpfs.rst
index 7e0dd2f4373e..0f9516b5eb07 100644
--- a/Documentation/filesystems/hpfs.rst
+++ b/Documentation/filesystems/hpfs.rst
@@ -65,7 +65,7 @@ are case sensitive, so for example when you create a file FOO, you can use
'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you
also won't be able to compile linux kernel (and maybe other things) on HPFS
because kernel creates different files with names like bootsect.S and
-bootsect.s. When searching for file thats name has characters >= 128, codepages
+bootsect.s. When searching for file whose name has characters >= 128, codepages
are used - see below.
OS/2 ignores dots and spaces at the end of file name, so this driver does as
well. If you create 'a. ...', the file 'a' will be created, but you can still
diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst
index ad6d21640576..2a206129f828 100644
--- a/Documentation/filesystems/idmappings.rst
+++ b/Documentation/filesystems/idmappings.rst
@@ -36,7 +36,7 @@ and write down the mappings it will generate::
From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
-the set of all possible ids useable on a given system.
+the set of all possible ids usable on a given system.
Looking at this mathematically briefly will help us highlight some properties
that make it easier to understand how we can translate between idmappings. For
@@ -47,7 +47,7 @@ example, we know that the inverse idmapping is an order isomorphism as well::
k10002 -> u24
Given that we are dealing with order isomorphisms plus the fact that we're
-dealing with subsets we can embedd idmappings into each other, i.e. we can
+dealing with subsets we can embed idmappings into each other, i.e. we can
sensibly translate between different idmappings. For example, assume we've been
given the three idmappings::
@@ -60,11 +60,11 @@ and id ``k11000`` which has been generated by the first idmapping by mapping
Because we're dealing with order isomorphic subsets it is meaningful to ask
what id ``k11000`` corresponds to in the second or third idmapping. The
-straightfoward algorithm to use is to apply the inverse of the first idmapping,
+straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second idmapping mapping or third idmapping mapping. The second
-idmapping would map ``u1000`` down to ``21000``. The third idmapping would map
-``u1000`` down to ``u31000``.
+idmapping would map ``u1000`` down to ``k21000``. The third idmapping would map
+``u1000`` down to ``k31000``.
If we were given the same task for the following three idmappings::
@@ -146,9 +146,10 @@ For the rest of this document we will prefix all userspace ids with ``u`` and
all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
an idmapping will be written as ``u0:k10000:r10000``.
-For example, the id ``u1000`` is an id in the upper idmapset or "userspace
-idmapset" starting with ``u1000``. And it is mapped to ``k11000`` which is a
-kernel id in the lower idmapset or "kernel idmapset" starting with ``k10000``.
+For example, within this idmapping, the id ``u1000`` is an id in the upper
+idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
+``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
+starting with ``k10000``.
A kernel id is always created by an idmapping. Such idmappings are associated
with user namespaces. Since we mainly care about how idmappings work we're not
@@ -367,12 +368,19 @@ So with the second step the kernel guarantees that a valid userspace id can be
written to disk. If it can't the kernel will refuse the creation request to not
even remotely risk filesystem corruption.
-The astute reader will have realized that this is simply a varation of the
+The astute reader will have realized that this is simply a variation of the
crossmapping algorithm we mentioned above in a previous section. First, the
kernel maps the caller's userspace id down into a kernel id according to the
caller's idmapping and then maps that kernel id up according to the
filesystem's idmapping.
+From an implementation point of view, it's worth mentioning how idmappings are
+represented. All idmappings are taken from the corresponding user namespace.
+
+ - caller's idmapping (usually taken from ``current_user_ns()``)
+ - filesystem's idmapping (``sb->s_user_ns``)
+ - mount's idmapping (``mnt_idmap(vfsmnt)``)
+
Let's see some examples with caller/filesystem idmapping but without mount
idmappings. This will exhibit some problems we can hit. After that we will
revisit/reconsider these examples, this time using mount idmappings, to see how
@@ -458,7 +466,7 @@ the kernel id that was created in the caller's idmapping. This has mainly two
consequences.
First, that we can't allow a caller to ultimately write to disk with another
-userspace id. We could only do this if we were to mount the whole fileystem
+userspace id. We could only do this if we were to mount the whole filesystem
with the caller's or another idmapping. But that solution is limited to a few
filesystems and not very flexible. But this is a use-case that is pretty
important in containerized workloads.
@@ -589,7 +597,7 @@ on their work machine.
In both cases changing ownership recursively has grave implications. The most
obvious one is that ownership is changed globally and permanently. In the home
-directory case this change in ownership would even need to happen everytime the
+directory case this change in ownership would even need to happen every time the
user switches from their home to their work machine. For really large sets of
files this becomes increasingly costly.
@@ -662,7 +670,7 @@ use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers.
To illustrate why this helper currently exists, consider what happens when we
change ownership of an inode from an idmapped mount. After we generated
a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit to
-this ``vfsuid_t`` or ``vfsgid_t`` to become the new filesytem wide ownership.
+this ``vfsuid_t`` or ``vfsgid_t`` to become the new filesystem wide ownership.
Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t``
or ``kgid_t``. And this can be done by using ``vfsuid_into_kuid()`` and
``vfsgid_into_kgid()``.
@@ -813,7 +821,7 @@ the same idmapping to the mount. We now perform three steps:
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k20000:r10000, u1000) = k21000
-2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::
from_kuid(u0:k20000:r10000, k21000) = u1000
@@ -846,10 +854,10 @@ The same translation algorithm works with the third example.
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k0:r4294967295, u1000) = k1000
-2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::
- from_kuid(u0:k0:r4294967295, k21000) = u1000
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
So the ownership that lands on disk will be ``u1000``.
@@ -986,7 +994,7 @@ from above:::
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k0:r4294967295, u1000) = k1000
-2. Verify that the caller's filesystem ids can be mapped to userspace ids in the
+3. Verify that the caller's filesystem ids can be mapped to userspace ids in the
filesystem's idmapping::
from_kuid(u0:k0:r4294967295, k1000) = u1000
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index eb252fc972aa..f4873197587d 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -29,11 +29,13 @@ algorithms work.
fiemap
files
locks
+ multigrain-ts
mount_api
quota
seq_file
sharedsubtree
idmappings
+ iomap/index
automount-support
@@ -50,6 +52,7 @@ filesystem implementations.
.. toctree::
:maxdepth: 2
+ buffer
journalling
fscrypt
fsverity
@@ -86,19 +89,15 @@ Documentation for filesystem implementations.
ext3
ext4/index
f2fs
- gfs2
- gfs2-uevents
- gfs2-glocks
+ gfs2/index
hfs
hfsplus
hpfs
- fuse
- fuse-io
+ fuse/index
inotify
isofs
nilfs2
nfs/index
- ntfs
ntfs3
ocfs2
ocfs2-online-filecheck
@@ -109,19 +108,17 @@ Documentation for filesystem implementations.
qnx6
ramfs-rootfs-initramfs
relay
+ resctrl
romfs
smb/index
spufs/index
squashfs
sysfs
- sysv-fs
tmpfs
ubifs
ubifs-authentication
udf
virtiofs
vfat
- xfs-delayed-logging-design
- xfs-self-describing-metadata
- xfs-online-fsck-design
+ xfs/index
zonefs
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst
new file mode 100644
index 000000000000..0f7672676c0b
--- /dev/null
+++ b/Documentation/filesystems/iomap/design.rst
@@ -0,0 +1,459 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_design:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+==============
+Library Design
+==============
+
+.. contents:: Table of Contents
+ :local:
+
+Introduction
+============
+
+iomap is a filesystem library for handling common file operations.
+The library has two layers:
+
+ 1. A lower layer that provides an iterator over ranges of file offsets.
+ This layer tries to obtain mappings of each file range to storage
+ from the filesystem, but the storage information is not necessarily
+ required.
+
+ 2. An upper layer that acts upon the space mappings provided by the
+ lower layer iterator.
+
+The iteration can involve mappings of a file's logical offset ranges to
+physical extents, but the storage layer information is not necessarily
+required, e.g. for walking cached file information.
+The library exports various APIs for implementing file operations such
+as:
+
+ * Pagecache reads and writes
+ * Folio write faults to the pagecache
+ * Writeback of dirty folios
+ * Direct I/O reads and writes
+ * fsdax I/O reads, writes, loads, and stores
+ * FIEMAP
+ * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
+ * swapfile activation
+
+The origins of this library are in the file I/O path that XFS once used; it
+has now been extended to cover several other operations.
+
+Who Should Read This?
+=====================
+
+The target audience for this document are filesystem, storage, and
+pagecache programmers and code reviewers.
+
+If you are working on PCI, machine architectures, or device drivers, you
+are most likely in the wrong place.
+
+How Is This Better?
+===================
+
+Unlike the classic Linux I/O model which breaks file I/O into small
+units (generally memory pages or blocks) and looks up space mappings on
+the basis of that unit, the iomap model asks the filesystem for the
+largest space mappings that it can create for a given file operation and
+initiates operations on that basis.
+This strategy improves the filesystem's visibility into the size of the
+operation being performed, which enables it to combat fragmentation with
+larger space allocations when possible.
+Larger space mappings improve runtime performance by amortizing the cost
+of mapping function calls into the filesystem across a larger amount of
+data.
+
+At a high level, an iomap operation `looks like this
+<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:
+
+1. For each byte in the operation range...
+
+ 1. Obtain a space mapping via ``->iomap_begin``
+
+ 2. For each sub-unit of work...
+
+ 1. Revalidate the mapping and go back to (1) above, if necessary.
+ So far only the pagecache operations need to do this.
+
+ 2. Do the work
+
+ 3. Increment operation cursor
+
+ 4. Release the mapping via ``->iomap_end``, if necessary
+
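+As a sketch (not verbatim kernel code; ``do_one_mapping()`` is a
+hypothetical helper, and the iterator's bookkeeping details vary
+between kernel versions), the canonical iteration loop looks like:
+
+.. code-block:: c
+
+ #include <linux/iomap.h>
+
+ static int walk_file_range(struct inode *inode, loff_t pos, u64 len,
+                            const struct iomap_ops *ops)
+ {
+     struct iomap_iter iter = {
+         .inode = inode,
+         .pos   = pos,
+         .len   = len,
+         .flags = IOMAP_REPORT,
+     };
+     int ret;
+
+     /* obtain a mapping, act on it, then report how much was handled */
+     while ((ret = iomap_iter(&iter, ops)) > 0)
+         iter.processed = do_one_mapping(&iter);
+     return ret;
+ }
+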
+Each iomap operation will be covered in more detail below.
+This library was covered previously by an `LWN article
+<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
+<https://kernelnewbies.org/KernelProjects/iomap>`_.
+
+The goal of this document is to provide a brief discussion of the
+design and capabilities of iomap, followed by a more detailed catalog
+of the interfaces presented by iomap.
+If you change iomap, please update this design document.
+
+File Range Iterator
+===================
+
+Definitions
+-----------
+
+ * **buffer head**: Shattered remnants of the old buffer cache.
+
+ * ``fsblock``: The block size of a file, also known as ``i_blocksize``.
+
+ * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
+ Processes hold this in shared mode to read file state and contents.
+ Some filesystems may allow shared mode for writes.
+ Processes often hold this in exclusive mode to change file state and
+ contents.
+
+ * ``invalidate_lock``: The pagecache ``struct address_space``
+ rwsemaphore that protects against folio insertion and removal for
+ filesystems that support punching out folios below EOF.
+ Processes wishing to insert folios must hold this lock in shared
+ mode to prevent removal, though concurrent insertion is allowed.
+ Processes wishing to remove folios must hold this lock in exclusive
+ mode to prevent insertions.
+ Concurrent removals are not allowed.
+
+ * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
+ device pre-shutdown hook from returning before other threads have
+ released resources.
+
+ * **filesystem mapping lock**: This synchronization primitive is
+ internal to the filesystem and must protect the file mapping data
+ from updates while a mapping is being sampled.
+ The filesystem author must determine how this coordination should
+ happen; it does not need to be an actual lock.
+
+ * **iomap internal operation lock**: This is a general term for
+ synchronization primitives that iomap functions take while holding a
+ mapping.
+ A specific example would be taking the folio lock while reading or
+ writing the pagecache.
+
+ * **pure overwrite**: A write operation that does not require any
+ metadata or zeroing operations to perform during either submission
+ or completion.
+ This implies that the filesystem must have already allocated space
+ on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
+ constraints on IO alignment or size.
+ The only constraints on I/O alignment are device level (minimum I/O
+ size and alignment, typically sector size).
+
+``struct iomap``
+----------------
+
+The filesystem communicates to the iomap iterator the mapping of
+byte ranges of a file to byte ranges of a storage device with the
+structure below:
+
+.. code-block:: c
+
+ struct iomap {
+ u64 addr;
+ loff_t offset;
+ u64 length;
+ u16 type;
+ u16 flags;
+ struct block_device *bdev;
+ struct dax_device *dax_dev;
+ void *inline_data;
+ void *private;
+ u64 validity_cookie;
+ };
+
+The fields are as follows:
+
+ * ``offset`` and ``length`` describe the range of file offsets, in
+ bytes, covered by this mapping.
+ These fields must always be set by the filesystem.
+
+ * ``type`` describes the type of the space mapping:
+
+ * **IOMAP_HOLE**: No storage has been allocated.
+ This type must never be returned in response to an ``IOMAP_WRITE``
+ operation because writes must allocate and map space, and return
+ the mapping.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+ iomap does not support writing (whether via pagecache or direct
+ I/O) to a hole.
+
+ * **IOMAP_DELALLOC**: A promise to allocate space at a later time
+ ("delayed allocation").
+     If the filesystem returns ``IOMAP_F_NEW`` here and the write fails, the
+ ``->iomap_end`` function must delete the reservation.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+
+ * **IOMAP_MAPPED**: The file range maps to specific space on the
+ storage device.
+ The device is returned in ``bdev`` or ``dax_dev``.
+ The device address, in bytes, is returned via ``addr``.
+
+ * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
+ storage device, but the space has not yet been initialized.
+ The device is returned in ``bdev`` or ``dax_dev``.
+ The device address, in bytes, is returned via ``addr``.
+ Reads from this type of mapping will return zeroes to the caller.
+ For a write or writeback operation, the ioend should update the
+ mapping to MAPPED.
+ Refer to the sections about ioends for more details.
+
+ * **IOMAP_INLINE**: The file range maps to the memory buffer
+ specified by ``inline_data``.
+     For a write operation, the ``->iomap_end`` function presumably
+ handles persisting the data.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+
+ * ``flags`` describe the status of the space mapping.
+ These flags should be set by the filesystem in ``->iomap_begin``:
+
+ * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
+ Areas that will not be written to must be zeroed.
+ If a write fails and the mapping is a space reservation, the
+ reservation must be deleted.
+
+ * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
+ to access any data written.
+ fdatasync is required to commit these changes to persistent
+ storage.
+ This needs to take into account metadata changes that *may* be made
+ at I/O completion, such as file size updates from direct I/O.
+
+ * **IOMAP_F_SHARED**: The space under the mapping is shared.
+ Copy on write is necessary to avoid corrupting other file data.
+
+ * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
+ heads for pagecache operations.
+ Do not add more uses of this.
+
+ * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
+ coalesced into this single mapping.
+ This is only useful for FIEMAP.
+
+ * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
+ regular file data.
+ This is only useful for FIEMAP.
+
+ * **IOMAP_F_BOUNDARY**: This indicates I/O and its completion must not be
+ merged with any other I/O or completion. Filesystems must use this when
+ submitting I/O to devices that cannot handle I/O crossing certain LBAs
+ (e.g. ZNS devices). This flag applies only to buffered I/O writeback; all
+ other functions ignore it.
+
+ * **IOMAP_F_PRIVATE**: This flag is reserved for filesystem private use.
+
+ * **IOMAP_F_ANON_WRITE**: Indicates that (write) I/O does not have a target
+ block assigned to it yet and the file system will do that in the bio
+ submission handler, splitting the I/O as needed.
+
+   * **IOMAP_F_ATOMIC_BIO**: This indicates write I/O must be submitted
+     with the ``REQ_ATOMIC`` flag set in the bio.
+     Filesystems need to set this flag to inform iomap that the write
+     I/O operation requires torn-write protection based on a HW-offload
+     mechanism.
+     They must also ensure that mapping updates upon completion of the
+     I/O are performed in a single metadata update.
+
+ These flags can be set by iomap itself during file operations.
+ The filesystem should supply an ``->iomap_end`` function if it needs
+ to observe these flags:
+
+ * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
+ using this mapping.
+
+ * **IOMAP_F_STALE**: The mapping was found to be stale.
+ iomap will call ``->iomap_end`` on this mapping and then
+ ``->iomap_begin`` to obtain a new mapping.
+
+ Currently, these flags are only set by pagecache operations.
+
+ * ``addr`` describes the device address, in bytes.
+
+ * ``bdev`` describes the block device for this mapping.
+ This only needs to be set for mapped or unwritten operations.
+
+ * ``dax_dev`` describes the DAX device for this mapping.
+ This only needs to be set for mapped or unwritten operations, and
+ only for a fsdax operation.
+
+ * ``inline_data`` points to a memory buffer for I/O involving
+ ``IOMAP_INLINE`` mappings.
+ This value is ignored for all other mapping types.
+
+ * ``private`` is a pointer to `filesystem-private information
+ <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
+ This value will be passed unchanged to ``->iomap_end``.
+
+ * ``validity_cookie`` is a magic freshness value set by the filesystem
+ that should be used to detect stale mappings.
+ For pagecache operations this is critical for correct operation
+ because page faults can occur, which implies that filesystem locks
+ should not be held between ``->iomap_begin`` and ``->iomap_end``.
+ Filesystems with completely static mappings need not set this value.
+ Only pagecache operations revalidate mappings; see the section about
+ ``iomap_valid`` for details.
+
+``struct iomap_ops``
+--------------------
+
+Every iomap function requires the filesystem to pass an operations
+structure to obtain a mapping and (optionally) to release the mapping:
+
+.. code-block:: c
+
+ struct iomap_ops {
+ int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
+ unsigned flags, struct iomap *iomap,
+ struct iomap *srcmap);
+
+ int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
+ ssize_t written, unsigned flags,
+ struct iomap *iomap);
+ };
+
+``->iomap_begin``
+~~~~~~~~~~~~~~~~~
+
+iomap operations call ``->iomap_begin`` to obtain one file mapping for
+the range of bytes specified by ``pos`` and ``length`` for the file
+``inode``.
+This mapping should be returned through the ``iomap`` pointer.
+The mapping must cover at least the first byte of the supplied file
+range, but it does not need to cover the entire requested range.
+
+Each iomap operation describes the requested operation through the
+``flags`` argument.
+The exact value of ``flags`` will be documented in the
+operation-specific sections below.
+These flags can, at least in principle, apply generally to iomap
+operations:
+
+ * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
+ block storage.
+
+ * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
+ memory-like storage.
+
+ * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
+ effort attempt to avoid any operation that would result in blocking
+ the submitting task.
+ This is similar in intent to ``O_NONBLOCK`` for network APIs - it is
+ intended for asynchronous applications to keep doing other work
+ instead of waiting for the specific unavailable filesystem resource
+ to become available.
+ Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
+ trylock algorithms.
+ They need to be able to satisfy the entire I/O request range with a
+ single iomap mapping.
+ They need to avoid reading or writing metadata synchronously.
+ They need to avoid blocking memory allocations.
+ They need to avoid waiting on transaction reservations to allow
+ modifications to take place.
+ They probably should not be allocating new space.
+ And so on.
+ If there is any doubt in the filesystem developer's mind as to
+ whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
+ then they should return ``-EAGAIN`` as early as possible rather than
+ start the operation and force the submitting task to block.
+ ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
+ ``RWF_NOWAIT``.
+
+ * ``IOMAP_DONTCACHE`` is set when the caller wishes to perform a
+ buffered file I/O and would like the kernel to drop the pagecache
+ after the I/O completes, if it isn't already being used by another
+ thread.
+
+If it is necessary to read existing file contents from a `different
+<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
+device or address range on a device, the filesystem should return that
+information via ``srcmap``.
+Only pagecache and fsdax operations support reading from one mapping and
+writing to another.
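+
+As an illustration, a hypothetical ``->iomap_begin`` for a filesystem
+whose data is a single contiguous extent might look like the sketch
+below.
+The ``dummyfs`` name is invented; a real implementation would look up
+extents under the filesystem mapping lock and honor ``flags``:
+
+.. code-block:: c
+
+ static int dummyfs_iomap_begin(struct inode *inode, loff_t pos,
+                                loff_t length, unsigned flags,
+                                struct iomap *iomap, struct iomap *srcmap)
+ {
+     /* Hypothetical 1:1 file-to-device mapping starting at block 0. */
+     iomap->type   = IOMAP_MAPPED;
+     iomap->offset = pos;
+     iomap->length = length;
+     iomap->addr   = pos;
+     iomap->bdev   = inode->i_sb->s_bdev;
+     return 0;
+ }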
+
+``->iomap_end``
+~~~~~~~~~~~~~~~
+
+After the operation completes, the ``->iomap_end`` function, if present,
+is called to signal that iomap is finished with a mapping.
+Typically, implementations will use this function to tear down any
+context that was set up in ``->iomap_begin``.
+For example, a write might wish to commit the reservations for the bytes
+that were operated upon and unreserve any space that was not operated
+upon.
+``written`` might be zero if no bytes were touched.
+``flags`` will contain the same value passed to ``->iomap_begin``.
+iomap ops for reads are not likely to need to supply this function.
+
+Both functions should return a negative errno code on error, or zero on
+success.
+
+Preparing for File Operations
+=============================
+
+iomap only handles mapping and I/O.
+Filesystems must still call out to the VFS to check input parameters
+and file state before initiating an I/O operation.
+iomap does not handle obtaining filesystem freeze protection, updating
+timestamps, stripping privileges, or access control.
+
+Locking Hierarchy
+=================
+
+iomap requires that filesystems supply their own locking model.
+There are three categories of synchronization primitives, as far as
+iomap is concerned:
+
+ * The **upper** level primitive is provided by the filesystem to
+ coordinate access to different iomap operations.
+ The exact primitive is specific to the filesystem and operation,
+   but is often the VFS inode lock, the pagecache invalidation lock,
+   or a folio lock.
+ For example, a filesystem might take ``i_rwsem`` before calling
+ ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
+ these two file operations from clobbering each other.
+ Pagecache writeback may lock a folio to prevent other threads from
+ accessing the folio until writeback is underway.
+
+ * The **lower** level primitive is taken by the filesystem in the
+ ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
+ access to the file space mapping information.
+ The fields of the iomap object should be filled out while holding
+ this primitive.
+ The upper level synchronization primitive, if any, remains held
+ while acquiring the lower level synchronization primitive.
+ For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
+ while sampling mappings.
+ Filesystems with immutable mapping information may not require
+ synchronization here.
+
+ * The **operation** primitive is taken by an iomap operation to
+ coordinate access to its own internal data structures.
+ The upper level synchronization primitive, if any, remains held
+ while acquiring this primitive.
+ The lower level primitive is not held while acquiring this
+ primitive.
+ For example, pagecache write operations will obtain a file mapping,
+ then grab and lock a folio to copy new contents.
+ It may also lock an internal folio state object to update metadata.
+
+The exact locking requirements are specific to the filesystem; for
+certain operations, some of these locks can be elided.
+All further mentions of locking are *recommendations*, not mandates.
+Each filesystem author must figure out the locking for themselves.
+
+Bugs and Limitations
+====================
+
+ * No support for fscrypt.
+ * No support for compression.
+ * No support for fsverity yet.
+ * Strong assumptions that I/O should work the way it does on XFS.
+ * Does iomap *actually* work for non-regular file data?
+
+Patches welcome!
diff --git a/Documentation/filesystems/iomap/index.rst b/Documentation/filesystems/iomap/index.rst
new file mode 100644
index 000000000000..3c6a52440250
--- /dev/null
+++ b/Documentation/filesystems/iomap/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+VFS iomap Documentation
+=======================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ design
+ operations
+ porting
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
new file mode 100644
index 000000000000..da982ca7e413
--- /dev/null
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -0,0 +1,785 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_operations:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+=========================
+Supported File Operations
+=========================
+
+.. contents:: Table of Contents
+ :local:
+
+Below is a discussion of the high level file operations that iomap
+implements.
+
+Buffered I/O
+============
+
+Buffered I/O is the default file I/O path in Linux.
+File contents are cached in memory ("pagecache") to satisfy reads and
+writes.
+Dirty cache will be written back to disk at some point that can be
+forced via ``fsync`` and variants.
+
+iomap implements nearly all the folio and pagecache management that
+filesystems have to implement themselves under the legacy I/O model.
+This means that the filesystem need not know the details of allocating,
+mapping, managing uptodate and dirty state, or writeback of pagecache
+folios.
+Under the legacy I/O model, this was managed very inefficiently with
+linked lists of buffer heads instead of the per-folio bitmaps that iomap
+uses.
+Unless the filesystem explicitly opts in to buffer heads, they will not
+be used, which makes buffered I/O much more efficient, and the pagecache
+maintainer much happier.
+
+``struct address_space_operations``
+-----------------------------------
+
+The following iomap functions can be referenced directly from the
+address space operations structure:
+
+ * ``iomap_dirty_folio``
+ * ``iomap_release_folio``
+ * ``iomap_invalidate_folio``
+ * ``iomap_is_partially_uptodate``
+
+The following address space operations can be wrapped easily:
+
+ * ``read_folio``
+ * ``readahead``
+ * ``writepages``
+ * ``bmap``
+ * ``swap_activate``
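+
+For example, an iomap-based filesystem might wire up its address space
+operations as in the sketch below; the ``dummyfs_*`` wrappers are
+assumed to call ``iomap_read_folio``, ``iomap_readahead``, and
+``iomap_writepages`` with the filesystem's own ops:
+
+.. code-block:: c
+
+ static const struct address_space_operations dummyfs_aops = {
+     .read_folio            = dummyfs_read_folio,
+     .readahead             = dummyfs_readahead,
+     .writepages            = dummyfs_writepages,
+     .dirty_folio           = iomap_dirty_folio,
+     .release_folio         = iomap_release_folio,
+     .invalidate_folio      = iomap_invalidate_folio,
+     .is_partially_uptodate = iomap_is_partially_uptodate,
+ };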
+
+``struct iomap_write_ops``
+--------------------------
+
+.. code-block:: c
+
+ struct iomap_write_ops {
+ struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
+ unsigned len);
+ void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
+ struct folio *folio);
+ bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
+ int (*read_folio_range)(const struct iomap_iter *iter,
+ struct folio *folio, loff_t pos, size_t len);
+ };
+
+iomap calls these functions:
+
+ - ``get_folio``: Called to allocate and return an active reference to
+ a locked folio prior to starting a write.
+ If this function is not provided, iomap will call
+ ``iomap_get_folio``.
+ This could be used to `set up per-folio filesystem state
+ <https://lore.kernel.org/all/20190429220934.10415-5-agruenba@redhat.com/>`_
+ for a write.
+
+ - ``put_folio``: Called to unlock and put a folio after a pagecache
+ operation completes.
+ If this function is not provided, iomap will ``folio_unlock`` and
+ ``folio_put`` on its own.
+ This could be used to `commit per-folio filesystem state
+ <https://lore.kernel.org/all/20180619164137.13720-6-hch@lst.de/>`_
+ that was set up by ``->get_folio``.
+
+ - ``iomap_valid``: The filesystem may not hold locks between
+ ``->iomap_begin`` and ``->iomap_end`` because pagecache operations
+ can take folio locks, fault on userspace pages, initiate writeback
+ for memory reclamation, or engage in other time-consuming actions.
+ If a file's space mapping data are mutable, it is possible that the
+ mapping for a particular pagecache folio can `change in the time it
+ takes
+ <https://lore.kernel.org/all/20221123055812.747923-8-david@fromorbit.com/>`_
+ to allocate, install, and lock that folio.
+
+ For the pagecache, races can happen if writeback doesn't take
+ ``i_rwsem`` or ``invalidate_lock`` and updates mapping information.
+ Races can also happen if the filesystem allows concurrent writes.
+ For such files, the mapping *must* be revalidated after the folio
+ lock has been taken so that iomap can manage the folio correctly.
+
+ fsdax does not need this revalidation because there's no writeback
+ and no support for unwritten extents.
+
+ Filesystems subject to this kind of race must provide a
+ ``->iomap_valid`` function to decide if the mapping is still valid.
+ If the mapping is not valid, the mapping will be sampled again.
+
+ To support making the validity decision, the filesystem's
+ ``->iomap_begin`` function may set ``struct iomap::validity_cookie``
+ at the same time that it populates the other iomap fields.
+ A simple validation cookie implementation is a sequence counter.
+ If the filesystem bumps the sequence counter every time it modifies
+ the inode's extent map, it can be placed in the ``struct
+ iomap::validity_cookie`` during ``->iomap_begin``.
+   If the value in the cookie is found to differ from the value the
+   filesystem holds when the mapping is passed back to
+   ``->iomap_valid``, then the iomap should be considered stale and
+   the validation failed; see the sketch after this list.
+
+ - ``read_folio_range``: Called to synchronously read in the range that will
+ be written to. If this function is not provided, iomap will default to
+ submitting a bio read request.
+
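+A minimal sketch of the sequence-counter scheme described under
+``iomap_valid`` above, assuming a hypothetical per-inode counter
+``i_mapping_seq`` that the filesystem bumps under its mapping lock on
+every extent map change (``dummyfs_i()`` is likewise invented):
+
+.. code-block:: c
+
+ static bool dummyfs_iomap_valid(struct inode *inode,
+                                 const struct iomap *iomap)
+ {
+     /*
+      * ->iomap_begin stored the counter via
+      * iomap->validity_cookie = READ_ONCE(di->i_mapping_seq);
+      * the mapping is stale if the counter has changed since.
+      */
+     return iomap->validity_cookie ==
+            READ_ONCE(dummyfs_i(inode)->i_mapping_seq);
+ }
+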
+These ``struct kiocb`` flags are significant for buffered I/O with iomap:
+
+ * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+
+ * ``IOCB_DONTCACHE``: Turns on ``IOMAP_DONTCACHE``.
+
+``struct iomap_read_ops``
+--------------------------
+
+.. code-block:: c
+
+ struct iomap_read_ops {
+ int (*read_folio_range)(const struct iomap_iter *iter,
+ struct iomap_read_folio_ctx *ctx, size_t len);
+ void (*submit_read)(struct iomap_read_folio_ctx *ctx);
+ };
+
+iomap calls these functions:
+
+ - ``read_folio_range``: Called to read in the range. This must be
+   provided by the caller.
+   If this function succeeds, ``iomap_finish_folio_read()`` must be
+   called after the read attempt completes, regardless of whether the
+   read succeeded or failed.
+
+ - ``submit_read``: Submit any pending read requests. This function is
+ optional.
+
+Internal per-Folio State
+------------------------
+
+If the fsblock size matches the size of a pagecache folio, it is assumed
+that all disk I/O operations will operate on the entire folio.
+The uptodate (memory contents are at least as new as what's on disk) and
+dirty (memory contents are newer than what's on disk) status of the
+folio are all that's needed for this case.
+
+If the fsblock size is less than the size of a pagecache folio, iomap
+tracks the per-fsblock uptodate and dirty state itself.
+This enables iomap to handle both "bs < ps" `filesystems
+<https://lore.kernel.org/all/20230725122932.144426-1-ritesh.list@gmail.com/>`_
+and large folios in the pagecache.
+
+iomap internally tracks two state bits per fsblock:
+
+ * ``uptodate``: iomap will try to keep folios fully up to date.
+ If there are read(ahead) errors, those fsblocks will not be marked
+ uptodate.
+ The folio itself will be marked uptodate when all fsblocks within the
+ folio are uptodate.
+
+ * ``dirty``: iomap will set the per-block dirty state when programs
+ write to the file.
+ The folio itself will be marked dirty when any fsblock within the
+ folio is dirty.
+
+iomap also tracks the number of read and write disk I/Os that are in
+flight.
+This structure is much lighter weight than ``struct buffer_head``
+because there is only one per folio, and the per-fsblock overhead is two
+bits vs. 104 bytes.
+
+Filesystems wishing to turn on large folios in the pagecache should call
+``mapping_set_large_folios`` when initializing the incore inode.
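+
+For example, in the routine that sets up a freshly allocated incore
+inode:
+
+.. code-block:: c
+
+ /* Opt this inode's pagecache into large folios. */
+ mapping_set_large_folios(inode->i_mapping);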
+
+Buffered Readahead and Reads
+----------------------------
+
+The ``iomap_readahead`` function initiates readahead to the pagecache.
+The ``iomap_read_folio`` function reads one folio's worth of data into
+the pagecache.
+The ``flags`` argument to ``->iomap_begin`` will be set to zero.
+The pagecache takes whatever locks it needs before calling the
+filesystem.
+
+Both ``iomap_readahead`` and ``iomap_read_folio`` pass in a ``struct
+iomap_read_folio_ctx``:
+
+.. code-block:: c
+
+ struct iomap_read_folio_ctx {
+ const struct iomap_read_ops *ops;
+ struct folio *cur_folio;
+ struct readahead_control *rac;
+ void *read_ctx;
+ };
+
+``iomap_readahead`` must set:
+ * ``ops->read_folio_range()`` and ``rac``
+
+``iomap_read_folio`` must set:
+ * ``ops->read_folio_range()`` and ``cur_folio``
+
+``ops->submit_read()`` and ``read_ctx`` are optional. ``read_ctx`` is used to
+pass in any custom data the caller needs accessible in the ops callbacks for
+fulfilling reads.
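+
+Putting this together, a caller's ``read_folio`` implementation might
+look roughly like the sketch below; the exact ``iomap_read_folio``
+prototype is assumed here, so check ``include/linux/iomap.h`` for the
+current form:
+
+.. code-block:: c
+
+ static int dummyfs_read_folio(struct file *file, struct folio *folio)
+ {
+     struct iomap_read_folio_ctx ctx = {
+         .ops       = &dummyfs_read_ops, /* supplies ->read_folio_range() */
+         .cur_folio = folio,
+     };
+
+     return iomap_read_folio(&dummyfs_iomap_ops, &ctx);
+ }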
+
+Buffered Writes
+---------------
+
+The ``iomap_file_buffered_write`` function writes an ``iocb`` to the
+pagecache.
+``IOMAP_WRITE`` or ``IOMAP_WRITE | IOMAP_NOWAIT`` will be passed as
+the ``flags`` argument to ``->iomap_begin``.
+Callers commonly take ``i_rwsem`` in either shared or exclusive mode
+before calling this function.
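+
+A typical ``->write_iter`` therefore looks something like the sketch
+below; the three-argument ``iomap_file_buffered_write`` shown is the
+historical prototype, and recent kernels take additional arguments:
+
+.. code-block:: c
+
+ static ssize_t dummyfs_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ {
+     struct inode *inode = file_inode(iocb->ki_filp);
+     ssize_t ret;
+
+     inode_lock(inode);
+     ret = generic_write_checks(iocb, from);
+     if (ret > 0)
+         ret = iomap_file_buffered_write(iocb, from, &dummyfs_iomap_ops);
+     inode_unlock(inode);
+     return ret;
+ }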
+
+mmap Write Faults
+~~~~~~~~~~~~~~~~~
+
+The ``iomap_page_mkwrite`` function handles a write fault to a folio in
+the pagecache.
+``IOMAP_WRITE | IOMAP_FAULT`` will be passed as the ``flags`` argument
+to ``->iomap_begin``.
+Callers commonly take the mmap ``invalidate_lock`` in shared or
+exclusive mode before calling this function.
+
+Buffered Write Failures
+~~~~~~~~~~~~~~~~~~~~~~~
+
+After a short write to the pagecache, the areas not written will not
+become marked dirty.
+The filesystem must arrange to `cancel
+<https://lore.kernel.org/all/20221123055812.747923-6-david@fromorbit.com/>`_
+such `reservations
+<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
+because writeback will not consume the reservation.
+The ``iomap_write_delalloc_release`` function can be called from a
+``->iomap_end`` function to find all the clean areas of the folios
+caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
+It takes the ``invalidate_lock``.
+
+The filesystem must supply a function ``punch`` to be called for
+each file range in this state.
+This function must *only* remove delayed allocation reservations, in
+case another thread racing with the current thread writes successfully
+to the same region and triggers writeback to flush the dirty data out to
+disk.
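+
+A sketch of such a ``punch`` callback is shown below; the callback
+signature is assumed and ``dummyfs_remove_delalloc`` is an invented
+helper:
+
+.. code-block:: c
+
+ /*
+  * Handed to iomap_write_delalloc_release() from ->iomap_end.
+  * It must only remove delalloc reservations in the given range;
+  * a racing write may already have converted parts of it.
+  */
+ static void dummyfs_delalloc_punch(struct inode *inode, loff_t offset,
+                                    loff_t length, struct iomap *iomap)
+ {
+     dummyfs_remove_delalloc(inode, offset, length);
+ }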
+
+Zeroing for File Operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filesystems can call ``iomap_zero_range`` to perform zeroing of the
+pagecache for non-truncation file operations that are not aligned to
+the fsblock size.
+``IOMAP_ZERO`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Unsharing Reflinked File Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filesystems can call ``iomap_file_unshare`` to force a file sharing
+storage with another file to preemptively copy the shared data to newly
+allocated storage.
+``IOMAP_WRITE | IOMAP_UNSHARE`` will be passed as the ``flags`` argument
+to ``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Truncation
+----------
+
+Filesystems can call ``iomap_truncate_page`` to zero the bytes in the
+pagecache from EOF to the end of the fsblock during a file truncation
+operation.
+``truncate_setsize`` or ``truncate_pagecache`` will take care of
+everything after the EOF block.
+``IOMAP_ZERO`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Pagecache Writeback
+-------------------
+
+Filesystems can call ``iomap_writepages`` to respond to a request to
+write dirty pagecache folios to disk.
+The ``mapping`` and ``wbc`` parameters should be passed unchanged.
+The ``wpc`` pointer should be allocated by the filesystem and must
+be initialized to zero.
+
+The pagecache will lock each folio before trying to schedule it for
+writeback.
+It does not lock ``i_rwsem`` or ``invalidate_lock``.
+
+The dirty bit will be cleared for all folios run through the
+``->writeback_range`` machinery described below even if the writeback fails.
+This is to prevent dirty folio clots when storage devices fail; an
+``-EIO`` is recorded for userspace to collect via ``fsync``.
+
+The ``ops`` structure must be specified and is as follows:
+
+``struct iomap_writeback_ops``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ struct iomap_writeback_ops {
+ int (*writeback_range)(struct iomap_writepage_ctx *wpc,
+ struct folio *folio, u64 pos, unsigned int len, u64 end_pos);
+ int (*writeback_submit)(struct iomap_writepage_ctx *wpc, int error);
+ };
+
+The fields are as follows:
+
+ - ``writeback_range``: Sets ``wpc->iomap`` to the space mapping of the file
+   range (in bytes) given by ``pos`` and ``len``.
+ iomap calls this function for each dirty fs block in each dirty folio,
+ though it will `reuse mappings
+ <https://lore.kernel.org/all/20231207072710.176093-15-hch@lst.de/>`_
+ for runs of contiguous dirty fsblocks within a folio.
+ Do not return ``IOMAP_INLINE`` mappings here; the ``->iomap_end``
+ function must deal with persisting written data.
+ Do not return ``IOMAP_DELALLOC`` mappings here; iomap currently
+ requires mapping to allocated space.
+ Filesystems can skip a potentially expensive mapping lookup if the
+ mappings have not changed.
+ This revalidation must be open-coded by the filesystem; it is
+ unclear if ``iomap::validity_cookie`` can be reused for this
+ purpose.
+
+   If this method fails to schedule I/O for any part of a dirty folio, it
+ should throw away any reservations that may have been made for the write.
+ The folio will be marked clean and an ``-EIO`` recorded in the
+ pagecache.
+ Filesystems can use this callback to `remove
+ <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
+ delalloc reservations to avoid having delalloc reservations for
+ clean pagecache.
+ This function must be supplied by the filesystem.
+ If this succeeds, iomap_finish_folio_write() must be called once writeback
+ completes for the range, regardless of whether the writeback succeeded or
+ failed.
+
+ - ``writeback_submit``: Submit the previously built writeback context.
+   Block-based filesystems should use the ``iomap_ioend_writeback_submit``
+   helper; other filesystems can implement their own.
+ File systems can optionally hook into writeback bio submission.
+ This might include pre-write space accounting updates, or installing
+ a custom ``->bi_end_io`` function for internal purposes, such as
+ deferring the ioend completion to a workqueue to run metadata update
+ transactions from process context before submitting the bio.
+ This function must be supplied by the filesystem.
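+
+A minimal ``->writepages`` wrapper following this scheme might look
+like the sketch below; the four-argument ``iomap_writepages`` call
+matches the description above, though the exact prototype differs
+between kernel versions:
+
+.. code-block:: c
+
+ static int dummyfs_writepages(struct address_space *mapping,
+                               struct writeback_control *wbc)
+ {
+     struct iomap_writepage_ctx wpc = { }; /* must be zeroed */
+
+     return iomap_writepages(mapping, wbc, &wpc, &dummyfs_writeback_ops);
+ }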
+
+Pagecache Writeback Completion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To handle the bookkeeping that must happen after disk I/O for writeback
+completes, iomap creates chains of ``struct iomap_ioend`` objects that
+wrap the ``bio`` that is used to write pagecache data to disk.
+By default, iomap finishes writeback ioends by clearing the writeback
+bit on the folios attached to the ``ioend``.
+If the write failed, it will also set the error bits on the folios and
+the address space.
+This can happen in interrupt or process context, depending on the
+storage device.
+Filesystems that need to update internal bookkeeping (e.g. unwritten
+extent conversions) should set their own ``bi_end_io`` on the bios
+submitted by ``->writeback_submit``.
+This function should call ``iomap_finish_ioends`` after finishing its
+own work (e.g. unwritten extent conversion).
+
+Some filesystems may wish to `amortize the cost of running metadata
+transactions
+<https://lore.kernel.org/all/20220120034733.221737-1-david@fromorbit.com/>`_
+for post-writeback updates by batching them.
+They may also require transactions to run from process context, which
+implies punting batches to a workqueue.
+iomap ioends contain a ``list_head`` to enable batching.
+
+Given a batch of ioends, iomap has a few helpers to assist with
+amortization:
+
+ * ``iomap_sort_ioends``: Sort all the ioends in the list by file
+ offset.
+
+ * ``iomap_ioend_try_merge``: Given an ioend that is not in any list and
+ a separate list of sorted ioends, merge as many of the ioends from
+ the head of the list into the given ioend.
+ ioends can only be merged if the file range and storage addresses are
+ contiguous; the unwritten and shared status are the same; and the
+ write I/O outcome is the same.
+ The merged ioends become their own list.
+
+ * ``iomap_finish_ioends``: Finish an ioend that possibly has other
+ ioends linked to it.
+
+Direct I/O
+==========
+
+In Linux, direct I/O is defined as file I/O that is issued directly to
+storage, bypassing the pagecache.
+The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
+writes for files.
+
+.. code-block:: c
+
+ ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+ const struct iomap_ops *ops,
+ const struct iomap_dio_ops *dops,
+ unsigned int dio_flags, void *private,
+ size_t done_before);
+
+The filesystem can provide the ``dops`` parameter if it needs to perform
+extra work before or after the I/O is issued to storage.
+The ``done_before`` parameter tells iomap how much of the request has
+already been transferred.
+It is used to continue a request asynchronously when `part of the
+request
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103>`_
+has already been completed synchronously.
+
+The ``done_before`` parameter should be set if writes for the ``iocb``
+have been initiated prior to the call.
+The direction of the I/O is determined from the ``iocb`` passed in.
+
+The ``dio_flags`` argument can be set to any combination of the
+following values:
+
+ * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the
+ kiocb is not synchronous.
+
+ * ``IOMAP_DIO_OVERWRITE_ONLY``: Perform a pure overwrite for this range
+ or fail with ``-EAGAIN``.
+ This can be used by filesystems with complex unaligned I/O
+ write paths to provide an optimised fast path for unaligned writes.
+ If a pure overwrite can be performed, then serialisation against
+ other I/Os to the same filesystem block(s) is unnecessary as there is
+ no risk of stale data exposure or data loss.
+ If a pure overwrite cannot be performed, then the filesystem can
+ perform the serialisation steps needed to provide exclusive access
+ to the unaligned I/O range so that it can perform allocation and
+ sub-block zeroing safely.
+ Filesystems can use this flag to try to reduce locking contention,
+ but a lot of `detailed checking
+ <https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@redhat.com/>`_
+ is required to do it `correctly
+ <https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@redhat.com/>`_.
+
+ * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever
+ progress has already been made.
+ The caller may deal with the page fault and retry the operation.
+ If the caller decides to retry the operation, it should pass the
+ accumulated return values of all previous calls as the
+ ``done_before`` parameter to the next call.
+
+These ``struct kiocb`` flags are significant for direct I/O with iomap:
+
+ * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+
+ * ``IOCB_SYNC``: Ensure that the device has persisted data to disk
+ before completing the call.
+ In the case of pure overwrites, the I/O may be issued with FUA
+ enabled.
+
+ * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an
+ interrupt.
+ Only meaningful for asynchronous I/O, and only if the entire I/O can
+ be issued as a single ``struct bio``.
+
+Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
+``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
+function for the file.
+They should not set ``->direct_IO``, which is deprecated.
+
+If a filesystem wishes to perform its own work before direct I/O
+completion, it should call ``__iomap_dio_rw``.
+If its return value is not an error pointer or a NULL pointer, the
+filesystem should pass the return value to ``iomap_dio_complete`` after
+finishing its internal work.
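+
+For instance, using the same argument list as ``iomap_dio_rw`` above
+(the ``dummyfs`` names are invented):
+
+.. code-block:: c
+
+ static ssize_t dummyfs_dio_write(struct kiocb *iocb, struct iov_iter *from)
+ {
+     struct iomap_dio *dio;
+
+     dio = __iomap_dio_rw(iocb, from, &dummyfs_iomap_ops,
+                          &dummyfs_dio_ops, 0, NULL, 0);
+     if (IS_ERR_OR_NULL(dio))
+         return PTR_ERR_OR_ZERO(dio);
+
+     /* ... filesystem-internal work before completion ... */
+
+     return iomap_dio_complete(dio);
+ }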
+
+Return Values
+-------------
+
+``iomap_dio_rw`` can return one of the following:
+
+ * A non-negative number of bytes transferred.
+
+ * ``-ENOTBLK``: Fall back to buffered I/O.
+ iomap itself will return this value if it cannot invalidate the page
+ cache before issuing the I/O to storage.
+ The ``->iomap_begin`` or ``->iomap_end`` functions may also return
+ this value.
+
+ * ``-EIOCBQUEUED``: The asynchronous direct I/O request has been
+ queued and will be completed separately.
+
+ * Any of the other negative error codes.
+
+Direct Reads
+------------
+
+A direct I/O read initiates a read I/O from the storage device to the
+caller's buffer.
+Dirty parts of the pagecache are flushed to storage before initiating
+the read I/O.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with
+any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+Direct Writes
+-------------
+
+A direct I/O write initiates a write I/O to the storage device from the
+caller's buffer.
+Dirty parts of the pagecache are flushed to storage before initiating
+the write I/O.
+The pagecache is invalidated both before and after the write I/O.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT |
+IOMAP_WRITE`` with any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+ * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
+ blocks is not allowed.
+ The entire file range must map to a single written or unwritten
+ extent.
+ The file I/O range must be aligned to the filesystem block size
+ if the mapping is unwritten and the filesystem cannot handle zeroing
+ the unaligned regions without exposing stale contents.
+
+ * ``IOMAP_ATOMIC``: This write is being issued with torn-write
+ protection.
+ Torn-write protection may be provided based on HW-offload or by a
+ software mechanism provided by the filesystem.
+
+ For HW-offload based support, only a single bio can be created for the
+ write, and the write must not be split into multiple I/O requests, i.e.
+ flag REQ_ATOMIC must be set.
+ The file range to write must be aligned to satisfy the requirements
+ of both the filesystem and the underlying block device's atomic
+ commit capabilities.
+ If filesystem metadata updates are required (e.g. unwritten extent
+ conversion or copy-on-write), all updates for the entire file range
+ must be committed atomically as well.
+ Untorn-writes may be longer than a single file block. In all cases,
+ the mapping start disk block must have at least the same alignment as
+ the write offset.
+   The filesystem must set ``IOMAP_F_ATOMIC_BIO`` to inform the iomap
+   core of an untorn-write based on HW-offload.
+
+ For untorn-writes based on a software mechanism provided by the
+ filesystem, all the disk block alignment and single bio restrictions
+ which apply for HW-offload based untorn-writes do not apply.
+ The mechanism would typically be used as a fallback for when
+ HW-offload based untorn-writes may not be issued, e.g. the range of the
+ write covers multiple extents, meaning that it is not possible to issue
+ a single bio.
+ All filesystem metadata updates for the entire file range must be
+ committed atomically as well.
+
+Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
+calling this function.
+
+``struct iomap_dio_ops``
+-------------------------
+
+.. code-block:: c
+
+ struct iomap_dio_ops {
+ void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
+ loff_t file_offset);
+ int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
+ unsigned flags);
+ struct bio_set *bio_set;
+ };
+
+The fields of this structure are as follows:
+
+ - ``submit_io``: iomap calls this function when it has constructed a
+ ``struct bio`` object for the I/O requested, and wishes to submit it
+ to the block device.
+ If no function is provided, ``submit_bio`` will be called directly.
+   Filesystems that would like to perform additional work before
+   submission (e.g. data replication for btrfs) should implement this
+   function.
+
+ - ``end_io``: This is called after the ``struct bio`` completes.
+ This function should perform post-write conversions of unwritten
+ extent mappings, handle write failures, etc.
+ The ``flags`` argument may be set to a combination of the following:
+
+ * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend
+ should mark the extent as written.
+
+ * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a
+ copy on write operation, so the ioend should switch mappings.
+
+ - ``bio_set``: This allows the filesystem to provide a custom bio_set
+ for allocating direct I/O bios.
+ This enables filesystems to `stash additional per-bio information
+ <https://lore.kernel.org/all/20220505201115.937837-3-hch@lst.de/>`_
+ for private use.
+ If this field is NULL, generic ``struct bio`` objects will be used.
+
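+As an example of the ``->end_io`` hook, a filesystem might convert
+unwritten extents after a successful write; this is a sketch, and
+``dummyfs_convert_unwritten`` is an invented helper:
+
+.. code-block:: c
+
+ static int dummyfs_dio_end_io(struct kiocb *iocb, ssize_t size,
+                               int error, unsigned flags)
+ {
+     if (error)
+         return error;
+     if (flags & IOMAP_DIO_UNWRITTEN)
+         return dummyfs_convert_unwritten(file_inode(iocb->ki_filp),
+                                          iocb->ki_pos, size);
+     return 0;
+ }
+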
+Filesystems that want to perform extra work after an I/O completion
+should set a custom ``->bi_end_io`` function via ``->submit_io``.
+Afterwards, the custom endio function must call
+``iomap_dio_bio_end_io`` to finish the direct I/O.
+
+DAX I/O
+=======
+
+Some storage devices can be directly mapped as memory.
+These devices support a new access mode known as "fsdax" that allows
+loads and stores through the CPU and memory controller.
+
+fsdax Reads
+-----------
+
+A fsdax read performs a memcpy from storage device to the caller's
+buffer.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any
+combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+fsdax Writes
+------------
+
+A fsdax write initiates a memcpy to the storage device from the caller's
+buffer.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX |
+IOMAP_WRITE`` with any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+ * ``IOMAP_OVERWRITE_ONLY``: The caller requires a pure overwrite to be
+ performed from this mapping.
+ This requires the filesystem extent mapping to already exist as an
+ ``IOMAP_MAPPED`` type and span the entire range of the write I/O
+ request.
+ If the filesystem cannot map this request in a way that allows the
+ iomap infrastructure to perform a pure overwrite, it must fail the
+ mapping operation with ``-EAGAIN``.
+
+Callers commonly hold ``i_rwsem`` in exclusive mode before calling this
+function.
+
+fsdax mmap Faults
+~~~~~~~~~~~~~~~~~
+
+The ``dax_iomap_fault`` function handles read and write faults to fsdax
+storage.
+For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the
+``flags`` argument to ``->iomap_begin``.
+For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be
+passed as the ``flags`` argument to ``->iomap_begin``.
+
+Callers commonly hold the same locks as they do to call their iomap
+pagecache counterparts.
+
+fsdax Truncation, fallocate, and Unsharing
+------------------------------------------
+
+For fsdax files, the following functions are provided to replace their
+iomap pagecache I/O counterparts.
+The ``flags`` argument to ``->iomap_begin`` is the same as for the
+pagecache counterparts, with ``IOMAP_DAX`` added.
+
+ * ``dax_file_unshare``
+ * ``dax_zero_range``
+ * ``dax_truncate_page``
+
+Callers commonly hold the same locks as they do to call their iomap
+pagecache counterparts.
+
+fsdax Deduplication
+-------------------
+
+Filesystems implementing the ``FIDEDUPERANGE`` ioctl must call the
+``dax_remap_file_range_prep`` function with their own iomap read ops.
+
+Seeking Files
+=============
+
+iomap implements the two iterating whence modes of the ``llseek`` system
+call.
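+
+Filesystems typically dispatch to these from their ``->llseek`` method,
+as in this sketch (locking elided; see the notes below):
+
+.. code-block:: c
+
+ static loff_t dummyfs_llseek(struct file *file, loff_t offset, int whence)
+ {
+     struct inode *inode = file_inode(file);
+
+     switch (whence) {
+     case SEEK_HOLE:
+         return iomap_seek_hole(inode, offset, &dummyfs_iomap_ops);
+     case SEEK_DATA:
+         return iomap_seek_data(inode, offset, &dummyfs_iomap_ops);
+     }
+     return generic_file_llseek(file, offset, whence);
+ }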
+
+SEEK_DATA
+---------
+
+The ``iomap_seek_data`` function implements the SEEK_DATA "whence" value
+for llseek.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+
+For unwritten mappings, the pagecache will be searched.
+Regions of the pagecache with a folio mapped and uptodate fsblocks
+within those folios will be reported as data areas.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+SEEK_HOLE
+---------
+
+The ``iomap_seek_hole`` function implements the SEEK_HOLE "whence" value
+for llseek.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+
+For unwritten mappings, the pagecache will be searched.
+Regions of the pagecache with no folio mapped, or a !uptodate fsblock
+within a folio will be reported as sparse hole areas.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+Swap File Activation
+====================
+
+The ``iomap_swapfile_activate`` function finds all the base-page aligned
+regions in a file and sets them up as swap space.
+The file will be ``fsync()``'d before activation.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+All mappings must be mapped or unwritten; they cannot be dirty or
+shared, and cannot span multiple block devices.
+Callers must hold ``i_rwsem`` in exclusive mode; this is already
+provided by ``swapon``.
+
+File Space Mapping Reporting
+============================
+
+iomap implements two of the file space mapping system calls.
+
+FS_IOC_FIEMAP
+-------------
+
+The ``iomap_fiemap`` function exports file extent mappings to userspace
+in the format specified by the ``FS_IOC_FIEMAP`` ioctl.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+FIBMAP (deprecated)
+-------------------
+
+``iomap_bmap`` implements FIBMAP.
+The calling conventions are the same as for FIEMAP.
+This function is only provided to maintain compatibility for filesystems
+that implemented FIBMAP prior to conversion.
+This ioctl is deprecated; do **not** add a FIBMAP implementation to
+filesystems that do not have it.
+Callers should probably hold ``i_rwsem`` in shared mode before calling
+this function, but this is unclear.
diff --git a/Documentation/filesystems/iomap/porting.rst b/Documentation/filesystems/iomap/porting.rst
new file mode 100644
index 000000000000..3d49a32c0fff
--- /dev/null
+++ b/Documentation/filesystems/iomap/porting.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_porting:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+=======================
+Porting Your Filesystem
+=======================
+
+.. contents:: Table of Contents
+ :local:
+
+Why Convert?
+============
+
+There are several reasons to convert a filesystem to iomap:
+
+ 1. The classic Linux I/O path is not terribly efficient.
+ Pagecache operations lock a single base page at a time and then call
+ into the filesystem to return a mapping for only that page.
+ Direct I/O operations build I/O requests a single file block at a
+ time.
+ This worked well enough for direct/indirect-mapped filesystems such
+ as ext2, but is very inefficient for extent-based filesystems such
+ as XFS.
+
+ 2. Large folios are only supported via iomap; there are no plans to
+ convert the old buffer_head path to use them.
+
+ 3. Direct access to storage on memory-like devices (fsdax) is only
+ supported via iomap.
+
+ 4. Lower maintenance overhead for individual filesystem maintainers.
+ iomap handles common pagecache related operations itself, such as
+ allocating, instantiating, locking, and unlocking of folios.
+ No ->write_begin(), ->write_end() or direct_IO
+ address_space_operations are required to be implemented by
+    filesystems using iomap.
+
+How Do I Convert a Filesystem?
+==============================
+
+First, add ``#include <linux/iomap.h>`` to your source code and add
+``select FS_IOMAP`` to your filesystem's Kconfig option.
+Build the kernel, run fstests with the ``-g all`` option across a wide
+variety of your filesystem's supported configurations to build a
+baseline of which tests pass and which ones fail.
+
+The recommended approach is first to implement ``->iomap_begin`` (and
+``->iomap_end`` if necessary) to allow iomap to obtain a read-only
+mapping of a file range.
+In most cases, this is a relatively trivial conversion of the existing
+``get_block()`` function for read-only mappings.
+``FS_IOC_FIEMAP`` is a good first target because it is trivial to
+implement support for it and then to determine that the extent map
+iteration is correct from userspace.
+If FIEMAP is returning the correct information, it's a good sign that
+other read-only mapping operations will do the right thing.
+
+Next, modify the filesystem's ``get_block(create = false)``
+implementation to use the new ``->iomap_begin`` implementation to map
+file space for selected read operations.
+Hide behind a debugging knob the ability to switch on the iomap mapping
+functions for selected call paths.
+It is necessary to write some code to fill out the bufferhead-based
+mapping information from the ``iomap`` structure, but the new functions
+can be tested without needing to implement any iomap APIs.
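+
+That shim might look something like the sketch below, modelled loosely
+on the ``iomap_to_bh`` helper in fs/buffer.c; the names are invented
+and error handling is omitted:
+
+.. code-block:: c
+
+ static void dummyfs_map_bh(struct inode *inode, struct buffer_head *bh,
+                            loff_t pos, const struct iomap *iomap)
+ {
+     bh->b_bdev = iomap->bdev;
+
+     if (iomap->type == IOMAP_MAPPED) {
+         bh->b_blocknr = (iomap->addr + pos - iomap->offset) >>
+                         inode->i_blkbits;
+         set_buffer_mapped(bh);
+     }
+     if (iomap->flags & IOMAP_F_NEW)
+         set_buffer_new(bh);
+ }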
+
+Once the read-only functions are working like this, convert each high
+level file operation one by one to use iomap native APIs instead of
+going through ``get_block()``.
+Done one at a time, regressions should be self-evident.
+You *do* have a regression test baseline for fstests, right?
+It is suggested to convert swap file activation, ``SEEK_DATA``, and
+``SEEK_HOLE`` before tackling the I/O paths.
+A likely complexity at this point will be converting the buffered read
+I/O path because of bufferheads.
+The buffered read I/O path doesn't need to be converted yet, though the
+direct I/O read path should be converted in this phase.
+
+At this point, you should look over your ``->iomap_begin`` function.
+If it switches between large blocks of code based on dispatching of the
+``flags`` argument, you should consider breaking it up into
+per-operation iomap ops with smaller, more cohesive functions.
+XFS is a good example of this.
+
+The next thing to do is implement ``get_block(create = true)``
+functionality in the ``->iomap_begin``/``->iomap_end`` methods.
+It is strongly recommended to create separate mapping functions and
+iomap ops for write operations.
+Then convert the direct I/O write path to iomap, and start running fsx
+with direct I/O enabled in earnest on the filesystem.
+This will flush out lots of data integrity corner case bugs that the new
+write mapping implementation introduces.
+
+Now, convert any remaining file operations to call the iomap functions.
+This will get the entire filesystem using the new mapping functions, and
+they should largely be debugged and working correctly after this step.
+
+Most likely at this point, the buffered read and write paths will still
+need to be converted.
+The mapping functions should all work correctly, so all that needs to be
+done is rewriting all the code that interfaces with bufferheads to
+interface with iomap and folios.
+It is much easier first to get regular file I/O (without any fancy
+features like fscrypt, fsverity, compression, or data=journaling)
+converted to use iomap.
+Some of those fancy features (fscrypt and compression) aren't
+implemented yet in iomap.
+For unjournalled filesystems that use the pagecache for symbolic links
+and directories, you might also try converting their handling to iomap.
+
+The rest is left as an exercise for the reader, as it will be different
+for every filesystem.
+If you encounter problems, email the people and lists in
+``get_maintainers.pl`` for help.
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index e18f90ffc6fd..863e93e623f7 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -111,9 +111,7 @@ a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer for calling the callback by simply setting
``journal->j_commit_callback`` function pointer and that function is
-called after each transaction commit. You can also use
-``transaction->t_private_list`` for attaching entries to a transaction
-that need processing when the transaction commits.
+called after each transaction commit.
JBD2 also provides a way to block all transaction updates via
jbd2_journal_lock_updates() /
@@ -137,7 +135,7 @@ Fast commits
JBD2 to also allows you to perform file-system specific delta commits known as
fast commits. In order to use fast commits, you will need to set following
-callbacks that perform correspodning work:
+callbacks that perform corresponding work:
`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
fast commit.
@@ -149,7 +147,7 @@ File system is free to perform fast commits as and when it wants as long as it
gets permission from JBD2 to do so by calling the function
:c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client
file system should tell JBD2 about it by calling
-:c:func:`jbd2_fc_end_commit()`. If file system wants JBD2 to perform a full
+:c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform a full
commit immediately after stopping the fast commit it can do so by calling
:c:func:`jbd2_fc_end_commit_fallback()`. This is useful if fast commit operation
fails for some reason and the only way to guarantee consistency is for JBD2 to
@@ -199,7 +197,7 @@ Journal Level
.. kernel-doc:: fs/jbd2/recovery.c
:internal:
-Transasction Level
+Transaction Level
~~~~~~~~~~~~~~~~~~
.. kernel-doc:: fs/jbd2/transaction.c
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index ed148919e11a..77704fde9845 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -17,7 +17,8 @@ dentry_operations
prototypes::
- int (*d_revalidate)(struct dentry *, unsigned int);
+ int (*d_revalidate)(struct inode *, const struct qstr *,
+ struct dentry *, unsigned int);
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *);
int (*d_compare)(const struct dentry *,
@@ -29,7 +30,9 @@ prototypes::
char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
struct vfsmount *(*d_automount)(struct path *path);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *);
+ struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
+ bool (*d_unalias_trylock)(const struct dentry *);
+ void (*d_unalias_unlock)(const struct dentry *);
locking rules:
@@ -49,6 +52,8 @@ d_dname: no no no no
d_automount: no no yes no
d_manage: no no yes (ref-walk) maybe
d_real no no yes no
+d_unalias_trylock yes no no no
+d_unalias_unlock yes no no no
================== =========== ======== ============== ========
inode_operations
@@ -61,7 +66,7 @@ prototypes::
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
- int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
+ struct dentry *(*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
@@ -82,16 +87,17 @@ prototypes::
int (*tmpfile) (struct mnt_idmap *, struct inode *,
struct file *, umode_t);
int (*fileattr_set)(struct mnt_idmap *idmap,
- struct dentry *dentry, struct fileattr *fa);
- int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+ struct dentry *dentry, struct file_kattr *fa);
+ int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa);
struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
+ struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
locking rules:
all may block
-============== =============================================
+============== ==================================================
ops i_rwsem(inode)
-============== =============================================
+============== ==================================================
lookup: shared
create: exclusive
link: exclusive (both)
@@ -100,7 +106,7 @@ symlink: exclusive
mkdir: exclusive
unlink: exclusive (both)
rmdir: exclusive (both)(see below)
-rename: exclusive (all) (see below)
+rename: exclusive (both parents, some children) (see below)
readlink: no
get_link: no
setattr: exclusive
@@ -115,12 +121,16 @@ atomic_open: shared (exclusive if O_CREAT is set in open flags)
tmpfile: no
fileattr_get: no or exclusive
fileattr_set: exclusive
-============== =============================================
+get_offset_ctx no
+============== ==================================================
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
exclusive on victim.
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
+ ->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories
+ involved.
+ ->rename() has ->i_rwsem exclusive on any subdirectory that changes parent.
See Documentation/filesystems/directory-locking.rst for more detailed discussion
of the locking scheme for directory operations.
@@ -239,17 +249,16 @@ address_space_operations
========================
prototypes::
- int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*read_folio)(struct file *, struct folio *);
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *folio);
void (*readahead)(struct readahead_control *);
- int (*write_begin)(struct file *, struct address_space *mapping,
+ int (*write_begin)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len,
- struct page **pagep, void **fsdata);
- int (*write_end)(struct file *, struct address_space *mapping,
+ struct folio **foliop, void **fsdata);
+ int (*write_end)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata);
+ struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio) (struct folio *, size_t start, size_t len);
bool (*release_folio)(struct folio *, gfp_t);
@@ -259,7 +268,7 @@ prototypes::
struct folio *src, enum migrate_mode);
int (*launder_folio)(struct folio *);
bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
- int (*error_remove_page)(struct address_space *, struct page *);
+ int (*error_remove_folio)(struct address_space *, struct folio *);
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
@@ -270,12 +279,11 @@ locking rules:
====================== ======================== ========= ===============
ops folio locked i_rwsem invalidate_lock
====================== ======================== ========= ===============
-writepage: yes, unlocks (see below)
read_folio: yes, unlocks shared
writepages:
dirty_folio: maybe
readahead: yes, unlocks shared
-write_begin: locks the page exclusive
+write_begin: locks the folio exclusive
write_end: yes, unlocks exclusive
bmap:
invalidate_folio: yes exclusive
@@ -285,7 +293,7 @@ direct_IO:
migrate_folio: yes (both)
launder_folio: yes
is_partially_uptodate: yes
-error_remove_page: yes
+error_remove_folio: yes
swap_activate: no
swap_deactivate: no
swap_rw: yes, unlocks
@@ -299,54 +307,6 @@ completion.
->readahead() unlocks the folios that I/O is attempted on like ->read_folio().
-->writepage() is used for two purposes: for "memory cleansing" and for
-"sync". These are quite different operations and the behaviour may differ
-depending upon the mode.
-
-If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then
-it *must* start I/O against the page, even if that would involve
-blocking on in-progress I/O.
-
-If writepage is called for memory cleansing (sync_mode ==
-WBC_SYNC_NONE) then its role is to get as much writeout underway as
-possible. So writepage should try to avoid blocking against
-currently-in-progress I/O.
-
-If the filesystem is not called for "sync" and it determines that it
-would need to block against in-progress I/O to be able to start new I/O
-against the page the filesystem should redirty the page with
-redirty_page_for_writepage(), then unlock the page and return zero.
-This may also be done to avoid internal deadlocks, but rarely.
-
-If the filesystem is called for sync then it must wait on any
-in-progress I/O and then start new I/O.
-
-The filesystem should unlock the page synchronously, before returning to the
-caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
-value. WRITEPAGE_ACTIVATE means that page cannot really be written out
-currently, and VM should stop calling ->writepage() on this page for some
-time. VM does this by moving page to the head of the active list, hence the
-name.
-
-Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
-and return zero, writepage *must* run set_page_writeback() against the page,
-followed by unlocking it. Once set_page_writeback() has been run against the
-page, write I/O can be submitted and the write I/O completion handler must run
-end_page_writeback() once the I/O is complete. If no I/O is submitted, the
-filesystem must run end_page_writeback() against the page before returning from
-writepage.
-
-That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
-if the filesystem needs the page to be locked during writeout, that is ok, too,
-the page is allowed to be unlocked at any point in time between the calls to
-set_page_writeback() and end_page_writeback().
-
-Note, failure to run either redirty_page_for_writepage() or the combination of
-set_page_writeback()/end_page_writeback() on a page submitted to writepage
-will leave the page itself marked clean but it will be tagged as dirty in the
-radix tree. This incoherency can lead to all sorts of hard-to-debug problems
-in the filesystem like having dirty inodes at umount and losing written data.
-
->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least
``*nr_to_write`` pages. ``*nr_to_write`` must be decremented for each page
@@ -354,8 +314,8 @@ which is written. The address_space implementation may write more (or less)
pages than ``*nr_to_write`` asks for, but it should try to be reasonably close.
If nr_to_write is NULL, all dirty pages must be written.
-writepages should _only_ write pages which are present on
-mapping->io_pages.
+writepages should _only_ write pages which are present in
+mapping->i_pages.
->dirty_folio() is called from various places in the kernel when
the target folio is marked as needing writeback. The folio cannot be
@@ -374,10 +334,17 @@ invalidate_lock before invalidating page cache in truncate / hole punch
path (and thus calling into ->invalidate_folio) to block races between page
cache invalidation and page cache filling functions (fault, read, ...).
-->release_folio() is called when the kernel is about to try to drop the
-buffers from the folio in preparation for freeing it. It returns false to
-indicate that the buffers are (or may be) freeable. If ->release_folio is
-NULL, the kernel assumes that the fs has no private interest in the buffers.
+->release_folio() is called when the MM wants to make a change to the
+folio that would invalidate the filesystem's private data. For example,
+it may be about to be removed from the address_space or split. The folio
+is locked and not under writeback. It may be dirty. The gfp parameter
+is not usually used for allocation, but rather to indicate what the
+filesystem may do to attempt to free the private data. The filesystem may
+return false to indicate that the folio's private data cannot be freed.
+If it returns true, it should have already removed the private data from
+the folio. If a filesystem does not provide a ->release_folio method,
+the pagecache will assume that private data is buffer_heads and call
+try_to_free_buffers().
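+
+A hedged sketch for a filesystem that attaches private data to folios (the
+``myfs_detach_private()`` helper is hypothetical)::
+
+        static bool myfs_release_folio(struct folio *folio, gfp_t gfp)
+        {
+                if (folio_test_private(folio)) {
+                        /* Conservatively refuse if we're not allowed to block. */
+                        if (!(gfp & __GFP_DIRECT_RECLAIM))
+                                return false;
+                        myfs_detach_private(folio);
+                }
+                return true;
+        }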
->free_folio() is called when the kernel has dropped the folio
from the page cache.
@@ -476,7 +443,7 @@ prototypes::
int (*direct_access) (struct block_device *, sector_t, void **,
unsigned long *);
void (*unlock_native_capacity) (struct gendisk *);
- int (*getgeo)(struct block_device *, struct hd_geometry *);
+ int (*getgeo)(struct gendisk *, struct hd_geometry *);
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
locking rules:
@@ -509,7 +476,6 @@ prototypes::
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iopoll) (struct kiocb *kiocb, bool spin);
- int (*iterate) (struct file *, struct dir_context *);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -551,9 +517,8 @@ mutex or just to use i_size_read() instead.
Note: this does not protect the file->f_pos against concurrent modifications
since this is something userspace has to take care of.
-->iterate() is called with i_rwsem exclusive.
-
-->iterate_shared() is called with i_rwsem at least shared.
+->iterate_shared() is called with i_rwsem held for reading, and with the
+file f_pos_lock held exclusively.
->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
Most instances call fasync_helper(), which does that maintenance, so it's
@@ -628,26 +593,29 @@ vm_operations_struct
prototypes::
- void (*open)(struct vm_area_struct*);
- void (*close)(struct vm_area_struct*);
- vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
+ void (*open)(struct vm_area_struct *);
+ void (*close)(struct vm_area_struct *);
+ vm_fault_t (*fault)(struct vm_fault *);
+ vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order);
+ vm_fault_t (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end);
vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
locking rules:
-============= ========= ===========================
+============= ========== ===========================
ops mmap_lock PageLocked(page)
-============= ========= ===========================
-open: yes
-close: yes
-fault: yes can return with page locked
-map_pages: read
-page_mkwrite: yes can return with page locked
-pfn_mkwrite: yes
-access: yes
-============= ========= ===========================
+============= ========== ===========================
+open: write
+close: read/write
+fault: read can return with page locked
+huge_fault: maybe-read
+map_pages: maybe-read
+page_mkwrite: read can return with page locked
+pfn_mkwrite: read
+access: read
+============= ========== ===========================
->fault() is called when a previously not present pte is about to be faulted
in. The filesystem must find and return the page associated with the passed in
@@ -657,11 +625,18 @@ then ensure the page is not already truncated (invalidate_lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.
+->huge_fault() is called when there is no PUD or PMD entry present. This
+gives the filesystem the opportunity to install a PUD or PMD sized page.
+Filesystems can also use the ->fault method to return a PMD sized page,
+so implementing this function may not be necessary. In particular,
+filesystems should not call filemap_fault() from ->huge_fault().
+The mmap_lock may not be held when this method is called.
+
->map_pages() is called when the VM asks to map easily accessible pages.
The filesystem should find and map pages associated with offsets from "start_pgoff"
to "end_pgoff". ->map_pages() is called with the RCU lock held and must
not block. If it's not possible to reach a page without blocking,
-filesystem should skip it. Filesystem should use do_set_pte() to setup
+filesystem should skip it. Filesystem should use set_pte_range() to setup
page table entry. Pointer to entry associated with the page is passed in
"pte" field in vm_fault structure. Pointers to entries for other offsets
should be calculated relative to "pte".
diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst
index 9aaf6ef75eb5..c99ab1f7fea4 100644
--- a/Documentation/filesystems/mount_api.rst
+++ b/Documentation/filesystems/mount_api.rst
@@ -506,8 +506,16 @@ returned.
* ::
+ int vfs_parse_fs_qstr(struct fs_context *fc, const char *key,
+ const struct qstr *value);
+
+ A wrapper around vfs_parse_fs_param() that copies the value string it is
+ passed.
+
+ * ::
+
int vfs_parse_fs_string(struct fs_context *fc, const char *key,
- const char *value, size_t v_size);
+ const char *value);
A wrapper around vfs_parse_fs_param() that copies the value string it is
passed.
@@ -645,6 +653,8 @@ The members are as follows:
fs_param_is_blockdev Blockdev path * Needs lookup
fs_param_is_path Path * Needs lookup
fs_param_is_fd File descriptor result->int_32
+ fs_param_is_uid User ID (u32) result->uid
+ fs_param_is_gid Group ID (u32) result->gid
======================= ======================= =====================
Note that if the value is of fs_param_is_bool type, fs_parse() will try
@@ -669,7 +679,6 @@ The members are as follows:
fsparam_bool() fs_param_is_bool
fsparam_u32() fs_param_is_u32
fsparam_u32oct() fs_param_is_u32_octal
- fsparam_u32hex() fs_param_is_u32_hex
fsparam_s32() fs_param_is_s32
fsparam_u64() fs_param_is_u64
fsparam_enum() fs_param_is_enum
@@ -678,6 +687,8 @@ The members are as follows:
fsparam_bdev() fs_param_is_blockdev
fsparam_path() fs_param_is_path
fsparam_fd() fs_param_is_fd
+ fsparam_uid() fs_param_is_uid
+ fsparam_gid() fs_param_is_gid
======================= ===============================================
all of which take two arguments, name string and option number - for
@@ -751,22 +762,8 @@ process the parameters it is given.
* ::
- bool validate_constant_table(const struct constant_table *tbl,
- size_t tbl_size,
- int low, int high, int special);
-
- Validate a constant table. Checks that all the elements are appropriately
- ordered, that there are no duplicates and that the values are between low
- and high inclusive, though provision is made for one allowable special
- value outside of that range. If no special value is required, special
- should just be set to lie inside the low-to-high range.
-
- If all is good, true is returned. If the table is invalid, errors are
- logged to the kernel log buffer and false is returned.
-
- * ::
-
- bool fs_validate_description(const struct fs_parameter_description *desc);
+ bool fs_validate_description(const char *name,
+ const struct fs_parameter_description *desc);
This performs some validation checks on a parameter description. It
returns true if the description is good and false if it is not. It will
@@ -784,8 +781,9 @@ process the parameters it is given.
option number (which it returns).
If successful, and if the parameter type indicates the result is a
- boolean, integer or enum type, the value is converted by this function and
- the result stored in result->{boolean,int_32,uint_32,uint_64}.
+ boolean, integer, enum, uid, or gid type, the value is converted by this
+ function and the result stored in
+ result->{boolean,int_32,uint_32,uint_64,uid,gid}.
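+
+  As a hedged illustration (the ``myfs_``/``Opt_`` names are hypothetical, not
+  part of the API), a filesystem might declare and parse such parameters as
+  follows::
+
+        static const struct fs_parameter_spec myfs_param_specs[] = {
+                fsparam_u32("blocksize", Opt_blocksize),
+                fsparam_uid("uid", Opt_uid),
+                fsparam_gid("gid", Opt_gid),
+                {}
+        };
+
+        static int myfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
+        {
+                struct myfs_fs_context *ctx = fc->fs_private;
+                struct fs_parse_result result;
+                int opt;
+
+                opt = fs_parse(fc, myfs_param_specs, param, &result);
+                if (opt < 0)
+                        return opt;
+
+                switch (opt) {
+                case Opt_blocksize:
+                        ctx->blocksize = result.uint_32;
+                        break;
+                case Opt_uid:
+                        ctx->uid = result.uid;  /* already converted to a kuid_t */
+                        break;
+                case Opt_gid:
+                        ctx->gid = result.gid;  /* already converted to a kgid_t */
+                        break;
+                }
+                return 0;
+        }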
If a match isn't initially made, the key is prefixed with "no" and no
value is present then an attempt will be made to look up the key with the
diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index 000000000000..c779e47284e8
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,125 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp inodes.
+This value is updated every jiffy, so any change that happens within that jiffy
+will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to substitute
+the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+ The inode change time. This is stamped with the current time whenever
+ the inode's metadata is changed. Note that this value is not settable
+ from userland.
+
+mtime:
+ The inode modification time. This is stamped with the current time
+ any time a file's contents change.
+
+atime:
+ The inode access time. This is stamped whenever an inode's contents are
+ read. Widely considered to be a terrible mistake. Usually avoided with
+ options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to just providing information about changes to individual files,
+file timestamps also serve an important purpose in applications like "make".
+These programs compare timestamps in order to determine whether source files
+might be newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Note that the above assumes that the system doesn't experience a backward jump
+of the realtime clock. If that occurs at an inopportune time, then timestamps
+can appear to go backward, even on a properly functioning system.
+
+Multigrain Timestamp Implementation
+===================================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
+or ctime has been queried. If either or both have, then the kernel takes
+special care to ensure the next timestamp update will display a visible change.
+This ensures tight cache coherency for use-cases like NFS, without sacrificing
+the benefits of reduced metadata updates when files aren't being watched.
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine- or coarse-grained timestamps based on
+whether the mtime or ctime has been queried. A file could get a fine-grained
+timestamp, and then a second file modified later could get a coarse-grained one
+that appears earlier than the first, which would break the kernel's timestamp
+ordering guarantees.
+
+To mitigate this problem, the kernel maintains a global floor value that
+ensures this can't happen.
+modified at the same time in such a case, but they will never show the reverse
+order. To avoid problems with realtime clock jumps, the floor is managed as a
+monotonic ktime_t, and the values are converted to realtime clock values as
+needed.
+
+Implementation Notes
+====================
+Multigrain timestamps are intended for use by local filesystems that get
+ctime values from the local clock. This is in contrast to network filesystems
+and the like that just mirror timestamp values from a server.
+
+For most filesystems, it's sufficient to just set the FS_MGTIME flag in the
+fstype->fs_flags in order to opt in, provided the ctime is only ever set via
+inode_set_ctime_current(). If the filesystem has a ->getattr routine that
+doesn't call generic_fillattr(), then it should call fill_mg_cmtime() to
+fill in those values. For setattr, it should use setattr_copy() to update the
+timestamps, or otherwise mimic its behavior.
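+
+As a brief sketch (the "myfs" names are hypothetical), the opt-in looks like::
+
+        static struct file_system_type myfs_fs_type = {
+                .owner           = THIS_MODULE,
+                .name            = "myfs",
+                .init_fs_context = myfs_init_fs_context,
+                .kill_sb         = kill_block_super,
+                .fs_flags        = FS_REQUIRES_DEV | FS_MGTIME,
+        };
+
+With this flag set, inode_set_ctime_current() may hand out a fine-grained
+timestamp when the existing ctime or mtime has been queried and the current
+coarse-grained time would not otherwise show a change.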
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 73a4176144b3..ddd799df6ce3 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -1,33 +1,187 @@
.. SPDX-License-Identifier: GPL-2.0
-=================================
-Network Filesystem Helper Library
-=================================
+===================================
+Network Filesystem Services Library
+===================================
.. Contents:
- Overview.
+ - Requests and streams.
+ - Subrequests.
+ - Result collection and retry.
+ - Local caching.
+ - Content encryption (fscrypt).
- Per-inode context.
- Inode context helper functions.
- - Buffered read helpers.
- - Read helper functions.
- - Read helper structures.
- - Read helper operations.
- - Read helper procedure.
- - Read helper cache API.
+ - Inode locking.
+ - Inode writeback.
+ - High-level VFS API.
+ - Unlocked read/write iter.
+ - Pre-locked read/write iter.
+ - Monolithic files API.
+ - Memory-mapped I/O API.
+ - High-level VM API.
+ - Deprecated PG_private2 API.
+ - I/O request API.
+ - Request structure.
+ - Stream structure.
+ - Subrequest structure.
+ - Filesystem methods.
+ - Terminating a subrequest.
+ - Local cache API.
+ - API function reference.
Overview
========
-The network filesystem helper library is a set of functions designed to aid a
-network filesystem in implementing VM/VFS operations. For the moment, that
-just includes turning various VM buffered read operations into requests to read
-from the server. The helper library, however, can also interpose other
-services, such as local caching or local data encryption.
+The network filesystem services library, netfslib, is a set of functions
+designed to aid a network filesystem in implementing VM/VFS API operations. It
+takes over the normal buffered read, readahead, write and writeback and also
+handles unbuffered and direct I/O.
-Note that the library module doesn't link against local caching directly, so
-access must be provided by the netfs.
+The library provides support for (re-)negotiation of I/O sizes and retrying
+failed I/O as well as local caching and will, in the future, provide content
+encryption.
+
+It insulates the filesystem from VM interface changes as much as possible and
+handles VM features such as large multipage folios. The filesystem basically
+just has to provide a way to perform read and write RPC calls.
+
+The way I/O is organised inside netfslib consists of a number of objects:
+
+ * A *request*. A request is used to track the progress of the I/O overall and
+ to hold on to resources. The collection of results is done at the request
+ level. The I/O within a request is divided into a number of parallel
+ streams of subrequests.
+
+ * A *stream*. A non-overlapping series of subrequests. The subrequests
+ within a stream do not have to be contiguous.
+
+ * A *subrequest*. This is the basic unit of I/O. It represents a single RPC
+ call or a single cache I/O operation. The library passes these to the
+ filesystem and the cache to perform.
+
+Requests and Streams
+--------------------
+
+When actually performing I/O (as opposed to just copying into the pagecache),
+netfslib will create one or more requests to track the progress of the I/O and
+to hold resources.
+
+A read operation will have a single stream and the subrequests within that
+stream may be of mixed origins, for instance mixing RPC subrequests and cache
+subrequests.
+
+On the other hand, a write operation may have multiple streams, where each
+stream targets a different destination. For instance, there may be one stream
+writing to the local cache and one to the server. Currently, only two streams
+are allowed, but this could be increased if parallel writes to multiple servers
+are desired.
+
+The subrequests within a write stream do not need to match alignment or size
+with the subrequests in another write stream and netfslib performs the tiling
+of subrequests in each stream over the source buffer independently. Further,
+each stream may contain holes that don't correspond to holes in the other
+stream.
+
+In addition, the subrequests do not need to correspond to the boundaries of the
+folios or vectors in the source/destination buffer. The library handles the
+collection of results and the wrangling of folio flags and references.
+
+Subrequests
+-----------
+
+Subrequests are at the heart of the interaction between netfslib and the
+filesystem using it. Each subrequest is expected to correspond to a single
+read or write RPC or cache operation. The library will stitch together the
+results from a set of subrequests to provide a higher level operation.
+
+Netfslib has two interactions with the filesystem or the cache when setting up
+a subrequest. First, there's an optional preparatory step that allows the
+filesystem to negotiate the limits on the subrequest, both in terms of maximum
+number of bytes and maximum number of vectors (e.g. for RDMA). This may
+involve negotiating with the server (e.g. cifs needing to acquire credits).
+
+And, secondly, there's the issuing step in which the subrequest is handed off
+to the filesystem to perform.
+
+Note that these two steps are done slightly differently between read and write:
+
+ * For reads, the VM/VFS tells us how much is being requested up front, so the
+ library can preset maximum values that the cache and then the filesystem can
+ reduce. The cache also gets consulted before the filesystem on whether it
+ wants to do a read.
+
+ * For writeback, it is unknown how much there will be to write until the
+ pagecache is walked, so no limit is set by the library.
+
+Once a subrequest is completed, the filesystem or cache informs the library of
+the completion and then collection is invoked. Depending on whether the
+request is synchronous or asynchronous, the collection of results will be done
+in either the application thread or in a work queue.
+
+Result Collection and Retry
+---------------------------
+
+As subrequests complete, the results are collected and collated by the library
+and folio unlocking is performed progressively (if appropriate). Once the
+request is complete, async completion will be invoked (again, if appropriate).
+It is possible for the filesystem to provide interim progress reports to the
+library so that folio unlocking can happen earlier.
+
+If any subrequests fail, netfslib can retry them. It will wait until all
+subrequests are completed, offer the filesystem the opportunity to fiddle with
+the resources/state held by the request and poke at the subrequests before
+re-preparing and re-issuing the subrequests.
+
+This allows the tiling of contiguous sets of failed subrequests within a stream
+to be changed, adding more subrequests or ditching excess as necessary (for
+instance, if the network sizes change or the server decides it wants smaller
+chunks).
+
+Further, if one or more contiguous cache-read subrequests fail, the library
+will pass them to the filesystem to perform instead, renegotiating and retiling
+them as necessary to fit with the filesystem's parameters rather than those of
+the cache.
+
+Local Caching
+-------------
+
+One of the services netfslib provides, via ``fscache``, is the option to cache
+on local disk a copy of the data obtained from/written to a network filesystem.
+The library will manage the storing, retrieval and some invalidation of data
+automatically on behalf of the filesystem if a cookie is attached to the
+``netfs_inode``.
+
+Note that local caching used to use the PG_private_2 flag (aliased as
+PG_fscache) to keep track of a page that was being written to the cache, but
+this is now deprecated as PG_private_2 will be removed.
+
+Instead, folios that are read from the server for which there was no data in
+the cache will be marked as dirty and will have ``folio->private`` set to a
+special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left to writeback to write.
+If the folio is modified before that happens, the special value will be
+cleared and the folio will become normally dirty.
+
+When writeback occurs, folios that are so marked will only be written to the
+cache and not to the server. Writeback handles mixed cache-only writes and
+server-and-cache writes by using two streams, sending one to the cache and one
+to the server. The server stream will have gaps in it corresponding to those
+folios.
+
+Content Encryption (fscrypt)
+----------------------------
+
+Though it does not do so yet, at some point netfslib will acquire the ability
+to do client-side content encryption on behalf of the network filesystem (Ceph,
+for example). fscrypt can be used for this if appropriate (it may not be
+appropriate for cifs, for example).
+
+The data will be stored encrypted in the local cache using the same manner of
+encryption as the data written to the server and the library will impose bounce
+buffering and RMW cycles as necessary.
Per-Inode Context
@@ -40,10 +194,13 @@ structure is defined::
struct netfs_inode {
struct inode inode;
const struct netfs_request_ops *ops;
- struct fscache_cookie *cache;
+ struct fscache_cookie * cache;
+ loff_t remote_i_size;
+ unsigned long flags;
+ ...
};
-A network filesystem that wants to use netfs lib must place one of these in its
+A network filesystem that wants to use netfslib must place one of these in its
inode wrapper struct instead of the VFS ``struct inode``. This can be done in
a way similar to the following::
@@ -56,7 +213,8 @@ This allows netfslib to find its state by using ``container_of()`` from the
inode pointer, thereby allowing the netfslib helper functions to be pointed to
directly by the VFS/VM operation tables.
-The structure contains the following fields:
+The structure contains the following fields that are of interest to the
+filesystem:
* ``inode``
@@ -71,6 +229,37 @@ The structure contains the following fields:
Local caching cookie, or NULL if no caching is enabled. This field does not
exist if fscache is disabled.
+ * ``remote_i_size``
+
+ The size of the file on the server. This differs from inode->i_size if
+ local modifications have been made but not yet written back.
+
+ * ``flags``
+
+ A set of flags, some of which the filesystem might be interested in:
+
+ * ``NETFS_ICTX_MODIFIED_ATTR``
+
+ Set if netfslib modifies mtime/ctime. The filesystem is free to ignore
+ this or clear it.
+
+ * ``NETFS_ICTX_UNBUFFERED``
+
+ Do unbuffered I/O upon the file. Like direct I/O but without the
+ alignment limitations. RMW will be performed if necessary. The pagecache
+ will not be used unless mmap() is also used.
+
+ * ``NETFS_ICTX_WRITETHROUGH``
+
+ Do writethrough caching upon the file. I/O will be set up and dispatched
+ as buffered writes are made to the page cache. mmap() does the normal
+ writeback thing.
+
+ * ``NETFS_ICTX_SINGLE_NO_UPLOAD``
+
+ Set if the file has monolithic content that must be read entirely in a
+ single go and must not be written back to the server, though it can be
+ cached (e.g. AFS directories).
Inode Context Helper Functions
------------------------------
@@ -84,117 +273,250 @@ set the operations table pointer::
then a function to cast from the VFS inode structure to the netfs context::
- struct netfs_inode *netfs_node(struct inode *inode);
+ struct netfs_inode *netfs_inode(struct inode *inode);
and finally, a function to get the cache cookie pointer from the context
attached to an inode (or NULL if fscache is disabled)::
struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);
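+
+As a hedged sketch (names hypothetical), a filesystem will usually also define
+a cast helper for its own wrapper struct, assuming ``struct myfs_inode``
+embeds a ``struct netfs_inode netfs`` member as described above::
+
+        static inline struct myfs_inode *MYFS_I(struct inode *inode)
+        {
+                return container_of(inode, struct myfs_inode, netfs.inode);
+        }
+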
+Inode Locking
+-------------
+
+A number of functions are provided to manage the locking of i_rwsem for I/O and
+to effectively extend it to provide more separate classes of exclusion::
+
+ int netfs_start_io_read(struct inode *inode);
+ void netfs_end_io_read(struct inode *inode);
+ int netfs_start_io_write(struct inode *inode);
+ void netfs_end_io_write(struct inode *inode);
+ int netfs_start_io_direct(struct inode *inode);
+ void netfs_end_io_direct(struct inode *inode);
+
+The exclusion breaks down into four separate classes:
+
+ 1) Buffered reads and writes.
+
+ Buffered reads can run concurrently with each other and with buffered writes,
+ but buffered writes cannot run concurrently with each other.
+
+ 2) Direct reads and writes.
+
+ Direct (and unbuffered) reads and writes can run concurrently since they do
+ not share local buffering (i.e. the pagecache) and, in a network
+ filesystem, are expected to have exclusion managed on the server (though
+ this may not be the case for, say, Ceph).
+
+ 3) Other major inode modifying operations (e.g. truncate, fallocate).
+
+ These should just access i_rwsem directly.
+
+ 4) mmap().
+
+ mmap'd accesses might operate concurrently with any of the other classes.
+ They might form the buffer for an intra-file loopback DIO read/write. They
+ might be permitted on unbuffered files.
+
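+As a hedged sketch of class 1 (assuming the filesystem buffers through the
+pagecache and so can use ``filemap_read()``), a read path might look like::
+
+        static ssize_t myfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+        {
+                struct inode *inode = file_inode(iocb->ki_filp);
+                ssize_t ret;
+
+                ret = netfs_start_io_read(inode);
+                if (ret < 0)
+                        return ret;
+                /* Filesystem-specific revalidation could go here. */
+                ret = filemap_read(iocb, iter, 0);
+                netfs_end_io_read(inode);
+                return ret;
+        }
+
+This is roughly what the unlocked read/write iter helpers described below do
+on the filesystem's behalf.
+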
+Inode Writeback
+---------------
+
+Netfslib will pin resources on an inode for future writeback (such as pinning
+use of an fscache cookie) when an inode is dirtied. However, this pinning
+needs careful management. To manage the pinning, the following sequence
+occurs:
+
+ 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the
+ pinning begins (when a folio is dirtied, for example) if the cache is
+ active to stop the cache structures from being discarded and the cache
+ space from being culled. This also prevents re-getting of cache resources
+ if the flag is already set.
+
+ 2) This flag is then cleared inside the inode lock during inode writeback in the
+ VM - and the fact that it was set is transferred to ``->unpinned_netfs_wb``
+ in ``struct writeback_control``.
+
+ 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is forced.
+
+ 4) The filesystem's ``->write_inode()`` function is invoked to do the cleanup.
+
+ 5) The filesystem invokes netfs to do its cleanup.
+
+To do the cleanup, netfslib provides a function to do the resource unpinning::
+
+ int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);
+
+If the filesystem doesn't need to do anything else, this may be set as its
+``.write_inode`` method.
+
+Further, if an inode is deleted, the filesystem's write_inode method may not
+get called, so::
+
+ void netfs_clear_inode_writeback(struct inode *inode, const void *aux);
-Buffered Read Helpers
-=====================
+must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called.
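+
+A hedged sketch of the wiring (the ``myfs_`` names are hypothetical)::
+
+        static void myfs_evict_inode(struct inode *inode)
+        {
+                truncate_inode_pages_final(&inode->i_data);
+                netfs_clear_inode_writeback(inode, NULL);
+                clear_inode(inode);
+        }
+
+        static const struct super_operations myfs_super_ops = {
+                .write_inode = netfs_unpin_writeback,
+                .evict_inode = myfs_evict_inode,
+        };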
-The library provides a set of read helpers that handle the ->read_folio(),
-->readahead() and much of the ->write_begin() VM operations and translate them
-into a common call framework.
-The following services are provided:
+High-Level VFS API
+==================
- * Handle folios that span multiple pages.
+Netfslib provides a number of sets of API calls for the filesystem to delegate
+VFS operations to. Netfslib, in turn, will call out to the filesystem and the
+cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene
+at various times.
- * Insulate the netfs from VM interface changes.
+Unlocked Read/Write Iter
+------------------------
- * Allow the netfs to arbitrarily split reads up into pieces, even ones that
- don't match folio sizes or folio alignments and that may cross folios.
+The first API set is for the delegation of operations to netfslib when the
+filesystem is called through the standard VFS read/write_iter methods::
- * Allow the netfs to expand a readahead request in both directions to meet its
- needs.
+ ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
+ ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);
- * Allow the netfs to partially fulfil a read, which will then be resubmitted.
+They can be assigned directly to ``.read_iter`` and ``.write_iter``. They
+perform the inode locking themselves and the first two will switch between
+buffered I/O and DIO as appropriate.
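+
+For instance (a sketch; the ``myfs_`` entries are hypothetical)::
+
+        const struct file_operations myfs_file_operations = {
+                .llseek     = generic_file_llseek,
+                .read_iter  = netfs_file_read_iter,
+                .write_iter = netfs_file_write_iter,
+                .mmap       = myfs_file_mmap,
+                .fsync      = myfs_fsync,
+        };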
- * Handle local caching, allowing cached data and server-read data to be
- interleaved for a single request.
+Pre-Locked Read/Write Iter
+--------------------------
- * Handle clearing of bufferage that aren't on the server.
+The second API set is for the delegation of operations to netfslib when the
+filesystem is called through the standard VFS methods, but needs to do some
+other stuff before or after calling netfslib whilst still inside the locked
+section (e.g. Ceph negotiating caps). The unbuffered read function is::
- * Handle retrying of reads that failed, switching reads from the cache to the
- server as necessary.
+ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter);
- * In the future, this is a place that other services can be performed, such as
- local encryption of data to be stored remotely or in the cache.
+This must not be assigned directly to ``.read_iter`` and the filesystem is
+responsible for performing the inode locking before calling it. In the case of
+buffered read, the filesystem should use ``filemap_read()``.
-From the network filesystem, the helpers require a table of operations. This
-includes a mandatory method to issue a read operation along with a number of
-optional methods.
+There are three functions for writes::
+ ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
+ struct netfs_group *netfs_group);
+ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
+ struct netfs_group *netfs_group);
+ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
+ struct netfs_group *netfs_group);
-Read Helper Functions
+These must not be assigned directly to ``.write_iter`` and the filesystem is
+responsible for performing the inode locking before calling them.
+
+The first two functions are for buffered writes; the first just adds some
+standard write checks and jumps to the second, but if the filesystem wants to
+do the checks itself, it can use the second directly. The third function is
+for unbuffered or DIO writes.
+
+On all three write functions, there is a writeback group pointer (which should
+be NULL if the filesystem doesn't use this). Writeback groups are set on
+folios when they're modified. If a folio to-be-modified is already marked with
+a different group, it is flushed first. The writeback API allows writing back
+of a specific group.
+
+Memory-Mapped I/O API
---------------------
-Three read helpers are provided::
+An API for support of mmap()'d I/O is provided::
+
+ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
+
+This allows the filesystem to delegate ``.page_mkwrite`` to netfslib. The
+filesystem should not take the inode lock before calling it, but, as with the
+locked write functions above, this does take a writeback group pointer. If the
+page to be made writable is in a different group, it will be flushed first.
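+
+If the filesystem doesn't use writeback groups, the delegation can be as
+simple as this sketch::
+
+        static vm_fault_t myfs_page_mkwrite(struct vm_fault *vmf)
+        {
+                return netfs_page_mkwrite(vmf, NULL);
+        }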
+
+Monolithic Files API
+--------------------
+
+There is also a special API set for files for which the content must be read in
+a single RPC (and not written back) and is maintained as a monolithic blob
+(e.g. an AFS directory), though it can be stored and updated in the local cache::
+
+ ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
+ void netfs_single_mark_inode_dirty(struct inode *inode);
+ int netfs_writeback_single(struct address_space *mapping,
+ struct writeback_control *wbc,
+ struct iov_iter *iter);
+
+The first function reads from a file into the given buffer, reading from the
+cache in preference if the data is cached there; the second function allows the
+inode to be marked dirty, causing a later writeback; and the third function can
+be called from the writeback code to write the data to the cache, if there is
+one.
+
+The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be
+used. The writeback function requires the buffer to be of ITER_FOLIOQ type.
- void netfs_readahead(struct readahead_control *ractl);
- int netfs_read_folio(struct file *file,
- struct folio *folio);
- int netfs_write_begin(struct netfs_inode *ctx,
- struct file *file,
- struct address_space *mapping,
- loff_t pos,
- unsigned int len,
- struct folio **_folio,
- void **_fsdata);
+High-Level VM API
+==================
-Each corresponds to a VM address space operation. These operations use the
-state in the per-inode context.
+Netfslib also provides a number of sets of API calls for the filesystem to
+delegate VM operations to. Again, netfslib, in turn, will call out to the
+filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places
+for it to intervene at various times::
-For ->readahead() and ->read_folio(), the network filesystem just point directly
-at the corresponding read helper; whereas for ->write_begin(), it may be a
-little more complicated as the network filesystem might want to flush
-conflicting writes or track dirty data and needs to put the acquired folio if
-an error occurs after calling the helper.
+ void netfs_readahead(struct readahead_control *);
+ int netfs_read_folio(struct file *, struct folio *);
+ int netfs_writepages(struct address_space *mapping,
+ struct writeback_control *wbc);
+ bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
+ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
+ bool netfs_release_folio(struct folio *folio, gfp_t gfp);
-The helpers manage the read request, calling back into the network filesystem
-through the suppplied table of operations. Waits will be performed as
-necessary before returning for helpers that are meant to be synchronous.
+These are ``address_space_operations`` methods and can be set directly in the
+operations table.
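+
+For example (a sketch)::
+
+        const struct address_space_operations myfs_aops = {
+                .read_folio       = netfs_read_folio,
+                .readahead        = netfs_readahead,
+                .writepages       = netfs_writepages,
+                .dirty_folio      = netfs_dirty_folio,
+                .invalidate_folio = netfs_invalidate_folio,
+                .release_folio    = netfs_release_folio,
+                .migrate_folio    = filemap_migrate_folio,
+        };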
-If an error occurs, the ->free_request() will be called to clean up the
-netfs_io_request struct allocated. If some parts of the request are in
-progress when an error occurs, the request will get partially completed if
-sufficient data is read.
+Deprecated PG_private_2 API
+---------------------------
-Additionally, there is::
+There is also a deprecated function for filesystems that still use the
+``->write_begin`` method::
- * void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
- ssize_t transferred_or_error,
- bool was_async);
+ int netfs_write_begin(struct netfs_inode *inode, struct file *file,
+ struct address_space *mapping, loff_t pos, unsigned int len,
+ struct folio **_folio, void **_fsdata);
-which should be called to complete a read subrequest. This is given the number
-of bytes transferred or a negative error code, plus a flag indicating whether
-the operation was asynchronous (ie. whether the follow-on processing can be
-done in the current context, given this may involve sleeping).
+It uses the deprecated PG_private_2 flag and so should not be used.
-Read Helper Structures
-----------------------
+I/O Request API
+===============
-The read helpers make use of a couple of structures to maintain the state of
-the read. The first is a structure that manages a read request as a whole::
+The I/O request API comprises a number of structures and a number of functions
+that the filesystem may need to use.
+
+Request Structure
+-----------------
+
+The request structure manages the request as a whole, holding some resources
+and state on behalf of the filesystem and tracking the collection of results::
struct netfs_io_request {
+ enum netfs_io_origin origin;
struct inode *inode;
struct address_space *mapping;
- struct netfs_cache_resources cache_resources;
+ struct netfs_group *group;
+ struct netfs_io_stream io_streams[];
void *netfs_priv;
- loff_t start;
- size_t len;
- loff_t i_size;
- const struct netfs_request_ops *netfs_ops;
+ void *netfs_priv2;
+ unsigned long long start;
+ unsigned long long len;
+ unsigned long long i_size;
unsigned int debug_id;
+ unsigned long flags;
...
};
-The above fields are the ones the netfs can use. They are:
+Many of the fields are for internal use, but the fields shown here are of
+interest to the filesystem:
+
+ * ``origin``
+
+ The origin of the request (readahead, read_folio, DIO read, writeback, ...).
* ``inode``
* ``mapping``
@@ -202,11 +524,19 @@ The above fields are the ones the netfs can use. They are:
The inode and the address space of the file being read from. The mapping
may or may not point to inode->i_data.
- * ``cache_resources``
+ * ``group``
+
+ The writeback group this request is dealing with or NULL. This holds a ref
+ on the group.
- Resources for the local cache to use, if present.
+ * ``io_streams``
+
+ The parallel streams of subrequests available to the request. Currently two
+ are available, but this may be made extensible in future. ``NR_IO_STREAMS``
+ indicates the size of the array.
* ``netfs_priv``
+ * ``netfs_priv2``
The network filesystem's private data. The value for this can be passed in
to the helper functions or set during the request.
@@ -221,37 +551,121 @@ The above fields are the ones the netfs can use. They are:
The size of the file at the start of the request.
- * ``netfs_ops``
-
- A pointer to the operation table. The value for this is passed into the
- helper functions.
-
* ``debug_id``
A number allocated to this operation that can be displayed in trace lines
for reference.
+ * ``flags``
+
+ Flags for managing and controlling the operation of the request. Some of
+ these may be of interest to the filesystem:
+
+ * ``NETFS_RREQ_RETRYING``
+
+ Netfslib sets this when generating retries.
+
+ * ``NETFS_RREQ_PAUSE``
+
+ The filesystem can set this to request to pause the library's subrequest
+ issuing loop - but care needs to be taken as netfslib may also set it.
+
+ * ``NETFS_RREQ_NONBLOCK``
+ * ``NETFS_RREQ_BLOCKED``
+
+ Netfslib sets the first to indicate that non-blocking mode was set by the
+ caller and the filesystem can set the second to indicate that it would
+ have had to block.
+
+ * ``NETFS_RREQ_USE_PGPRIV2``
+
+ The filesystem can set this if it wants to use PG_private_2 to track
+ whether a folio is being written to the cache. This is deprecated as
+ PG_private_2 is going to go away.
+
+If the filesystem wants more private data than is afforded by this structure,
+then it should wrap it and provide its own allocator.
+
+Stream Structure
+----------------
+
+A request comprises one or more parallel streams and each stream may be aimed
+at a different target.
+
+For read requests, only stream 0 is used. This can contain a mixture of
+subrequests aimed at different sources. For write requests, stream 0 is used
+for the server and stream 1 is used for the cache. For buffered writeback,
+stream 0 is not enabled unless a normal dirty folio is encountered, at which
+point ->begin_writeback() will be invoked and the filesystem can mark the
+stream available.
+
+The stream struct looks like::
+
+ struct netfs_io_stream {
+ unsigned char stream_nr;
+ bool avail;
+ size_t sreq_max_len;
+ unsigned int sreq_max_segs;
+ unsigned int submit_extendable_to;
+ ...
+ };
+
+A number of members are available for access/use by the filesystem:
+
+ * ``stream_nr``
+
+ The number of the stream within the request.
+
+ * ``avail``
+
+ True if the stream is available for use. The filesystem should set this on
+ stream zero from within its ->begin_writeback() method.
+
+ * ``sreq_max_len``
+ * ``sreq_max_segs``
+
+ These are set by the filesystem or the cache in ->prepare_read() or
+ ->prepare_write() for each subrequest to indicate the maximum number of
+ bytes and, optionally, the maximum number of segments (if not 0) that that
+ subrequest can support.
+
+ * ``submit_extendable_to``
+
+ The size that a subrequest can be rounded up to beyond the EOF, given the
+ available buffer. This allows the cache to work out if it can do a DIO read
+ or write that straddles the EOF marker.
-The second structure is used to manage individual slices of the overall read
-request::
+Subrequest Structure
+--------------------
+
+Individual units of I/O are managed by the subrequest structure. These
+represent slices of the overall request and run independently::
struct netfs_io_subrequest {
struct netfs_io_request *rreq;
- loff_t start;
+ struct iov_iter io_iter;
+ unsigned long long start;
size_t len;
size_t transferred;
unsigned long flags;
+ short error;
unsigned short debug_index;
+ unsigned char stream_nr;
...
};
-Each subrequest is expected to access a single source, though the helpers will
+Each subrequest is expected to access a single source, though the library will
handle falling back from one source type to another. The members are:
* ``rreq``
A pointer to the read request.
+ * ``io_iter``
+
+ An I/O iterator representing a slice of the buffer to be read into or
+ written from.
+
* ``start``
* ``len``
@@ -260,256 +674,295 @@ handle falling back from one source type to another. The members are:
* ``transferred``
- The amount of data transferred so far of the length of this slice. The
- network filesystem or cache should start the operation this far into the
- slice. If a short read occurs, the helpers will call again, having updated
- this to reflect the amount read so far.
+ The amount of data transferred so far for this subrequest. The length of
+ the transfer made by this issuance of the subrequest should be added to
+ this. If this is less than ``len`` then the subrequest may be reissued to
+ continue.
* ``flags``
- Flags pertaining to the read. There are two of interest to the filesystem
- or cache:
+ Flags for managing the subrequest. A number of these are of interest to the
+ filesystem or cache:
+
+ * ``NETFS_SREQ_MADE_PROGRESS``
+
+ Set by the filesystem to indicate that at least one byte of data was read
+ or written.
+
+ * ``NETFS_SREQ_HIT_EOF``
+
+ The filesystem should set this if a read hit the EOF on the file (in which
+ case ``transferred`` should stop at the EOF). Netfslib may expand the
+ subrequest out to the size of the folio containing the EOF on the off
+ chance that a third party change happened or a DIO read may have asked for
+ more than is available. The library will clear any excess pagecache.
* ``NETFS_SREQ_CLEAR_TAIL``
- This can be set to indicate that the remainder of the slice, from
- transferred to len, should be cleared.
+ The filesystem can set this to indicate that the remainder of the slice,
+ from transferred to len, should be cleared. Do not set if HIT_EOF is set.
+
+ * ``NETFS_SREQ_NEED_RETRY``
+
+ The filesystem can set this to tell netfslib to retry the subrequest.
- * ``NETFS_SREQ_SEEK_DATA_READ``
+ * ``NETFS_SREQ_BOUNDARY``
- This is a hint to the cache that it might want to try skipping ahead to
- the next data (ie. using SEEK_DATA).
+ This can be set by the filesystem on a subrequest to indicate that it ends
+ at a boundary with the filesystem structure (e.g. at the end of a Ceph
+ object). It tells netfslib not to retile subrequests across it.
+
+ * ``error``
+
+ This is for the filesystem to store the result of the subrequest. It should
+ set to 0 if successful and a negative error code otherwise.
* ``debug_index``
+ * ``stream_nr``
A number allocated to this slice that can be displayed in trace lines for
- reference.
+ reference and the number of the request stream that it belongs to.
+
+If necessary, the filesystem can get and put extra refs on the subrequest it is
+given::
+ void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
+ enum netfs_sreq_ref_trace what);
+ void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
+ enum netfs_sreq_ref_trace what);
-Read Helper Operations
-----------------------
+using netfs trace codes to indicate the reason. Care must be taken, however,
+as once control of the subrequest is returned to netfslib, the same subrequest
+can be reissued/retried.
-The network filesystem must provide the read helpers with a table of operations
-through which it can issue requests and negotiate::
+Filesystem Methods
+------------------
+
+The filesystem sets a table of operations in ``netfs_inode`` for netfslib to
+use::
struct netfs_request_ops {
- void (*init_request)(struct netfs_io_request *rreq, struct file *file);
+ mempool_t *request_pool;
+ mempool_t *subrequest_pool;
+ int (*init_request)(struct netfs_io_request *rreq, struct file *file);
void (*free_request)(struct netfs_io_request *rreq);
- int (*begin_cache_operation)(struct netfs_io_request *rreq);
+ void (*free_subrequest)(struct netfs_io_subrequest *rreq);
void (*expand_readahead)(struct netfs_io_request *rreq);
- bool (*clamp_length)(struct netfs_io_subrequest *subreq);
+ int (*prepare_read)(struct netfs_io_subrequest *subreq);
void (*issue_read)(struct netfs_io_subrequest *subreq);
- bool (*is_still_valid)(struct netfs_io_request *rreq);
- int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
- struct folio **foliop, void **_fsdata);
void (*done)(struct netfs_io_request *rreq);
+ void (*update_i_size)(struct inode *inode, loff_t i_size);
+ void (*post_modify)(struct inode *inode);
+ void (*begin_writeback)(struct netfs_io_request *wreq);
+ void (*prepare_write)(struct netfs_io_subrequest *subreq);
+ void (*issue_write)(struct netfs_io_subrequest *subreq);
+ void (*retry_request)(struct netfs_io_request *wreq,
+ struct netfs_io_stream *stream);
+ void (*invalidate_cache)(struct netfs_io_request *wreq);
};
-The operations are as follows:
+The table starts with a pair of optional pointers to memory pools from which
+requests and subrequests can be allocated. If these are not given, netfslib
+has default pools that it will use instead. If the filesystem wraps the netfs
+structs in its own larger structs, then it will need to use its own pools.
+Netfslib will allocate directly from the pools.
- * ``init_request()``
-
- [Optional] This is called to initialise the request structure. It is given
- the file for reference.
+The methods defined in the table are:
+ * ``init_request()``
* ``free_request()``
+ * ``free_subrequest()``
- [Optional] This is called as the request is being deallocated so that the
- filesystem can clean up any state it has attached there.
-
- * ``begin_cache_operation()``
-
- [Optional] This is called to ask the network filesystem to call into the
- cache (if present) to initialise the caching state for this read. The netfs
- library module cannot access the cache directly, so the cache should call
- something like fscache_begin_read_operation() to do this.
-
- The cache gets to store its state in ->cache_resources and must set a table
- of operations of its own there (though of a different type).
-
- This should return 0 on success and an error code otherwise. If an error is
- reported, the operation may proceed anyway, just without local caching (only
- out of memory and interruption errors cause failure here).
+ [Optional] A filesystem may implement these to initialise or clean up any
+ resources that it attaches to the request or subrequest.
* ``expand_readahead()``
[Optional] This is called to allow the filesystem to expand the size of a
- readahead read request. The filesystem gets to expand the request in both
- directions, though it's not permitted to reduce it as the numbers may
- represent an allocation already made. If local caching is enabled, it gets
- to expand the request first.
+ readahead request. The filesystem gets to expand the request in both
+ directions, though it must retain the initial region as that may represent
+ an allocation already made. If local caching is enabled, it gets to expand
+ the request first.
Expansion is communicated by changing ->start and ->len in the request
structure. Note that if any change is made, ->len must be increased by at
least as much as ->start is reduced.
- * ``clamp_length()``
-
- [Optional] This is called to allow the filesystem to reduce the size of a
- subrequest. The filesystem can use this, for example, to chop up a request
- that has to be split across multiple servers or to put multiple reads in
- flight.
-
- This should return 0 on success and an error code on error.
-
- * ``issue_read()``
+ * ``prepare_read()``
- [Required] The helpers use this to dispatch a subrequest to the server for
- reading. In the subrequest, ->start, ->len and ->transferred indicate what
- data should be read from the server.
+ [Optional] This is called to allow the filesystem to limit the size of a
+ subrequest. It may also limit the number of individual regions in the
+ iterator, as may be required by RDMA. This information should be set on
+ stream zero in::
- There is no return value; the netfs_subreq_terminated() function should be
- called to indicate whether or not the operation succeeded and how much data
- it transferred. The filesystem also should not deal with setting folios
- uptodate, unlocking them or dropping their refs - the helpers need to deal
- with this as they have to coordinate with copying to the local cache.
+ rreq->io_streams[0].sreq_max_len
+ rreq->io_streams[0].sreq_max_segs
- Note that the helpers have the folios locked, but not pinned. It is
- possible to use the ITER_XARRAY iov iterator to refer to the range of the
- inode that is being operated upon without the need to allocate large bvec
- tables.
+ The filesystem can use this, for example, to chop up a request that has to
+ be split across multiple servers or to put multiple reads in flight.
- * ``is_still_valid()``
+ Zero should be returned on success and an error code otherwise.
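+
+ As a minimal sketch (``myfs_max_rpc_len()`` is a hypothetical helper
+ returning a negotiated per-server I/O limit)::
+
+        static int myfs_prepare_read(struct netfs_io_subrequest *subreq)
+        {
+                struct netfs_io_request *rreq = subreq->rreq;
+
+                rreq->io_streams[0].sreq_max_len = myfs_max_rpc_len(rreq->inode);
+                return 0;
+        }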
- [Optional] This is called to find out if the data just read from the local
- cache is still valid. It should return true if it is still valid and false
- if not. If it's not still valid, it will be reread from the server.
+ * ``issue_read()``
- * ``check_write_begin()``
+ [Required] Netfslib calls this to dispatch a subrequest to the server for
+ reading. In the subrequest, ->start, ->len and ->transferred indicate what
+ data should be read from the server and ->io_iter indicates the buffer to be
+ used.
- [Optional] This is called from the netfs_write_begin() helper once it has
- allocated/grabbed the folio to be modified to allow the filesystem to flush
- conflicting state before allowing it to be modified.
+ There is no return value; the ``netfs_read_subreq_terminated()`` function
+ should be called to indicate that the subrequest completed either way.
+ ->error, ->transferred and ->flags should be updated before completing. The
+ termination can be done asynchronously.
- It may unlock and discard the folio it was given and set the caller's folio
- pointer to NULL. It should return 0 if everything is now fine (``*foliop``
- left set) or the op should be retried (``*foliop`` cleared) and any other
- error code to abort the operation.
+ Note: the filesystem must not deal with setting folios uptodate, unlocking
+ them or dropping their refs - the library deals with this as it may have to
+ stitch together the results of multiple subrequests that variously overlap
+ the set of folios.
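+
+ A rough sketch of a synchronous implementation (``myfs_read_from_server()``
+ is a hypothetical helper that reads ->len bytes at ->start into the buffer
+ described by ->io_iter)::
+
+     static void myfs_issue_read(struct netfs_io_subrequest *subreq)
+     {
+             ssize_t ret;
+
+             ret = myfs_read_from_server(subreq->rreq->inode, subreq->start,
+                                         subreq->len, &subreq->io_iter);
+             if (ret < 0)
+                     subreq->error = ret;
+             else
+                     subreq->transferred = ret;
+             netfs_read_subreq_terminated(subreq);
+     }
+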
- * ``done``
+ * ``done()``
- [Optional] This is called after the folios in the request have all been
+ [Optional] This is called after the folios in a read request have all been
unlocked (and marked uptodate if applicable).
+ * ``update_i_size()``
+
+ [Optional] This is invoked by netfslib at various points during the write
+ paths to ask the filesystem to update its idea of the file size. If not
+ given, netfslib will set i_size and i_blocks and update the local cache
+ cookie.
+
+ * ``post_modify()``
+
+ [Optional] This is called after netfslib writes to the pagecache or when it
+ allows an mmap'd page to be marked as writable.
+
+ * ``begin_writeback()``
+
+ [Optional] Netfslib calls this when processing a writeback request if it
+ finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE,
+ indicating it must be written to the server. This allows the filesystem to
+ only set up writeback resources when it knows it's going to have to perform
+ a write.
+
+ * ``prepare_write()``
+
+ [Optional] This is called to allow the filesystem to limit the size of a
+ subrequest. It may also limit the number of individual regions in the
+ iterator, such as is required by RDMA. This information should be set on
+ the stream to which the subrequest belongs::
-Read Helper Procedure
----------------------
-
-The read helpers work by the following general procedure:
-
- * Set up the request.
-
- * For readahead, allow the local cache and then the network filesystem to
- propose expansions to the read request. This is then proposed to the VM.
- If the VM cannot fully perform the expansion, a partially expanded read will
- be performed, though this may not get written to the cache in its entirety.
-
- * Loop around slicing chunks off of the request to form subrequests:
-
- * If a local cache is present, it gets to do the slicing, otherwise the
- helpers just try to generate maximal slices.
+ rreq->io_streams[subreq->stream_nr].sreq_max_len
+ rreq->io_streams[subreq->stream_nr].sreq_max_segs
- * The network filesystem gets to clamp the size of each slice if it is to be
- the source. This allows rsize and chunking to be implemented.
+ The filesystem can use this, for example, to chop up a request that has to
+ be split across multiple servers or to put multiple writes in flight.
- * The helpers issue a read from the cache or a read from the server or just
- clears the slice as appropriate.
+ This is not permitted to return an error. Instead, in the event of failure,
+ ``netfs_prepare_write_failed()`` must be called.
- * The next slice begins at the end of the last one.
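+
+ A sketch under the same assumptions (``MYFS_WSIZE`` and
+ ``myfs_have_write_credits()`` are invented names), showing the mandated
+ failure path::
+
+     static void myfs_prepare_write(struct netfs_io_subrequest *subreq)
+     {
+             struct netfs_io_stream *stream =
+                     &subreq->rreq->io_streams[subreq->stream_nr];
+
+             stream->sreq_max_len = MYFS_WSIZE;
+             /* No error may be returned; signal failure this way instead. */
+             if (!myfs_have_write_credits(subreq->rreq->inode))
+                     netfs_prepare_write_failed(subreq);
+     }
+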
+ * ``issue_write()``
- * As slices finish being read, they terminate.
+ [Required] This is used to dispatch a subrequest to the server for writing.
+ In the subrequest, ->start, ->len and ->transferred indicate what data
+ should be written to the server and ->io_iter indicates the buffer to be
+ used.
- * When all the subrequests have terminated, the subrequests are assessed and
- any that are short or have failed are reissued:
+ There is no return value; the ``netfs_write_subrequest_terminated()``
+ function should be called to indicate that the subrequest completed either
+ way. ->error, ->transferred and ->flags should be updated before completing.
+ The termination can be done asynchronously.
- * Failed cache requests are issued against the server instead.
+ Note: the filesystem must not deal with removing the dirty or writeback
+ marks on folios involved in the operation and should not take refs or pins
+ on them, but should leave retention to netfslib.
- * Failed server requests just fail.
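+
+ A matching sketch (``myfs_write_to_server()`` is hypothetical; it writes
+ ->len bytes from ->io_iter at offset ->start and returns the count written
+ or a negative error)::
+
+     static void myfs_issue_write(struct netfs_io_subrequest *subreq)
+     {
+             ssize_t ret;
+
+             ret = myfs_write_to_server(subreq->rreq->inode, subreq->start,
+                                        subreq->len, &subreq->io_iter);
+             /* Pass either the byte count or the negative error code. */
+             netfs_write_subrequest_terminated(subreq, ret);
+     }
+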
+ * ``retry_request()``
- * Short reads against either source will be reissued against that source
- provided they have transferred some more data:
+ [Optional] Netfslib calls this at the beginning of a retry cycle. This
+ allows the filesystem to examine the state of the request and of the
+ subrequests in the indicated stream, as well as its own data, and to make
+ adjustments or renegotiate resources.
+
+ * ``invalidate_cache()``
- * The cache may need to skip holes that it can't do DIO from.
+ [Optional] This is called by netfslib to invalidate data stored in the local
+ cache in the event that writing to the local cache fails. The filesystem
+ can supply updated coherency data that netfslib itself cannot provide.
- * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
- end of the slice instead of reissuing.
+Terminating a subrequest
+------------------------
- * Once the data is read, the folios that have been fully read/cleared:
+When a subrequest completes, there are a number of functions that the
+filesystem or cache can call to inform netfslib of the status change. One
+function is provided to terminate a write subrequest at the preparation
+stage and acts synchronously:
- * Will be marked uptodate.
+ * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);``
- * If a cache is present, will be marked with PG_fscache.
+ Indicate that the ->prepare_write() call failed. The ``error`` field should
+ have been updated.
- * Unlocked
+Note that ->prepare_read() can return an error as a read can simply be aborted.
+Dealing with writeback failure is trickier.
- * Any folios that need writing to the cache will then have DIO writes issued.
+The other functions are used for subrequests that got as far as being issued:
- * Synchronous operations will wait for reading to be complete.
+ * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);``
- * Writes to the cache will proceed asynchronously and the folios will have the
- PG_fscache mark removed when that completes.
+ Tell netfslib that a read subrequest has terminated. The ``error``,
+ ``flags`` and ``transferred`` fields should have been updated.
- * The request structures will be cleaned up when everything has completed.
+ * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);``
+
+ Tell netfslib that a write subrequest has terminated. Either the amount of
+ data processed or the negative error code can be passed in. This can be
+ used as a kiocb completion function.
-Read Helper Cache API
----------------------
+ * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);``
-When implementing a local cache to be used by the read helpers, two things are
-required: some way for the network filesystem to initialise the caching for a
-read request and a table of operations for the helpers to call.
+ This is provided to optionally update netfslib on the incremental progress
+ of a read, allowing some folios to be unlocked early; it does not actually
+ terminate the subrequest. The ``transferred`` field should have been
+ updated.
-The network filesystem's ->begin_cache_operation() method is called to set up a
-cache and this must call into the cache to do the work. If using fscache, for
-example, the cache would call::
+Local Cache API
+---------------
- int fscache_begin_read_operation(struct netfs_io_request *rreq,
- struct fscache_cookie *cookie);
+Netfslib provides a separate API for a local cache to implement, though many
+of its routines are similar to those of the filesystem request API.
-passing in the request pointer and the cookie corresponding to the file.
-
-The netfs_io_request object contains a place for the cache to hang its
+Firstly, the netfs_io_request object contains a place for the cache to hang its
state::
struct netfs_cache_resources {
const struct netfs_cache_ops *ops;
void *cache_priv;
void *cache_priv2;
+ unsigned int debug_id;
+ unsigned int inval_counter;
};
-This contains an operations table pointer and two private pointers. The
-operation table looks like the following::
+This contains an operations table pointer and two private pointers, plus the
+debug ID of the fscache cookie for tracing purposes and an invalidation
+counter that is incremented by calls to ``fscache_invalidate()``, allowing
+cache subrequests to be invalidated after completion.
+
+The cache operation table looks like the following::
struct netfs_cache_ops {
void (*end_operation)(struct netfs_cache_resources *cres);
-
void (*expand_readahead)(struct netfs_cache_resources *cres,
loff_t *_start, size_t *_len, loff_t i_size);
-
enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
- loff_t i_size);
-
+ loff_t i_size);
int (*read)(struct netfs_cache_resources *cres,
loff_t start_pos,
struct iov_iter *iter,
bool seek_data,
netfs_io_terminated_t term_func,
void *term_func_priv);
-
- int (*prepare_write)(struct netfs_cache_resources *cres,
- loff_t *_start, size_t *_len, loff_t i_size,
- bool no_space_allocated_yet);
-
- int (*write)(struct netfs_cache_resources *cres,
- loff_t start_pos,
- struct iov_iter *iter,
- netfs_io_terminated_t term_func,
- void *term_func_priv);
-
- int (*query_occupancy)(struct netfs_cache_resources *cres,
- loff_t start, size_t len, size_t granularity,
- loff_t *_data_start, size_t *_data_len);
+ void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
+ void (*issue_write)(struct netfs_io_subrequest *subreq);
};
With a termination handler function pointer::
@@ -526,11 +979,17 @@ The methods defined in the table are:
* ``expand_readahead()``
- [Optional] Called at the beginning of a netfs_readahead() operation to allow
- the cache to expand a request in either direction. This allows the cache to
+ [Optional] Called at the beginning of a readahead operation to allow the
+ cache to expand a request in either direction. This allows the cache to
size the request appropriately for the cache granularity.
- The function is passed poiners to the start and length in its parameters,
+ * ``prepare_read()``
+
+ [Required] Called to configure the next slice of a request. ->start and
+ ->len in the subrequest indicate where and how big the next slice can be;
+ the cache gets to reduce the length to match its granularity requirements.
+
+ The function is passed pointers to the start and length in its parameters,
plus the size of the file for reference, and adjusts the start and length
appropriately. It should return one of:
@@ -543,12 +1002,6 @@ The methods defined in the table are:
downloaded from the server or read from the cache - or whether slicing
should be given up at the current point.
- * ``prepare_read()``
-
- [Required] Called to configure the next slice of a request. ->start and
- ->len in the subrequest indicate where and how big the next slice can be;
- the cache gets to reduce the length to match its granularity requirements.
-
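+
+ To illustrate, a cache might choose the source for the next slice roughly
+ as follows (a sketch only; ``mycache_has_data()`` is an invented helper
+ that trims ->len to the cache granule and reports whether the data is
+ present)::
+
+     static enum netfs_io_source
+     mycache_prepare_read(struct netfs_io_subrequest *subreq, loff_t i_size)
+     {
+             if (subreq->start >= i_size)
+                     return NETFS_FILL_WITH_ZEROES;
+             if (mycache_has_data(subreq))
+                     return NETFS_READ_FROM_CACHE;
+             return NETFS_DOWNLOAD_FROM_SERVER;
+     }
+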
* ``read()``
[Required] Called to read from the cache. The start file offset is given
@@ -562,44 +1015,33 @@ The methods defined in the table are:
indicating whether the termination is definitely happening in the caller's
context.
- * ``prepare_write()``
+ * ``prepare_write_subreq()``
- [Required] Called to prepare a write to the cache to take place. This
- involves checking to see whether the cache has sufficient space to honour
- the write. ``*_start`` and ``*_len`` indicate the region to be written; the
- region can be shrunk or it can be expanded to a page boundary either way as
- necessary to align for direct I/O. i_size holds the size of the object and
- is provided for reference. no_space_allocated_yet is set to true if the
- caller is certain that no data has been written to that region - for example
- if it tried to do a read from there already.
+ [Required] This is called to allow the cache to limit the size of a
+ subrequest. It may also limit the number of individual regions in the
+ iterator, such as is required by DIO/DMA. This information should be set
+ on the stream to which the subrequest belongs::
- * ``write()``
+ rreq->io_streams[subreq->stream_nr].sreq_max_len
+ rreq->io_streams[subreq->stream_nr].sreq_max_segs
- [Required] Called to write to the cache. The start file offset is given
- along with an iterator to write from, which gives the length also.
-
- Also provided is a pointer to a termination handler function and private
- data to pass to that function. The termination function should be called
- with the number of bytes transferred or an error code, plus a flag
- indicating whether the termination is definitely happening in the caller's
- context.
+ The cache can use this, for example, to chop up a request that has to be
+ split or to put multiple writes in flight.
- * ``query_occupancy()``
+ This is not permitted to return an error. In the event of failure,
+ ``netfs_prepare_write_failed()`` must be called.
- [Required] Called to find out where the next piece of data is within a
- particular region of the cache. The start and length of the region to be
- queried are passed in, along with the granularity to which the answer needs
- to be aligned. The function passes back the start and length of the data,
- if any, available within that region. Note that there may be a hole at the
- front.
+ * ``issue_write()``
- It returns 0 if some data was found, -ENODATA if there was no usable data
- within the region or -ENOBUFS if there is no caching on this file.
+ [Required] This is used to dispatch a subrequest to the cache for writing.
+ In the subrequest, ->start, ->len and ->transferred indicate what data
+ should be written to the cache and ->io_iter indicates the buffer to be
+ used.
-Note that these methods are passed a pointer to the cache resource structure,
-not the read request structure as they could be used in other situations where
-there isn't a read request structure as well, such as writing dirty data to the
-cache.
+ There is no return value; the ``netfs_write_subrequest_terminated()``
+ function should be called to indicate that the subrequest completed either
+ way. ->error, ->transferred and ->flags should be updated before completing.
+ The termination can be done asynchronously.
API Function Reference
@@ -607,4 +1049,3 @@ API Function Reference
.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/buffered_read.c
-.. kernel-doc:: fs/netfs/io.c
diff --git a/Documentation/filesystems/nfs/client-identifier.rst b/Documentation/filesystems/nfs/client-identifier.rst
index a94c7a9748d7..4804441155f5 100644
--- a/Documentation/filesystems/nfs/client-identifier.rst
+++ b/Documentation/filesystems/nfs/client-identifier.rst
@@ -131,7 +131,7 @@ deployments, this construction is usually adequate. Often, however,
the node name by itself is not adequately unique, and can change
unexpectedly. Problematic situations include:
- - NFS-root (diskless) clients, where the local DCHP server (or
+ - NFS-root (diskless) clients, where the local DHCP server (or
equivalent) does not provide a unique host name.
- "Containers" within a single Linux host. If each container has
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 3d97b8d8f735..de64d2d002a2 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -122,12 +122,9 @@ are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which has the following members:
- encode_fh (optional)
+ encode_fh (mandatory)
Takes a dentry and creates a filehandle fragment which may later be used
- to find or create a dentry for the same object. The default
- implementation creates a filehandle fragment that encodes a 32bit inode
- and generation number for the inode encoded, and if necessary the
- same information for the parent.
+ to find or create a dentry for the same object.
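+
+ As a minimal sketch in the classic inode-number-plus-generation style
+ (``myfs_encode_fh`` is illustrative; by convention ``*max_len`` is counted
+ in 32-bit words, and the FILEID_* constants come from
+ include/linux/exportfs.h)::
+
+     static int myfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
+                               struct inode *parent)
+     {
+             if (*max_len < 2) {
+                     *max_len = 2;
+                     return FILEID_INVALID;  /* buffer too small */
+             }
+             fh[0] = inode->i_ino;
+             fh[1] = inode->i_generation;
+             *max_len = 2;
+             return FILEID_INO32_GEN;        /* ino + generation fileid type */
+     }
+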
fh_to_dentry (mandatory)
Given a filehandle fragment, this should find the implied object and
@@ -215,3 +212,29 @@ following flags are defined:
This flag causes nfsd to close any open files for this inode _before_
calling into the vfs to do an unlink or a rename that would replace
an existing file.
+
+ EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote
+ PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to
+ write to one bdi (the final bdi) in order to free up writes queued
+ to another bdi (the client bdi). Such threads get a private balance
+ of dirty pages so that dirty pages for the client bdi do not impact
+ the daemon writing to the final bdi. For filesystems whose durable
+ storage is not local (such as exported NFS filesystems), this
+ constraint has negative consequences. EXPORT_OP_REMOTE_FS enables
+ an export to disable writeback throttling.
+
+ EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically
+ EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem
+ cannot provide the semantics required by the "atomic" boolean in
+ NFSv4's change_info4. This boolean indicates to a client whether the
+ returned before and after change attributes were obtained atomically
+ with respect to the requested metadata operation (UNLINK,
+ OPEN/CREATE, MKDIR, etc).
+
+ EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2)
+ On most filesystems, inodes can remain under writeback after the
+ file is closed. NFSD relies on client activity or local flusher
+ threads to handle writeback. Certain filesystems, such as NFS, flush
+ all of an inode's dirty data on last close. Exports that behave this
+ way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
+ waiting for writeback when closing such files.
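+
+ These flags live in the filesystem's struct export_operations, declared in
+ include/linux/exportfs.h. A hedged sketch of how an export might set them
+ (the handler names are illustrative)::
+
+     static const struct export_operations myfs_export_ops = {
+             .encode_fh    = myfs_encode_fh,
+             .fh_to_dentry = myfs_fh_to_dentry,
+             .flags        = EXPORT_OP_REMOTE_FS | EXPORT_OP_FLUSH_ON_CLOSE,
+     };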
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
index 8536134f31fd..a29a212b5b4d 100644
--- a/Documentation/filesystems/nfs/index.rst
+++ b/Documentation/filesystems/nfs/index.rst
@@ -8,9 +8,11 @@ NFS
client-identifier
exporting
+ localio
pnfs
rpc-cache
rpc-server-gss
nfs41-server
+ nfsd-io-modes
knfsd-stats
reexport
diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
new file mode 100644
index 000000000000..79808b37d745
--- /dev/null
+++ b/Documentation/filesystems/nfs/localio.rst
@@ -0,0 +1,357 @@
+===========
+NFS LOCALIO
+===========
+
+Overview
+========
+
+The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
+server to reliably handshake to determine if they are on the same
+host. Select "NFS client and server support for LOCALIO auxiliary
+protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel
+config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).
+
+Once an NFS client and server handshake as "local", the client will
+bypass the network RPC protocol for read, write and commit operations.
+Due to this XDR and RPC bypass, these operations will operate faster.
+
+The LOCALIO auxiliary protocol's implementation, which uses the same
+connection as NFS traffic, follows the pattern established by the NFS
+ACL protocol extension.
+
+The LOCALIO auxiliary protocol is needed to allow robust discovery of
+clients local to their servers. In a private implementation that
+preceded use of this LOCALIO protocol, a fragile sockaddr network
+address-based match against all local network interfaces was attempted.
+But unlike the LOCALIO protocol, the sockaddr-based matching didn't
+handle use of iptables or containers.
+
+The robust handshake between local client and server is just the
+beginning; the ultimate use case this locality makes possible is that
+the client is able to open files and issue reads, writes and commits
+directly to the server without having to go over the network. The
+requirement is to perform these loopback NFS operations as efficiently
+as possible; this is particularly useful for container use cases
+(e.g. kubernetes) where it is possible to run an IO job local to the
+server.
+
+The performance advantage realized from LOCALIO's ability to bypass
+using XDR and RPC for reads, writes and commits can be extreme, e.g.:
+
+fio for 20 secs with directio, qd of 8, 16 libaio threads:
+ - With LOCALIO:
+ 4K read: IOPS=979k, BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
+ 4K write: IOPS=165k, BW=646MiB/s (678MB/s)(12.6GiB/20002msec)
+ 128K read: IOPS=402k, BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
+ 128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)
+
+ - Without LOCALIO:
+ 4K read: IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec)
+ 4K write: IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec)
+ 128K read: IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
+ 128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)
+
+fio for 20 secs with directio, qd of 8, 1 libaio thread:
+ - With LOCALIO:
+ 4K read: IOPS=230k, BW=898MiB/s (941MB/s)(17.5GiB/20001msec)
+ 4K write: IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
+ 128K read: IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
+ 128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)
+
+ - Without LOCALIO:
+ 4K read: IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec)
+ 4K write: IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec)
+ 128K read: IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
+ 128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)
+
+FAQ
+===
+
+1. What are the use cases for LOCALIO?
+
+ a. Workloads where the NFS client and server are on the same host
+ realize improved IO performance. In particular, it is common when
+ running containerised workloads for jobs to find themselves
+ running on the same host as the knfsd server being used for
+ storage.
+
+2. What are the requirements for LOCALIO?
+
+ a. Bypass use of the network RPC protocol as much as possible. This
+ includes bypassing XDR and RPC for open, read, write and commit
+ operations.
+ b. Allow client and server to autonomously discover if they are
+ running local to each other without making any assumptions about
+ the local network topology.
+ c. Support the use of containers by being compatible with relevant
+ namespaces (e.g. network, user, mount).
+ d. Support all versions of NFS. NFSv3 is of particular importance
+ because it has wide enterprise usage and pNFS flexfiles makes use
+ of it for the data path.
+
+3. Why doesn’t LOCALIO just compare IP addresses or hostnames when
+ deciding if the NFS client and server are co-located on the same
+ host?
+
+ Since one of the main use cases is containerised workloads, we cannot
+ assume that IP addresses will be shared between the client and
+ server. This sets up a requirement for a handshake protocol that
+ needs to go over the same connection as the NFS traffic in order to
+ identify that the client and the server really are running on the
+ same host. The handshake uses a secret that is sent over the wire,
+ and can be verified by both parties by comparing with a value stored
+ in shared kernel memory if they are truly co-located.
+
+4. Does LOCALIO improve pNFS flexfiles?
+
+ Yes, LOCALIO complements pNFS flexfiles by allowing it to take
+ advantage of NFS client and server locality. Policy that initiates
+ client IO as close as possible to the server where the data is stored
+ naturally benefits from the data path optimization LOCALIO provides.
+
+5. Why not develop a new pNFS layout to enable LOCALIO?
+
+ A new pNFS layout could be developed, but doing so would put the
+ onus on the server to somehow discover that the client is co-located
+ when deciding to hand out the layout.
+ There is value in a simpler approach (as provided by LOCALIO) that
+ allows the NFS client to negotiate and leverage locality without
+ requiring more elaborate modeling and discovery of such locality in a
+ more centralized manner.
+
+6. Why is having the client perform a server-side file OPEN, without
+ using RPC, beneficial? Is the benefit pNFS specific?
+
+ Avoiding the use of XDR and RPC for file opens is beneficial to
+ performance regardless of whether pNFS is used. Especially when
+ dealing with small files, it's best to avoid going over the wire
+ whenever possible; otherwise it could reduce or even negate the
+ benefits of avoiding the wire for doing the small file I/O itself.
+ Given LOCALIO's requirements the current approach of having the
+ client perform a server-side file open, without using RPC, is ideal.
+ If in the future requirements change then we can adapt accordingly.
+
+7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?
+
+ Strong authentication is usually tied to the connection itself. It
+ works by establishing a context that is cached by the server, and
+ that acts as the key for discovering the authorisation token, which
+ can then be passed to rpc.mountd to complete the authentication
+ process. On the other hand, in the case of AUTH_UNIX, the credential
+ that was passed over the wire is used directly as the key in the
+ upcall to rpc.mountd. This simplifies the authentication process, and
+ so makes AUTH_UNIX easier to support.
+
+8. How do export options that translate RPC user IDs behave for LOCALIO
+ operations (e.g. root_squash, all_squash)?
+
+ Export options that translate user IDs are managed by nfsd_setuser()
+ which is called by nfsd_setuser_and_check_port() which is called by
+ __fh_verify(). So they get handled exactly the same way for LOCALIO
+ as they do for non-LOCALIO.
+
+9. How does LOCALIO make certain that object lifetimes are managed
+ properly given NFSD and NFS operate in different contexts?
+
+ See the detailed "NFS Client and Server Interlock" section below.
+
+RPC
+===
+
+The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
+RPC method that allows the Linux NFS client to verify the local Linux
+NFS server can see the nonce (single-use UUID) the client generated and
+made available in nfs_common. This protocol isn't part of an IETF
+standard, nor does it need to be, considering it is a Linux-to-Linux
+auxiliary RPC protocol that amounts to an implementation detail.
+
+The UUID_IS_LOCAL method encodes the client-generated uuid_t in terms of
+the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode
+XDR methods are used instead of the less efficient variable sized
+methods.
+
+The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
+by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ):
+Linux Kernel Organization 400122 nfslocalio
+
+The LOCALIO protocol spec in rpcgen syntax is::
+
+ /* raw RFC 9562 UUID */
+ #define UUID_SIZE 16
+ typedef u8 uuid_t<UUID_SIZE>;
+
+ program NFS_LOCALIO_PROGRAM {
+ version LOCALIO_V1 {
+ void
+ NULL(void) = 0;
+
+ void
+ UUID_IS_LOCAL(uuid_t) = 1;
+ } = 1;
+ } = 400122;
+
+LOCALIO uses the same transport connection as NFS traffic. As such,
+LOCALIO is not registered with rpcbind.
+
+NFS Common and Client/Server Handshake
+======================================
+
+fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
+to generate a nonce (single-use UUID) and an associated short-lived
+nfs_uuid_t struct, and to register it with nfs_common for subsequent
+lookup and verification by the NFS server; if matched, the NFS server
+populates members in the nfs_uuid_t struct. The NFS client then uses
+nfs_common to transfer the nfs_uuid_t from nfs_common's uuids_list to
+the nn->nfsd_serv clients_list. See:
+fs/nfs/localio.c:nfs_local_probe()
+
+nfs_common's nfs_uuids list is the basis for LOCALIO enablement; as such
+it has members that point to nfsd memory for direct use by the client
+(e.g. 'net' is the server's network namespace; through it the client can
+access nn->nfsd_serv with proper rcu read access). It is this client
+and server synchronization that allows advanced usage and the lifetime
+of objects to span from the host kernel's nfsd to per-container knfsd
+instances that are connected to NFS clients running on the same local
+host.
+
+NFS Client and Server Interlock
+===============================
+
+LOCALIO provides the nfs_uuid_t object and associated interfaces to
+allow proper network namespace (net-ns) and NFSD object refcounting.
+
+LOCALIO required the introduction and use of NFSD's percpu nfsd_net_ref
+to interlock nfsd_shutdown_net() and nfsd_open_local_fh(), to ensure
+each net-ns is not destroyed while in use by nfsd_open_local_fh(), and
+warrants a more detailed explanation:
+
+ nfsd_open_local_fh() uses nfsd_net_try_get() before opening its
+ nfsd_file handle and then the caller (NFS client) must drop the
+ reference for the nfsd_file and associated net-ns using
+ nfsd_file_put_local() once it has completed its IO.
+
+ This interlock working relies heavily on nfsd_open_local_fh() being
+ afforded the ability to safely deal with the possibility that the
+ NFSD's net-ns (and nfsd_net by association) may have been destroyed
+ by nfsd_destroy_serv() via nfsd_shutdown_net().
+
+This interlock of the NFS client and server has been verified to fix an
+easy-to-hit crash that would occur if an NFSD instance running in a
+container, with a LOCALIO client mounted, is shutdown. Upon restart of
+the container and associated NFSD, the client would go on to crash due
+to a NULL pointer dereference caused by the LOCALIO client attempting
+to nfsd_open_local_fh() without having a proper reference on NFSD's
+net-ns.
+
+NFS Client issues IO instead of Server
+======================================
+
+Because LOCALIO is focused on protocol bypass to achieve improved IO
+performance, alternatives to the traditional NFS wire protocol (SUNRPC
+with XDR) must be provided to access the backing filesystem.
+
+See fs/nfs/localio.c:nfs_local_open_fh() and
+fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
+focused use of select nfs server objects to allow a client local to a
+server to open a file pointer without needing to go over the network.
+
+The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
+server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
+both the associated nfsd network namespace and nn->nfsd_serv in terms of
+RCU. If nfsd_open_local_fh() finds that the client no longer sees valid
+nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO
+to nfs_local_open_fh() and the client will try to reestablish the
+LOCALIO resources needed by calling nfs_local_probe() again. This
+recovery is needed if/when an nfsd instance running in a container
+reboots while a LOCALIO client is connected to it.
+
+Once the client has an open nfsd_file pointer it will issue reads,
+writes and commits directly to the underlying local filesystem (normally
+done by the nfs server). As such, for these operations, the NFS client
+is issuing IO to the underlying local filesystem that it is sharing with
+the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
+fs/nfs/localio.c:nfs_local_commit().
+
+With normal NFS that makes use of RPC to issue IO to the server, if an
+application uses O_DIRECT the NFS client will bypass the pagecache but
+the NFS server will not. The NFS server's use of buffered IO allows
+applications to be less precise with their alignment when issuing IO to
+the NFS client. But if all applications properly align their IO, LOCALIO
+can be configured to use end-to-end O_DIRECT semantics from the NFS
+client to the underlying local filesystem, that it is sharing with
+the NFS server, by setting the 'localio_O_DIRECT_semantics' nfs module
+parameter to Y, e.g.::
+
+ echo Y > /sys/module/nfs/parameters/localio_O_DIRECT_semantics
+
+Once enabled, it will cause LOCALIO to use end-to-end O_DIRECT semantics
+(but again, this may cause IO to fail if applications do not properly
+align their IO).
+
+Security
+========
+
+LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka
+AUTH_SYS) is used.
+
+Care is taken to ensure the same NFS security mechanisms are used
+(authentication, etc) regardless of whether LOCALIO or regular NFS
+access is used. The auth_domain established as part of the traditional
+NFS client access to the NFS server is also used for LOCALIO.
+
+Relative to containers, LOCALIO gives the client access to the network
+namespace the server has. This is required to allow the client to access
+the server's per-namespace nfsd_net struct. With traditional NFS, the
+client is afforded this same level of access (albeit in terms of the NFS
+protocol via SUNRPC). No other namespaces (user, mount, etc) have been
+altered or purposely extended from the server to the client.
+
+Module Parameters
+=================
+
+/sys/module/nfs/parameters/localio_enabled (bool)
+controls if LOCALIO is enabled, defaults to Y. If client and server are
+local but 'localio_enabled' is set to N then LOCALIO will not be used.
+
+/sys/module/nfs/parameters/localio_O_DIRECT_semantics (bool)
+controls if O_DIRECT extends down to the underlying filesystem, defaults
+to N. Application IO must be logical blocksize aligned, otherwise
+O_DIRECT will fail.
+
+/sys/module/nfsv3/parameters/nfs3_localio_probe_throttle (uint)
+controls whether NFSv3 read and write IOs will trigger (re)enabling of
+LOCALIO every N (nfs3_localio_probe_throttle) IOs; defaults to 0
+(disabled). Must be a power of 2; the admin keeps all the pieces if
+they misconfigure it (too low a value or not a power of 2).
+
+Testing
+=======
+
+The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
+and commit access have proven stable against various test scenarios:
+
+- Client and server both on the same host.
+
+- All permutations of client and server support enablement for both
+ local and remote client and server.
+
+- Testing against NFS storage products that don't support the LOCALIO
+ protocol was also performed.
+
+- Client on host, server within a container (for both v3 and v4.2).
+ The container testing was in terms of podman-managed containers and
+ includes a successful container stop/restart scenario.
+
+- Formalizing these test scenarios in terms of existing test
+ infrastructure is ongoing. Initial regular coverage is provided in
+ terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
+ mount configuration, and includes lockdep and KASAN coverage, see:
+ https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
+ https://github.com/koverstreet/ktest
+
+- Various kdevops testing (in terms of "Chuck's BuildBot") has been
+ performed to regularly verify the LOCALIO changes haven't caused any
+ regressions to non-LOCALIO NFS use cases.
+
+- All of Hammerspace's various sanity tests pass with LOCALIO enabled
+ (this includes numerous pNFS and flexfiles tests).
diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
new file mode 100644
index 000000000000..0fd6e82478fe
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -0,0 +1,153 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+NFSD IO MODES
+=============
+
+Overview
+========
+
+NFSD has historically used buffered IO when servicing READ and
+WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
+to override that default to use either DONTCACHE or DIRECT IO modes.
+
+Experimental NFSD debugfs interfaces are available to allow the NFSD IO
+mode used for READ and WRITE to be configured independently. See both:
+
+- /sys/kernel/debug/nfsd/io_cache_read
+- /sys/kernel/debug/nfsd/io_cache_write
+
+The default value for both io_cache_read and io_cache_write reflects
+NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
+
+Based on the configured settings, NFSD's IO will either be:
+
+- cached using page cache (NFSD_IO_BUFFERED=0)
+- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
+- not cached, stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
+
+To set an NFSD IO mode, write a supported value (0 - 2) to the
+corresponding IO operation's debugfs interface, e.g.::
+
+ echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+ echo 2 > /sys/kernel/debug/nfsd/io_cache_write
+
+To check which IO mode NFSD is using for READ or WRITE, simply read the
+corresponding IO operation's debugfs interface, e.g.::
+
+ cat /sys/kernel/debug/nfsd/io_cache_read
+ cat /sys/kernel/debug/nfsd/io_cache_write
+
+If you experiment with NFSD's IO modes on a recent kernel and have
+interesting results, please report them to linux-nfs@vger.kernel.org.
+
+NFSD DONTCACHE
+==============
+
+DONTCACHE offers a hybrid approach to servicing IO that aims to offer
+the benefits of using DIRECT IO without any of the strict alignment
+requirements that DIRECT IO imposes. To achieve this, buffered IO is used
+but the IO is flagged to "drop behind" (meaning associated pages are
+dropped from the page cache) when IO completes.
+
+DONTCACHE aims to avoid what has proven to be a fairly significant
+limitation of Linux's memory management subsystem if/when large amounts
+of data are infrequently accessed (e.g. read once _or_ written once but
+not read until much later). Such use-cases are particularly problematic
+because the page cache will eventually become a bottleneck to servicing
+new IO requests.
+
+For more context on DONTCACHE, please see these Linux commit headers:
+
+- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
+ to take a struct kiocb")
+- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
+ RWF_DONTCACHE")
+- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
+
+NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
+filesystem doesn't indicate support by setting FOP_DONTCACHE.
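+
+To see the analogous behaviour from user space (NFSD uses the in-kernel
+equivalent internally), an application can pass RWF_DONTCACHE to
+pwritev2(2) on a filesystem that supports it. A minimal sketch::
+
+   #define _GNU_SOURCE
+   #include <sys/uio.h>
+
+   /* Write one buffer, asking the kernel to drop the cached pages
+    * once writeback of them completes (RWF_DONTCACHE may need
+    * <linux/fs.h> on older userspace headers). */
+   static ssize_t write_dontcache(int fd, void *buf, size_t len, off_t off)
+   {
+           struct iovec iov = { .iov_base = buf, .iov_len = len };
+
+           return pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);
+   }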
+
+NFSD DIRECT
+===========
+
+DIRECT IO doesn't make use of the page cache; as such it is able to
+avoid the Linux memory management's page reclaim scalability problems
+without resorting to the hybrid use of page cache that DONTCACHE does.
+
+Some workloads benefit from NFSD avoiding the page cache, particularly
+those with a working set that is significantly larger than available
+system memory. The pathological worst-case workload that NFSD DIRECT has
+proven to help most is: NFS client issuing large sequential IO to a file
+that is 2-3 times larger than the NFS server's available system memory.
+The reason for such improvement is that NFSD DIRECT eliminates a lot of work
+that the memory management subsystem would otherwise be required to
+perform (e.g. page allocation, dirty writeback, page reclaim). When
+using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
+time trying to find adequate free pages so that forward IO progress can
+be made.
+
+The performance win associated with using NFSD DIRECT was previously
+discussed on linux-nfs, see:
+https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
+
+But in summary:
+
+- NFSD DIRECT can significantly reduce memory requirements
+- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
+- NFSD DIRECT can offer more deterministic IO performance
+
+As always, your mileage may vary and so it is important to carefully
+consider if/when it is beneficial to make use of NFSD DIRECT. When
+assessing comparative performance of your workload please be sure to log
+relevant performance metrics during testing (e.g. memory usage, cpu
+usage, IO performance). Using perf to collect profiling data and
+generate a "flamegraph" of the work Linux must perform on behalf of your
+test is a meaningful way to compare the relative health of the
+system and how switching NFSD's IO mode changes what is observed.
+
+If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
+NFSD's debugfs interfaces, ideally the IO will be aligned relative to
+the underlying block device's logical_block_size. Also the memory buffer
+used to store the READ or WRITE payload must be aligned relative to the
+underlying block device's dma_alignment.
+
+But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
+it can:
+
+Misaligned READ:
+ If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
+ DIO-aligned block (on either end of the READ). The expanded READ is
+ verified with proper offset/len (logical_block_size) and
+ dma_alignment checking.
+
+Misaligned WRITE:
+ If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
+ middle and end as needed. The large middle segment is DIO-aligned
+ and the start and/or end are misaligned. Buffered IO is used for the
+ misaligned segments and O_DIRECT is used for the middle DIO-aligned
+ segment. DONTCACHE buffered IO is _not_ used for the misaligned
+ segments because using normal buffered IO offers significant RMW
+ performance benefit when handling streaming misaligned WRITEs.
+
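+As an illustration of the split arithmetic (numbers invented): with a
+4096-byte logical block size, a WRITE of offset=1000, len=10000 is
+serviced as a buffered head [1000, 4096), an O_DIRECT middle
+[4096, 8192) and a buffered tail [8192, 11000). In kernel terms::
+
+   /* Sketch only: find the DIO-aligned middle of [off, off + len),
+    * where lbs is the device's logical_block_size. */
+   u64 start = round_up(off, lbs);
+   u64 end = round_down(off + len, lbs);
+
+   /* [off, start) and [end, off + len) use buffered IO;
+    * [start, end), if non-empty, uses O_DIRECT. */
+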
+Tracing:
+ The nfsd_read_direct trace event shows how NFSD expands any
+ misaligned READ to the next DIO-aligned block (on either end of the
+ original READ, as needed).
+
+ This combination of trace events is useful for READs::
+
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
+
+ The nfsd_write_direct trace event shows how NFSD splits a given
+ misaligned WRITE into a DIO-aligned middle segment.
+
+ This combination of trace events is useful for WRITEs::
+
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
diff --git a/Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst b/Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst
new file mode 100644
index 000000000000..4d6b57dbab2a
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst
@@ -0,0 +1,547 @@
+NFSD Maintainer Entry Profile
+=============================
+
+A Maintainer Entry Profile supplements the top-level process
+documents (found in Documentation/process/) with customs that are
+specific to a subsystem and its maintainers. A contributor may use
+this document to set their expectations and avoid common mistakes.
+A maintainer may use these profiles to look across subsystems for
+opportunities to converge on best common practices.
+
+Overview
+--------
+The Network File System (NFS) is a standardized family of network
+protocols that enable access to files across a set of network-
+connected peer hosts. Applications on NFS clients access files that
+reside on file systems that are shared by NFS servers. A single
+network peer can act as both an NFS client and an NFS server.
+
+NFSD refers to the NFS server implementation included in the Linux
+kernel. An in-kernel NFS server has fast access to files stored
+in file systems local to that server. NFSD can share files stored
+on most of the file system types native to Linux, including xfs,
+ext4, btrfs, and tmpfs.
+
+Mailing list
+------------
+The linux-nfs@vger.kernel.org mailing list is a public list. Its
+purpose is to enable collaboration among developers working on the
+Linux NFS stack, both client and server. It is not a place for
+conversations that are not related directly to the Linux NFS stack.
+
+The linux-nfs mailing list is archived on `lore.kernel.org <https://lore.kernel.org/linux-nfs/>`_.
+
+The Linux NFS community does not have any chat room.
+
+Reporting bugs
+--------------
+If you experience an NFSD-related bug on a distribution-built
+kernel, please start by working with your Linux distributor.
+
+Bug reports against upstream Linux code bases are welcome on the
+linux-nfs@vger.kernel.org mailing list, where some active triage
+can be done. NFSD bugs may also be reported in the Linux kernel
+community's bugzilla at:
+
+ https://bugzilla.kernel.org
+
+Please file NFSD-related bugs under the "Filesystems/NFSD"
+component. In general, including as much detail as possible is a
+good start, including pertinent system log messages from both
+the client and server.
+
+User space software related to NFSD, such as mountd or the exportfs
+command, is contained in the nfs-utils package. Report problems
+with those components to linux-nfs@vger.kernel.org. You might be
+directed to move the report to a specific bug tracker.
+
+Contributor's Guide
+-------------------
+
+Standards compliance
+~~~~~~~~~~~~~~~~~~~~
+The priority is for NFSD to interoperate fully with the Linux NFS
+client. We also test against other popular NFS client
+implementations regularly at NFS bake-a-thon events (also known
+as plugfests). Non-Linux NFS clients are not part of upstream NFSD
+CI/CD.
+
+The NFSD community strives to provide an NFS server implementation
+that interoperates with all standards-compliant NFS client
+implementations. This is done by staying as close as is sensible to
+the normative mandates in the IETF's published NFS, RPC, and GSS-API
+standards.
+
+It is always useful to reference an RFC and section number in a code
+comment where behavior deviates from the standard (and even when the
+behavior is compliant but the implementation is obfuscatory).
+
+On the rare occasion when a deviation from standard-mandated
+behavior is needed, brief documentation of the use case or
+deficiencies in the standard is a required part of in-code
+documentation.
+
+Care must always be taken to avoid leaking local error codes (ie,
+errnos) to clients of NFSD. A proper NFS status code is always
+required in NFS protocol replies.
+
+NFSD administrative interfaces
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+NFSD administrative interfaces include:
+
+- an NFSD or SUNRPC module parameter
+
+- export options in /etc/exports
+
+- files under /proc/fs/nfsd/ or /proc/sys/sunrpc/
+
+- the NFSD netlink protocol
+
+Frequently, a request is made to introduce or modify one of NFSD's
+traditional administrative interfaces. Certainly it is technically
+easy to introduce a new administrative setting. However, there are
+good reasons why the NFSD maintainers prefer to leave that as a last
+resort:
+
+- As with any API, administrative interfaces are difficult to get
+ right.
+
+- Once they are documented and have a legacy of use, administrative
+ interfaces become difficult to modify or remove.
+
+- Every new administrative setting multiplies the NFSD test matrix.
+
+- The cost of one administrative interface is incremental, but costs
+ add up across all of the existing interfaces.
+
+It is often better for everyone if effort is made up front to
+understand the underlying requirement of the new setting, and
+then to try to make it tune itself (or to become otherwise
+unnecessary).
+
+If a new setting is indeed necessary, first consider adding it to
+the NFSD netlink protocol. Or if it doesn't need to be a reliable
+long term user space feature, it can be added to NFSD's menagerie of
+experimental settings which reside under /sys/kernel/debug/nfsd/ .
+
+Field observability
+~~~~~~~~~~~~~~~~~~~
+NFSD employs several different mechanisms for observing operation,
+including counters, printks, WARNings, and static trace points. Each
+have their strengths and weaknesses. Contributors should select the
+most appropriate tool for their task.
+
+- BUG must be avoided if at all possible, as it will frequently
+ result in a full system crash.
+
+- WARN is appropriate only when a full stack trace is useful.
+
+- printk can show detailed information. These must not be used
+ in code paths where they can be triggered repeatedly by remote
+ users.
+
+- dprintk can show detailed information, but can be enabled only
+ in pre-set groups. The overhead of emitting output makes dprintk
+ inappropriate for frequent operations like I/O.
+
+- Counters are always on, but provide little information about
+ individual events other than how frequently they occur.
+
+- static trace points can be enabled individually or in groups
+ (via a glob). These are generally low overhead, and thus are
+ favored for use in hot paths.
+
+- dynamic tracing, such as kprobes or eBPF, is quite flexible but
+ cannot be used in certain environments (eg, full kernel lockdown).
+
+Testing
+~~~~~~~
+The kdevops project
+
+ https://github.com/linux-kdevops/kdevops
+
+contains several NFS-specific workflows, as well as the community
+standard fstests suite. These workflows are based on open source
+testing tools such as ltp and fio. Contributors are encouraged to
+use these tools directly, or to install and use kdevops themselves,
+to verify their patches before submission.
+
+Coding style
+~~~~~~~~~~~~
+Follow the coding style preferences described in
+
+ Documentation/process/coding-style.rst
+
+with the following exceptions:
+
+- Add new local variables to a function in reverse Christmas tree
+ order
+
+- Use the kdoc comment style for
+ + non-static functions
+ + static inline functions
+ + static functions that are callbacks/virtual functions
+
+- All new function names start with ``nfsd_`` for non-NFS-version-
+ specific functions.
+
+- New function names that are specific to NFSv2 or NFSv3, or are
+ used by all minor versions of NFSv4, use ``nfsdN_`` where N is
+ the version.
+
+- New function names specific to an NFSv4 minor version can be
+ named with ``nfsd4M_`` where M is the minor version.
+
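+As a purely illustrative example combining these conventions (reverse
+Christmas tree ordering of locals; an NFSv4.1-specific name; the body
+is elided and the locals are placeholders)::
+
+   static __be32 nfsd41_example_operation(struct svc_rqst *rqstp)
+   {
+           struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
+           unsigned long flags;
+           int err;
+
+           /* ... body elided ... */
+           return nfs_ok;
+   }
+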
+Patch preparation
+~~~~~~~~~~~~~~~~~
+Read and follow all guidelines in
+
+ Documentation/process/submitting-patches.rst
+
+Use tagging to identify all patch authors. However, reviewers and
+testers should be added by replying to the email patch submission.
+Email is extensively used in order to publicly archive review and
+testing attributions. These tags are automatically inserted into
+your patches when they are applied.
+
+The code in the body of the diff already shows /what/ is being
+changed. Thus it is not necessary to repeat that in the patch
+description. Instead, the description should contain one or more
+of:
+
+- A brief problem statement ("what is this patch trying to fix?")
+ with a root-cause analysis.
+
+- End-user visible symptoms or items that a support engineer might
+ use to search for the patch, like stack traces.
+
+- A brief explanation of why the patch is the best way to address
+ the problem.
+
+- Any context that reviewers might need to understand the changes
+ made by the patch.
+
+- Any relevant benchmarking results, and/or functional test results.
+
+As detailed in Documentation/process/submitting-patches.rst,
+identify the point in history that the issue being addressed was
+introduced by using a Fixes: tag.
+
+Mention in the patch description if that point in history cannot be
+determined -- that is, no Fixes: tag can be provided. In this case,
+please make it clear to maintainers whether an LTS backport is
+needed even though there is no Fixes: tag.
+
+The NFSD maintainers prefer to add stable tagging themselves, after
+public discussion in response to the patch submission. Contributors
+may suggest stable tagging, but be aware that many version
+management tools add such stable Cc's when you post your patches.
+Don't add "Cc: stable" unless you are absolutely sure the patch
+needs to go to stable during the initial submission process.
+
+Patch submission
+~~~~~~~~~~~~~~~~
+Patches to NFSD are submitted via the kernel's email-based review
+process that is common to most other kernel subsystems.
+
+Just before each submission, rebase your patch or series on the
+nfsd-testing branch at
+
+ https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
+
+The NFSD subsystem is maintained separately from the Linux in-kernel
+NFS client. The NFSD maintainers do not normally take submissions
+for client changes, nor can they respond authoritatively to bug
+reports or feature requests for NFS client code.
+
+This means that contributors might be asked to resubmit patches if
+they were emailed to the incorrect set of maintainers and reviewers.
+This is not a rejection, but simply a correction of the submission
+process.
+
+When in doubt, consult the NFSD entry in the MAINTAINERS file to
+see which files and directories fall under the NFSD subsystem.
+
+The proper set of email addresses for NFSD patches are:
+
+To: the NFSD maintainers and reviewers listed in MAINTAINERS
+Cc: linux-nfs@vger.kernel.org and optionally linux-kernel@
+
+If there are other subsystems involved in the patches (for example
+MM or RDMA) their primary mailing list address can be included in
+the Cc: field. Other contributors and interested parties may be
+included there as well.
+
+In general we prefer that contributors use common patch email tools
+such as "git send-email" or "stg email format/send", which tend to
+get the details right without a lot of fuss.
+
+A series consisting of a single patch is not required to have a
+cover letter. However, a cover letter can be included if there is
+substantial context that is not appropriate to include in the
+patch description.
+
+Please note that, with an e-mail based submission process, series
+cover letters are not part of the work that is committed to the
+kernel source code base or its commit history. Therefore always try
+to keep pertinent information in the patch descriptions.
+
+Design documentation is welcome, but as cover letters are not
+preserved, a perhaps better option is to include a patch that adds
+such documentation under Documentation/filesystems/nfs/.
+
+Reviewers will ask about test coverage and what use cases the
+patches are expected to address. Please be prepared to answer these
+questions.
+
+Review comments from maintainers might be politely stated, but in
+general, these are not optional to address when they are actionable.
+If necessary, the maintainers retain the right to not apply patches
+when contributors refuse to address reasonable requests.
+
+Post changes to kernel source code and user space source code as
+separate series. You can connect the two series with comments in
+your cover letters.
+
+Generally the NFSD maintainers ask for a repost even for simple
+modifications in order to publicly archive the request and the
+resulting repost before it is pulled into the NFSD trees. This
+also enables us to rebuild a patch series quickly without missing
+changes that might have been discussed via email.
+
+Avoid frequently reposting large series with only small changes. As
+a rule of thumb, posting substantial changes more than once a week
+will result in reviewer overload.
+
+Remember, there are only a handful of subsystem maintainers and
+reviewers, but potentially many sources of contributions. The
+maintainers and reviewers, therefore, are always the less scalable
+resource. Be kind to your friendly neighborhood maintainer.
+
+Patch Acceptance
+~~~~~~~~~~~~~~~~
+There isn't a formal review process for NFSD, but we like to see
+at least two Reviewed-by: notices for patches that are more than
+simple clean-ups. Reviews are done in public on
+linux-nfs@vger.kernel.org and are archived on lore.kernel.org.
+
+Currently the NFSD patch queues are maintained in branches here:
+
+ https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
+
+The NFSD maintainers apply patches initially to the nfsd-testing
+branch, which is always open to new submissions. Patches can be
+applied while review is ongoing. nfsd-testing is a topic branch,
+so it can change frequently, it will be rebased, and your patch
+might get dropped if there is a problem with it.
+
+Generally a script-generated "thank you" email will indicate when
+your patch has been added to the nfsd-testing branch. You can track
+the progress of your patch using the linux-nfs patchworks instance:
+
+ https://patchwork.kernel.org/project/linux-nfs/list/
+
+While your patch is in nfsd-testing, it is exposed to a variety of
+test environments, including community zero-day bots, static
+analysis tools, and NFSD continuous integration testing. The soak
+period is three to four weeks.
+
+Each patch that survives in nfsd-testing for the soak period without
+changes is moved to the nfsd-next branch.
+
+The nfsd-next branch is automatically merged into linux-next and
+fs-next on a nightly basis.
+
+Patches that survive in nfsd-next are included in the next NFSD
+merge window pull request. These windows typically occur once every
+63 days (nine weeks).
+
+When the upstream merge window closes, the nfsd-next branch is
+renamed nfsd-fixes, and a new nfsd-next branch is created, based on
+the upstream -rc1 tag.
+
+Fixes that are destined for an upstream -rc release also run the
+nfsd-testing gauntlet, but are then applied to the nfsd-fixes
+branch. That branch is made available for Linus to pull after a
+short time. In order to limit the risk of introducing regressions,
+we limit such fixes to emergency situations or fixes to breakage
+that occurred during the most recent upstream merge.
+
+Please make it clear when submitting an emergency patch that
+immediate action (either application to -rc or LTS backport) is
+needed.
+
+Sensitive patch submissions and bug reports
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+CVEs are generated by specific members of the Linux kernel community
+and several external entities. The Linux NFS community does not emit
+or assign CVEs. CVEs are assigned after an issue and its fix are
+known.
+
+However, the NFSD maintainers sometimes receive sensitive security
+reports, and at times these are significant enough to need to be
+embargoed. In such rare cases, fixes can be developed and reviewed
+out of the public eye.
+
+Please be aware that many version management tools add the stable
+Cc's when you post your patches. This is generally a nuisance, but
+it can accidentally disclose an embargoed security issue.
+Don't add "Cc: stable" unless you are absolutely sure the patch
+needs to go to stable@ during the initial submission process.
+
+Patches that are merged without ever appearing on any list, and
+which carry a Reported-by: or Fixes: tag, are detected as suspicious
+by security-focused people. We therefore encourage that, after any
+private review, security-sensitive patches be posted to linux-nfs@
+for the usual public review, archiving, and test period.
+
+LLM-generated submissions
+~~~~~~~~~~~~~~~~~~~~~~~~~
+The Linux kernel community as a whole is still exploring the new
+world of LLM-generated code. The NFSD maintainers will entertain
+submission of patches that are partially or wholly generated by
+LLM-based development tools. Such submissions are held to the
+same standards as submissions created entirely by human authors:
+
+- The human contributor identifies themselves via a Signed-off-by:
+  tag. This tag counts as a Developer's Certificate of Origin (DCO).
+
+- The human contributor is solely responsible for code provenance
+ and any contamination by inadvertently-included code with a
+ conflicting license, as usual.
+
+- The human contributor must be able to answer and address review
+ questions. A patch description such as "This fixed my problem
+ but I don't know why" is not acceptable.
+
+- The contribution is subjected to the same test regimen as all
+ other submissions.
+
+- An indication (via a Generated-by: tag or otherwise) that the
+  contribution is LLM-generated is not required; see the example
+  below.
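+
+For example, the tag block of such a submission might look like this;
+the names are purely illustrative and the Generated-by: tag is
+optional::
+
+  Generated-by: <name and version of the LLM-based tool>
+  Signed-off-by: First Last <contributor@example.org>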
+
+It is easy to address review comments and fix requests in
+LLM-generated code. So easy, in fact, that it becomes tempting to
+repost refreshed code immediately. Please resist that temptation.
+
+As always, please avoid reposting series revisions more than once
+every 24 hours.
+
+Clean-up patches
+~~~~~~~~~~~~~~~~
+The NFSD maintainers discourage patches that perform simple
+clean-ups outside the context of other work. For example:
+
+* Addressing ``checkpatch.pl`` warnings after merge
+* Addressing :ref:`Local variable ordering<rcs>` issues
+* Addressing long-standing whitespace damage
+
+This is because the churn that such changes produce is felt to
+cost more than the clean-ups are worth.
+
+Conversely, spelling and grammar fixes are encouraged.
+
+Stable and LTS support
+----------------------
+Upstream NFSD continuous integration testing runs against LTS trees
+whenever they are updated.
+
+Please indicate when a patch containing a fix needs to be considered
+for LTS kernels, either via a Fixes: tag or explicit mention.
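+
+For example, the tags at the end of such a patch description might
+look like this; the commit hash and names are purely illustrative::
+
+  Fixes: 123456789abc ("NFSD: fix an example bug")
+  Cc: stable@vger.kernel.org
+  Signed-off-by: First Last <contributor@example.org>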
+
+Feature requests
+----------------
+There is no one way to make an official feature request, but
+discussion about the request should eventually make its way to
+the linux-nfs@vger.kernel.org mailing list for public review by
+the community.
+
+Subsystem boundaries
+~~~~~~~~~~~~~~~~~~~~
+NFSD itself is not much more than a protocol engine. This means its
+primary responsibility is to translate the NFS protocol into API
+calls in the Linux kernel. For example, NFSD is not responsible for
+knowing exactly how bytes or file attributes are managed on a block
+device. It relies on other kernel subsystems for that.
+
+If the subsystems on which NFSD relies do not implement a particular
+feature, even if the standard NFS protocols do support that feature,
+that usually means NFSD cannot provide that feature without
+substantial development work in other areas of the kernel.
+
+Specificity
+~~~~~~~~~~~
+Feature requests can come from anywhere, and thus can often be
+nebulous. A requester might not understand what a "use case" or
+"user story" is. These descriptive paradigms are often used by
+developers and architects to understand what is required of a
+design, but are terms of art in the software trade, not used in
+the everyday world.
+
+In order to prevent contributors and maintainers from becoming
+overwhelmed, we will not hesitate to say "no" politely to
+underspecified requests.
+
+Community roles and their authority
+-----------------------------------
+The purpose of Linux subsystem communities is to provide expertise
+and active stewardship of a narrow set of source files in the Linux
+kernel. This can include managing user space tooling as well.
+
+To contextualize the structure of the Linux NFS community that
+is responsible for stewardship of the NFS server code base, we
+define the community roles here.
+
+- **Contributor** : Anyone who submits a code change, bug fix,
+ recommendation, documentation fix, and so on. A contributor can
+ submit regularly or infrequently.
+
+- **Outside Contributor** : A contributor who is not a regular actor
+ in the Linux NFS community. This can mean someone who contributes
+ to other parts of the kernel, or someone who just noticed a
+ misspelling in a comment and sent a patch.
+
+- **Reviewer** : Someone who is named in the MAINTAINERS file as a
+ reviewer is an area expert who can request changes to contributed
+ code, and expects that contributors will address the request.
+
+- **External Reviewer** : Someone who is not named in the
+ MAINTAINERS file as a reviewer, but who is an area expert.
+ Examples include Linux kernel contributors with networking,
+ security, or persistent storage expertise, or developers who
+ contribute primarily to other NFS implementations.
+
+One or more people will take on the following roles. These people
+are often generically referred to as "maintainers", and are
+identified in the MAINTAINERS file with the "M:" tag under the NFSD
+subsystem.
+
+- **Upstream Release Manager** : This role is responsible for
+ curating contributions into a branch, reviewing test results, and
+ then sending a pull request during merge windows. There is a
+ trust relationship between the release manager and Linus.
+
+- **Bug Triager** : Someone who is a first responder to bug reports
+ submitted to the linux-nfs mailing list or bug trackers, and helps
+ troubleshoot and identify next steps.
+
+- **Security Lead** : The security lead handles contacts from the
+  security community to resolve immediate issues, as well as dealing
+  with long-term security issues such as supply chain concerns. For
+  upstream, that usually means checking whether contributions violate
+  licensing or other intellectual property agreements.
+
+- **Testing Lead** : The testing lead builds and runs the test
+ infrastructure for the subsystem. The testing lead may ask for
+ patches to be dropped because of ongoing high defect rates.
+
+- **LTS Maintainer** : The LTS maintainer is responsible for managing
+ the Fixes: and Cc: stable annotations on patches, and seeing that
+ patches that cannot be automatically applied to LTS kernels get
+ proper manual backports as necessary.
+
+- **Community Manager** : This umpire role can be asked to call balls
+ and strikes during conflicts, but is also responsible for ensuring
+ the health of the relationships within the community and for
+ facilitating discussions on long-term topics such as how to manage
+ growing technical debt.
diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
index ff9ae4a46530..044be965d75e 100644
--- a/Documentation/filesystems/nfs/reexport.rst
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -26,9 +26,13 @@ Reboot recovery
---------------
The NFS protocol's normal reboot recovery mechanisms don't work for the
-case when the reexport server reboots. Clients will lose any locks
-they held before the reboot, and further IO will result in errors.
-Closing and reopening files should clear the errors.
+case when the reexport server reboots because the source server has not
+rebooted, and so it is not in grace. Since the source server is not in
+grace, it cannot offer any guarantees that the file won't have been
+changed between the locks getting lost and any attempt to recover them.
+The same applies to delegations and any associated locks. Clients are
+not allowed to get file locks or delegations from a reexport server;
+any attempt will fail with "operation not supported".
Filehandle limits
-----------------
diff --git a/Documentation/filesystems/nfs/rpc-cache.rst b/Documentation/filesystems/nfs/rpc-cache.rst
index bb164eea969b..339efd75016a 100644
--- a/Documentation/filesystems/nfs/rpc-cache.rst
+++ b/Documentation/filesystems/nfs/rpc-cache.rst
@@ -78,7 +78,7 @@ Creating a Cache
include taking references to shared objects.
void update(struct cache_head \*orig, struct cache_head \*new)
- Set the 'content' fileds in 'new' from 'orig'.
+ Set the 'content' fields in 'new' from 'orig'.
int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h)
Optional. Used to provide a /proc file that lists the
diff --git a/Documentation/filesystems/nfs/rpc-server-gss.rst b/Documentation/filesystems/nfs/rpc-server-gss.rst
index ccaea9e7cea2..5c1a1c58fc27 100644
--- a/Documentation/filesystems/nfs/rpc-server-gss.rst
+++ b/Documentation/filesystems/nfs/rpc-server-gss.rst
@@ -29,7 +29,7 @@ The Linux kernel, at the moment, supports only the KRB5 mechanism, and
depends on GSSAPI extensions that are KRB5 specific.
GSSAPI is a complex library, and implementing it completely in kernel is
-unwarranted. However GSSAPI operations are fundementally separable in 2
+unwarranted. However, GSSAPI operations are fundamentally separable in 2
parts:
- initial context establishment
diff --git a/Documentation/filesystems/nilfs2.rst b/Documentation/filesystems/nilfs2.rst
index 6c49f04e9e0a..e3a5c8977f2c 100644
--- a/Documentation/filesystems/nilfs2.rst
+++ b/Documentation/filesystems/nilfs2.rst
@@ -231,7 +231,7 @@ file structures (nilfs_finfo), and per block structures (nilfs_binfo)::
The logs include regular files, directory files, symbolic link files
-and several meta data files. The mata data files are the files used
+and several meta data files. The meta data files are the files used
to maintain file system meta data. The current version of NILFS2 uses
the following meta data files::
diff --git a/Documentation/filesystems/ntfs.rst b/Documentation/filesystems/ntfs.rst
deleted file mode 100644
index 5bb093a26485..000000000000
--- a/Documentation/filesystems/ntfs.rst
+++ /dev/null
@@ -1,466 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-================================
-The Linux NTFS filesystem driver
-================================
-
-
-.. Table of contents
-
- - Overview
- - Web site
- - Features
- - Supported mount options
- - Known bugs and (mis-)features
- - Using NTFS volume and stripe sets
- - The Device-Mapper driver
- - The Software RAID / MD driver
- - Limitations when using the MD driver
-
-
-Overview
-========
-
-Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
-These include mkntfs, a full-featured ntfs filesystem format utility,
-ntfsundelete used for recovering files that were unintentionally deleted
-from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
-See the web site for more information.
-
-To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
-system type 'ntfs'. The driver currently supports read-only mode (with no
-fault-tolerance, encryption or journalling) and very limited, but safe, write
-support.
-
-For fault tolerance and raid support (i.e. volume and stripe sets), you can
-use the kernel's Software RAID / MD driver. See section "Using Software RAID
-with NTFS" for details.
-
-
-Web site
-========
-
-There is plenty of additional information on the linux-ntfs web site
-at http://www.linux-ntfs.org/
-
-The web site has a lot of additional information, such as a comprehensive
-FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
-userspace utilities, etc.
-
-
-Features
-========
-
-- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and
- earlier kernels. This new driver implements NTFS read support and is
- functionally equivalent to the old ntfs driver and it also implements limited
- write support. The biggest limitation at present is that files/directories
- cannot be created or deleted. See below for the list of write features that
- are so far supported. Another limitation is that writing to compressed files
- is not implemented at all. Also, neither read nor write access to encrypted
- files is so far implemented.
-- The new driver has full support for sparse files on NTFS 3.x volumes which
- the old driver isn't happy with.
-- The new driver supports execution of binaries due to mmap() now being
- supported.
-- The new driver supports loopback mounting of files on NTFS which is used by
- some Linux distributions to enable the user to run Linux from an NTFS
- partition by creating a large file while in Windows and then loopback
- mounting the file while in Linux and creating a Linux filesystem on it that
- is used to install Linux on it.
-- A comparison of the two drivers using::
-
- time find . -type f -exec md5sum "{}" \;
-
- run three times in sequence with each driver (after a reboot) on a 1.4GiB
- NTFS partition, showed the new driver to be 20% faster in total time elapsed
- (from 9:43 minutes on average down to 7:53). The time spent in user space
- was unchanged but the time spent in the kernel was decreased by a factor of
- 2.5 (from 85 CPU seconds down to 33).
-- The driver does not support short file names in general. For backwards
- compatibility, we implement access to files using their short file names if
- they exist. The driver will not create short file names however, and a
- rename will discard any existing short file name.
-- The new driver supports exporting of mounted NTFS volumes via NFS.
-- The new driver supports async io (aio).
-- The new driver supports fsync(2), fdatasync(2), and msync(2).
-- The new driver supports readv(2) and writev(2).
-- The new driver supports access time updates (including mtime and ctime).
-- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present
- only very limited support for highly fragmented files, i.e. ones which have
- their data attribute split across multiple extents, is included. Another
- limitation is that at present truncate(2) will never create sparse files,
- since to mark a file sparse we need to modify the directory entry for the
- file and we do not implement directory modifications yet.
-- The new driver supports write(2) which can both overwrite existing data and
- extend the file size so that you can write beyond the existing data. Also,
- writing into sparse regions is supported and the holes are filled in with
- clusters. But at present only limited support for highly fragmented files,
- i.e. ones which have their data attribute split across multiple extents, is
- included. Another limitation is that write(2) will never create sparse
- files, since to mark a file sparse we need to modify the directory entry for
- the file and we do not implement directory modifications yet.
-
-Supported mount options
-=======================
-
-In addition to the generic mount options described by the manual page for the
-mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
-following mount options:
-
-======================= =======================================================
-iocharset=name Deprecated option. Still supported but please use
- nls=name in the future. See description for nls=name.
-
-nls=name Character set to use when returning file names.
- Unlike VFAT, NTFS suppresses names that contain
- unconvertible characters. Note that most character
- sets contain insufficient characters to represent all
- possible Unicode characters that can exist on NTFS.
- To be sure you are not missing any files, you are
- advised to use nls=utf8 which is capable of
- representing all Unicode characters.
-
-utf8=<bool> Option no longer supported. Currently mapped to
- nls=utf8 but please use nls=utf8 in the future and
- make sure utf8 is compiled either as module or into
- the kernel. See description for nls=name.
-
-uid=
-gid=
-umask= Provide default owner, group, and access mode mask.
- These options work as documented in mount(8). By
- default, the files/directories are owned by root and
- he/she has read and write permissions, as well as
- browse permission for directories. No one else has any
- access permissions. I.e. the mode on all files is by
- default rw------- and for directories rwx------, a
- consequence of the default fmask=0177 and dmask=0077.
- Using a umask of zero will grant all permissions to
- everyone, i.e. all files and directories will have mode
- rwxrwxrwx.
-
-fmask=
-dmask= Instead of specifying umask which applies both to
- files and directories, fmask applies only to files and
- dmask only to directories.
-
-sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
- Otherwise the default behaviour is to abort mount if
- any unknown options are found.
-
-show_sys_files=<BOOL> If show_sys_files is specified, show the system files
- in directory listings. Otherwise the default behaviour
- is to hide the system files.
- Note that even when show_sys_files is specified, "$MFT"
- will not be visible due to bugs/mis-features in glibc.
- Further, note that irrespective of show_sys_files, all
- files are accessible by name, i.e. you can always do
- "ls -l \$UpCase" for example to specifically show the
- system file containing the Unicode upcase table.
-
-case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
- case sensitive and create file names in the POSIX
- namespace. Otherwise the default behaviour is to treat
- file names as case insensitive and to create file names
- in the WIN32/LONG name space. Note, the Linux NTFS
- driver will never create short file names and will
- remove them on rename/delete of the corresponding long
- file name.
- Note that files remain accessible via their short file
- name, if it exists. If case_sensitive, you will need
- to provide the correct case of the short file name.
-
-disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse
- regions, i.e. holes, inside files is disabled for the
- volume (for the duration of this mount only). By
- default, creation of sparse regions is enabled, which
- is consistent with the behaviour of traditional Unix
- filesystems.
-
-errors=opt What to do when critical filesystem errors are found.
- Following values can be used for "opt":
-
- ======== =========================================
- continue DEFAULT, try to clean-up as much as
- possible, e.g. marking a corrupt inode as
- bad so it is no longer accessed, and then
- continue.
- recover At present only supported is recovery of
- the boot sector from the backup copy.
- If read-only mount, the recovery is done
- in memory only and not written to disk.
- ======== =========================================
-
- Note that the options are additive, i.e. specifying::
-
- errors=continue,errors=recover
-
- means the driver will attempt to recover and if that
- fails it will clean-up as much as possible and
- continue.
-
-mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
- setting is not persistent across mounts and can be
- changed from mount to mount but cannot be changed on
- remount). Values of 1 to 4 are allowed, 1 being the
- default. The MFT zone multiplier determines how much
- space is reserved for the MFT on the volume. If all
- other space is used up, then the MFT zone will be
- shrunk dynamically, so this has no impact on the
- amount of free space. However, it can have an impact
- on performance by affecting fragmentation of the MFT.
- In general use the default. If you have a lot of small
- files then use a higher value. The values have the
- following meaning:
-
- ===== =================================
- Value MFT zone size (% of volume size)
- ===== =================================
- 1 12.5%
- 2 25%
- 3 37.5%
- 4 50%
- ===== =================================
-
- Note this option is irrelevant for read-only mounts.
-======================= =======================================================
-
-
-Known bugs and (mis-)features
-=============================
-
-- The link count on each directory inode entry is set to 1, due to Linux not
- supporting directory hard links. This may well confuse some user space
- applications, since the directory names will have the same inode numbers.
- This also speeds up ntfs_read_inode() immensely. And we haven't found any
- problems with this approach so far. If you find a problem with this, please
- let us know.
-
-
-Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
-list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
-
-
-Using NTFS volume and stripe sets
-=================================
-
-For support of volume and stripe sets, you can either use the kernel's
-Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
-the recommended one to use for linear raid. But the latter is required for
-raid level 5. For striping and mirroring, either driver should work fine.
-
-
-The Device-Mapper driver
-------------------------
-
-You will need to create a table of the components of the volume/stripe set and
-how they fit together and load this into the kernel using the dmsetup utility
-(see man 8 dmsetup).
-
-Linear volume sets, i.e. linear raid, has been tested and works fine. Even
-though untested, there is no reason why stripe sets, i.e. raid level 0, and
-mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e.
-raid level 5, unfortunately cannot work yet because the current version of the
-Device-Mapper driver does not support raid level 5. You may be able to use the
-Software RAID / MD driver for raid level 5, see the next section for details.
-
-To create the table describing your volume you will need to know each of its
-components and their sizes in sectors, i.e. multiples of 512-byte blocks.
-
-For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
-example if one of your partitions is /dev/hda2 you would do::
-
- $ fdisk -ul /dev/hda
-
- Disk /dev/hda: 81.9 GB, 81964302336 bytes
- 255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
- Units = sectors of 1 * 512 = 512 bytes
-
- Device Boot Start End Blocks Id System
- /dev/hda1 * 63 4209029 2104483+ 83 Linux
- /dev/hda2 4209030 37768814 16779892+ 86 NTFS
- /dev/hda3 37768815 46170809 4200997+ 83 Linux
-
-And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
-33559785 sectors.
-
-For Win2k and later dynamic disks, you can for example use the ldminfo utility
-which is part of the Linux LDM tools (the latest version at the time of
-writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
-
- http://www.linux-ntfs.org/
-
-Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
-into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
-will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
-able to compile this yourself easily so use the binary version!
-
-Then you would use ldminfo in dump mode to obtain the necessary information::
-
- $ ./ldminfo --dump /dev/hda
-
-This would dump the LDM database found on /dev/hda which describes all of your
-dynamic disks and all the volumes on them. At the bottom you will see the
-VOLUME DEFINITIONS section which is all you really need. You may need to look
-further above to determine which of the disks in the volume definitions is
-which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
-look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
-section). You can then find these Disk Ids in the VBLK DATABASE section in the
-<Disk> components where you will get the LDM Name for the disk that is found in
-the VOLUME DEFINITIONS section.
-
-Note you will also need to enable the LDM driver in the Linux kernel. If your
-distribution did not enable it, you will need to recompile the kernel with it
-enabled. This will create the LDM partitions on each device at boot time. You
-would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
-in the Device-Mapper table.
-
-You can also bypass using the LDM driver by using the main device (e.g.
-/dev/hda) and then using the offsets of the LDM partitions into this device as
-the "Start sector of device" when creating the table. Once again ldminfo would
-give you the correct information to do this.
-
-Assuming you know all your devices and their sizes things are easy.
-
-For a linear raid the table would look like this (note all values are in
-512-byte sectors)::
-
- # Offset into Size of this Raid type Device Start sector
- # volume device of device
- 0 1028161 linear /dev/hda1 0
- 1028161 3903762 linear /dev/hdb2 0
- 4931923 2103211 linear /dev/hdc1 0
-
-For a striped volume, i.e. raid level 0, you will need to know the chunk size
-you used when creating the volume. Windows uses 64kiB as the default, so it
-will probably be this unless you changes the defaults when creating the array.
-
-For a raid level 0 the table would look like this (note all values are in
-512-byte sectors)::
-
- # Offset Size Raid Number Chunk 1st Start 2nd Start
- # into of the type of size Device in Device in
- # volume volume stripes device device
- 0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
-
-If there are more than two devices, just add each of them to the end of the
-line.
-
-Finally, for a mirrored volume, i.e. raid level 1, the table would look like
-this (note all values are in 512-byte sectors)::
-
- # Ofs Size Raid Log Number Region Should Number Source Start Target Start
- # in of the type type of log size sync? of Device in Device in
- # vol volume params mirrors Device Device
- 0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
-
-If you are mirroring to multiple devices you can specify further targets at the
-end of the line.
-
-Note the "Should sync?" parameter "nosync" means that the two mirrors are
-already in sync which will be the case on a clean shutdown of Windows. If the
-mirrors are not clean, you can specify the "sync" option instead of "nosync"
-and the Device-Mapper driver will then copy the entirety of the "Source Device"
-to the "Target Device" or if you specified multiple target devices to all of
-them.
-
-Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
-and hand it over to dmsetup to work with, like so::
-
- $ dmsetup create myvolume1 /etc/ntfsvolume1
-
-You can obviously replace "myvolume1" with whatever name you like.
-
-If it all worked, you will now have the device /dev/device-mapper/myvolume1
-which you can then just use as an argument to the mount command as usual to
-mount the ntfs volume. For example::
-
- $ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
-
-(You need to create the directory /mnt/myvol1 first and of course you can use
-anything you like instead of /mnt/myvol1 as long as it is an existing
-directory.)
-
-It is advisable to do the mount read-only to see if the volume has been setup
-correctly to avoid the possibility of causing damage to the data on the ntfs
-volume.
-
-
-The Software RAID / MD driver
------------------------------
-
-An alternative to using the Device-Mapper driver is to use the kernel's
-Software RAID / MD driver. For which you need to set up your /etc/raidtab
-appropriately (see man 5 raidtab).
-
-Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
-0, have been tested and work fine (though see section "Limitations when using
-the MD driver with NTFS volumes" especially if you want to use linear raid).
-Even though untested, there is no reason why mirrors, i.e. raid level 1, and
-stripes with parity, i.e. raid level 5, should not work, too.
-
-You have to use the "persistent-superblock 0" option for each raid-disk in the
-NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
-superblock used by the MD driver would damage the NTFS volume.
-
-Windows by default uses a stripe chunk size of 64k, so you probably want the
-"chunk-size 64k" option for each raid-disk, too.
-
-For example, if you have a stripe set consisting of two partitions /dev/hda5
-and /dev/hdb1 your /etc/raidtab would look like this::
-
- raiddev /dev/md0
- raid-level 0
- nr-raid-disks 2
- nr-spare-disks 0
- persistent-superblock 0
- chunk-size 64k
- device /dev/hda5
- raid-disk 0
- device /dev/hdb1
- raid-disk 1
-
-For linear raid, just change the raid-level above to "raid-level linear", for
-mirrors, change it to "raid-level 1", and for stripe sets with parity, change
-it to "raid-level 5".
-
-Note for stripe sets with parity you will also need to tell the MD driver
-which parity algorithm to use by specifying the option "parity-algorithm
-which", where you need to replace "which" with the name of the algorithm to
-use (see man 5 raidtab for available algorithms) and you will have to try the
-different available algorithms until you find one that works. Make sure you
-are working read-only when playing with this as you may damage your data
-otherwise. If you find which algorithm works please let us know (email the
-linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
-IRC in channel #ntfs on the irc.freenode.net network) so we can update this
-documentation.
-
-Once the raidtab is setup, run for example raid0run -a to start all devices or
-raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
-
-Then just use the mount command as usual to mount the ntfs volume using for
-example::
-
- mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
-
-It is advisable to do the mount read-only to see if the md volume has been
-setup correctly to avoid the possibility of causing damage to the data on the
-ntfs volume.
-
-
-Limitations when using the Software RAID / MD driver
------------------------------------------------------
-
-Using the md driver will not work properly if any of your NTFS partitions have
-an odd number of sectors. This is especially important for linear raid as all
-data after the first partition with an odd number of sectors will be offset by
-one or more sectors so if you mount such a partition with write support you
-will cause massive damage to the data on the volume which will only become
-apparent when you try to use the volume again under Windows.
-
-So when using linear raid, make sure that all your partitions have an even
-number of sectors BEFORE attempting to use it. You have been warned!
-
-Even better is to simply use the Device-Mapper for linear raid and then you do
-not have this problem with odd numbers of sectors.
diff --git a/Documentation/filesystems/ntfs3.rst b/Documentation/filesystems/ntfs3.rst
index f0cf05cad2ba..2b86a9b3a6de 100644
--- a/Documentation/filesystems/ntfs3.rst
+++ b/Documentation/filesystems/ntfs3.rst
@@ -112,7 +112,7 @@ this table marked with no it means default is without **no**.
Todo list
=========
- Full journaling support over JBD. Currently journal replaying is supported
- which is not necessarily as effectice as JBD would be.
+ which is not necessarily as effective as JBD would be.
References
==========
diff --git a/Documentation/filesystems/ocfs2-online-filecheck.rst b/Documentation/filesystems/ocfs2-online-filecheck.rst
index 2257bb53edc1..9e8449416e0b 100644
--- a/Documentation/filesystems/ocfs2-online-filecheck.rst
+++ b/Documentation/filesystems/ocfs2-online-filecheck.rst
@@ -58,33 +58,33 @@ inode, fixing inode and setting the size of result record history.
# echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check
# cat /sys/fs/ocfs2/<devname>/filecheck/check
-The output is like this::
+ The output is like this::
INO DONE ERROR
39502 1 GENERATION
- <INO> lists the inode numbers.
- <DONE> indicates whether the operation has been finished.
- <ERROR> says what kind of errors was found. For the detailed error numbers,
- please refer to the file linux/fs/ocfs2/filecheck.h.
+ <INO> lists the inode numbers.
+ <DONE> indicates whether the operation has been finished.
+ <ERROR> says what kind of errors was found. For the detailed error numbers,
+ please refer to the file linux/fs/ocfs2/filecheck.h.
2. If you determine to fix this inode, do::
# echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix
# cat /sys/fs/ocfs2/<devname>/filecheck/fix
-The output is like this:::
+ The output is like this::
INO DONE ERROR
39502 1 SUCCESS
-This time, the <ERROR> column indicates whether this fix is successful or not.
+ This time, the <ERROR> column indicates whether this fix is successful or not.
3. The record cache is used to store the history of check/fix results. It's
-default size is 10, and can be adjust between the range of 10 ~ 100. You can
-adjust the size like this::
+   default size is 10, and can be adjusted within the range of 10 ~ 100. You can
+ adjust the size like this::
- # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set
+ # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set
Fixing stuff
============
diff --git a/Documentation/filesystems/orangefs.rst b/Documentation/filesystems/orangefs.rst
index 463e37694250..931159e61796 100644
--- a/Documentation/filesystems/orangefs.rst
+++ b/Documentation/filesystems/orangefs.rst
@@ -274,7 +274,7 @@ then contains:
of kcalloced memory. This memory is used as an array of pointers
to each of the pages in the IO buffer through a call to get_user_pages.
* desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
- bytes of kcalloced memory. This memory is further intialized:
+ bytes of kcalloced memory. This memory is further initialized:
user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
structure. user_desc->ptr points to the IO buffer.
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index eb7d2c88ddec..ab989807a2cb 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -9,7 +9,7 @@ Overlay Filesystem
This document describes a prototype for a new approach to providing
overlay-filesystem functionality in Linux (sometimes referred to as
union-filesystems). An overlay-filesystem tries to present a
-filesystem which is the result over overlaying one filesystem on top
+filesystem which is the result of overlaying one filesystem on top
of the other.
@@ -39,7 +39,7 @@ objects in the original filesystem.
On 64bit systems, even if all overlay layers are not on the same
underlying filesystem, the same compliant behavior could be achieved
with the "xino" feature. The "xino" feature composes a unique object
-identifier from the real object st_ino and an underlying fsid index.
+identifier from the real object st_ino and an underlying fsid number.
The "xino" feature uses the high inode number bits for fsid, because the
underlying filesystems rarely use the high inode number bits. In case
the underlying inode number does overflow into the high xino bits, overlay
@@ -61,7 +61,7 @@ Inode properties
|Configuration | Persistent | Uniform | st_ino == d_ino | d_ino == i_ino |
| | st_ino | st_dev | | [*] |
+==============+=====+======+=====+======+========+========+========+=======+
-| | dir | !dir | dir | !dir | dir + !dir | dir | !dir |
+| | dir | !dir | dir | !dir | dir | !dir | dir | !dir |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| All layers | Y | Y | Y | Y | Y | Y | Y | Y |
| on same fs | | | | | | | | |
@@ -118,7 +118,7 @@ Where both upper and lower objects are directories, a merged directory
is formed.
At mount time, the two directories given as mount options "lowerdir" and
-"upperdir" are combined into a merged directory:
+"upperdir" are combined into a merged directory::
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
workdir=/work /merged
@@ -145,7 +145,9 @@ filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed. This is done using whiteouts and opaque
directories (non-directories are always opaque).
-A whiteout is created as a character device with 0/0 device number.
+A whiteout is created as a character device with 0/0 device number or
+as a zero-size regular file with the xattr "trusted.overlay.whiteout".
+
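+A user space tool preparing an upper layer directly could create a
+whiteout for a file "foo", for example, like this (path illustrative)::
+
+    mknod upper/foo c 0 0
+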
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.
@@ -154,6 +156,13 @@ A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.
+An opaque directory should not contain any whiteouts, because they do not
+serve any purpose. A merge directory containing regular files with the xattr
+"trusted.overlay.whiteout", should be additionally marked by setting the xattr
+"trusted.overlay.opaque" to "x" on the merge directory itself.
+This is needed to avoid the overhead of checking the
+"trusted.overlay.whiteout" xattr on all entries during readdir in the
+common case.
+
readdir
-------
@@ -172,12 +181,12 @@ directory is being read. This is unlikely to be noticed by many
programs.
seek offsets are assigned sequentially when the directories are read.
-Thus if
+Thus if:
- - read part of a directory
- - remember an offset, and close the directory
- - re-open the directory some time later
- - seek to the remembered offset
+ - read part of a directory
+ - remember an offset, and close the directory
+ - re-open the directory some time later
+ - seek to the remembered offset
there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
@@ -195,7 +204,7 @@ handle it in two different ways:
1. return EXDEV error: this error is returned by rename(2) when trying to
move a file or directory across filesystem boundaries. Hence
- applications are usually prepared to hande this error (mv(1) for example
+ applications are usually prepared to handle this error (mv(1) for example
recursively copies the directory tree). This is the default behavior.
2. If the "redirect_dir" feature is enabled, then the directory will be
@@ -235,7 +244,7 @@ Mount options:
Redirects are not created and not followed.
- "redirect_dir=off":
If "redirect_always_follow" is enabled in the kernel/module config,
- this "off" traslates to "follow", otherwise it translates to "nofollow".
+ this "off" translates to "follow", otherwise it translates to "nofollow".
When the NFS export feature is enabled, every copied up directory is
indexed by the file handle of the lower inode and a file handle of the
@@ -257,7 +266,7 @@ Non-directories
Objects that are not directories (files, symlinks, device-special
files etc.) are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
-the requires write-access, such as opening for write access, changing
+that requires write-access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
@@ -283,21 +292,35 @@ rename or unlink will of course be noticed and handled).
Permission model
----------------
+An overlay filesystem stashes credentials that will be used when
+accessing lower or upper filesystems.
+
+In the old mount api the credentials of the task calling mount(2) are
+stashed. In the new mount api the credentials of the task creating the
+superblock through FSCONFIG_CMD_CREATE command of fsconfig(2) are
+stashed.
+
+Starting with kernel v6.15 it is possible to use the "override_creds"
+mount option which will cause the credentials of the calling task to be
+recorded. Note that "override_creds" is only meaningful when used with
+the new mount api as the old mount api combines setting options and
+superblock creation in a single mount(2) syscall.
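+
+As a minimal sketch of superblock creation with the new mount api
+(paths illustrative), the stashed credentials here would be those of
+the task issuing FSCONFIG_CMD_CREATE::
+
+    int fs_fd = fsopen("overlay", FSOPEN_CLOEXEC);
+
+    fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", "/lower", 0);
+    fsconfig(fs_fd, FSCONFIG_SET_STRING, "upperdir", "/upper", 0);
+    fsconfig(fs_fd, FSCONFIG_SET_STRING, "workdir", "/work", 0);
+    fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+
+    int mnt_fd = fsmount(fs_fd, FSMOUNT_CLOEXEC, 0);
+    move_mount(mnt_fd, "", AT_FDCWD, "/merged", MOVE_MOUNT_F_EMPTY_PATH);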
+
Permission checking in the overlay filesystem follows these principles:
1) permission check SHOULD return the same result before and after copy up
2) task creating the overlay mount MUST NOT gain additional privileges
- 3) non-mounting task MAY gain additional privileges through the overlay,
- compared to direct access on underlying lower or upper filesystems
+ 3) task[*] MAY gain additional privileges through the overlay,
+ compared to direct access on underlying lower or upper filesystems
-This is achieved by performing two permission checks on each access
+This is achieved by performing two permission checks on each access:
a) check if current task is allowed access based on local DAC (owner,
group, mode and posix acl), as well as MAC checks
- b) check if mounting task would be allowed real operation on lower or
+ b) check if stashed credentials would be allowed real operation on lower or
upper layer based on underlying filesystem permissions, again including
MAC checks
@@ -306,16 +329,16 @@ are copied up. On the other hand it can result in server enforced
permissions (used by NFS, for example) being ignored (3).
Check (b) ensures that no task gains permissions to underlying layers that
-the mounting task does not have (2). This also means that it is possible
+the stashed credentials do not have (2). This also means that it is possible
to create setups where the consistency rule (1) does not hold; normally,
-however, the mounting task will have sufficient privileges to perform all
-operations.
+however, the stashed credentials will have sufficient privileges to
+perform all operations.
-Another way to demonstrate this model is drawing parallels between
+Another way to demonstrate this model is drawing parallels between::
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,... /merged
-and
+and::
cp -a /lower /upper
mount --bind /upper /merged
@@ -328,7 +351,7 @@ Multiple lower layers
---------------------
Multiple lower layers can now be given using the colon (":") as a
-separator character between the directory names. For example:
+separator character between the directory names. For example::
mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged
@@ -339,14 +362,30 @@ The specified lower directories will be stacked beginning from the
rightmost one and going left. In the above example lower1 will be the
top, lower2 the middle and lower3 the bottom layer.
+Note: directory names containing colons can be provided as lower layer by
+escaping the colons with a single backslash. For example::
+
+ mount -t overlay overlay -olowerdir=/a\:lower\:\:dir /merged
+
+Since kernel version v6.8, directory names containing colons can also
+be configured as lower layer using the "lowerdir+" mount options and the
+fsconfig syscall from new mount api. For example::
+
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/a:lower::dir", 0);
+
+In the latter case, colons in lower layer directory names will be escaped
+as octal characters (\072) when displayed in /proc/self/mountinfo.
Metadata only copy up
---------------------
-When metadata only copy up feature is enabled, overlayfs will only copy
+When the "metacopy" feature is enabled, overlayfs will only copy
up metadata (as opposed to whole file), when a metadata specific operation
-like chown/chmod is performed. Full file will be copied up later when
-file is opened for WRITE operation.
+like chown/chmod is performed. An upper file in this state is marked with
+the "trusted.overlay.metacopy" xattr, which indicates that the upper file
+contains no data. The data will be copied up later when the file is opened
+for a WRITE operation. After the lower file's data is copied up,
+the "trusted.overlay.metacopy" xattr is removed from the upper file.
In other words, this is delayed data copy up operation and data is copied
up when there is a need to actually modify data.
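+
+For example, metadata only copy up can be enabled explicitly with
+(paths illustrative)::
+
+    mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
+    workdir=/work,metacopy=on /merged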
@@ -386,13 +425,13 @@ of information from up to three different layers:
The "lower data" file can be on any lower layer, except from the top most
lower layer.
-Below the top most lower layer, any number of lower most layers may be defined
+Below the topmost lower layer, any number of lowermost layers may be defined
as "data-only" lower layers, using double colon ("::") separators.
A normal lower layer is not allowed to be below a data-only layer, so single
colon separators are not allowed to the right of double colon ("::") separators.
-For example:
+For example::
mount -t overlay overlay -olowerdir=/l1:/l2:/l3::/do1::/do2 /merged
@@ -404,6 +443,87 @@ Only the data of the files in the "data-only" lower layers may be visible
when a "metacopy" file in one of the lower layers above it, has a "redirect"
to the absolute path of the "lower data" file in the "data-only" lower layer.
+Instead of explicitly enabling "metacopy=on" it is sufficient to specify at
+least one data-only layer to enable redirection of data to a data-only layer.
+In this case other forms of metacopy are rejected. Note: this way, data-only
+layers may be used together with "userxattr", in which case careful attention
+must be given to privileges needed to change the "user.overlay.redirect" xattr
+to prevent misuse.
+
+Since kernel version v6.8, "data-only" lower layers can also be added using
+the "datadir+" mount options and the fsconfig syscall from new mount api.
+For example::
+
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l1", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l2", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l3", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do1", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do2", 0);
+
+
+Specifying layers via file descriptors
+--------------------------------------
+
+Since kernel v6.13, overlayfs supports specifying layers via file descriptors in
+addition to specifying them as paths. This feature is available for the
+"datadir+", "lowerdir+", "upperdir", and "workdir+" mount options with the
+fsconfig syscall from the new mount api::
+
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower1);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower2);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower3);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data1);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data2);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "workdir", NULL, fd_work);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "upperdir", NULL, fd_upper);
+
+
+fs-verity support
+-----------------
+
+During metadata copy up of a lower file, if the source file has
+fs-verity enabled and overlay verity support is enabled, then the
+digest of the lower file is added to the "trusted.overlay.metacopy"
+xattr. This is then used to verify the content of the lower file
+each time the metacopy file is opened.
+
+When a layer containing verity xattrs is used, it means that any such
+metacopy file in the upper layer is guaranteed to match the content
+that was in the lower at the time of the copy-up. If at any time
+(during a mount, after a remount, etc) such a file in the lower is
+replaced or modified in any way, access to the corresponding file in
+overlayfs will result in EIO errors (either on open, due to overlayfs
+digest check, or from a later read due to fs-verity) and a detailed
+error is printed to the kernel logs. For more details of how fs-verity
+file access works, see :ref:`Documentation/filesystems/fsverity.rst
+<accessing_verity_files>`.
+
+Verity can be used as a general robustness check to detect accidental
+changes in the overlayfs directories in use. But, with additional care
+it can also give more powerful guarantees. For example, if the upper
+layer is fully trusted (by using dm-verity or something similar), then
+an untrusted lower layer can be used to supply validated file content
+for all metacopy files. If additionally the untrusted lower
+directories are specified as "Data-only", then they can only supply
+such file content, and the entire mount can be trusted to match the
+upper layer.
+
+This feature is controlled by the "verity" mount option, which
+supports the following values (an example mount follows the list):
+
+- "off":
+ The metacopy digest is never generated or used. This is the
+ default if verity option is not specified.
+- "on":
+ Whenever a metacopy file specifies an expected digest, the
+ corresponding data file must match the specified digest. When
+  generating a metacopy file, the verity digest will be set in it
+ based on the source file (if it has one).
+- "require":
+ Same as "on", but additionally all metacopy files must specify a
+ digest (or EIO is returned on open). This means metadata copy up
+ will only be used if the data file has fs-verity enabled,
+ otherwise a full copy-up is used.
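+
+For example, to require verity for all metacopy files (paths
+illustrative)::
+
+    mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
+    workdir=/work,verity=require /merged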
Sharing and copying layers
--------------------------
@@ -417,34 +537,58 @@ Using an upper layer path and/or a workdir path that are already used by
another overlay mount is not allowed and may fail with EBUSY. Using
partially overlapping paths is not allowed and may fail with EBUSY.
If files are accessed from two overlayfs mounts which share or overlap the
-upper layer and/or workdir path the behavior of the overlay is undefined,
+upper layer and/or workdir path, the behavior of the overlay is undefined,
though it will not result in a crash or deadlock.
Mounting an overlay using an upper layer path, where the upper layer path
was previously used by another mounted overlay in combination with a
-different lower layer path, is allowed, unless the "inodes index" feature
-or "metadata only copy up" feature is enabled.
+different lower layer path, is allowed, unless the "index" or "metacopy"
+features are enabled.
-With the "inodes index" feature, on the first time mount, an NFS file
+With the "index" feature, on the first time mount, an NFS file
handle of the lower layer root directory, along with the UUID of the lower
filesystem, are encoded and stored in the "trusted.overlay.origin" extended
attribute on the upper layer root directory. On subsequent mount attempts,
the lower root directory file handle and lower filesystem UUID are compared
to the stored origin in upper root directory. On failure to verify the
lower root origin, mount will fail with ESTALE. An overlayfs mount with
-"inodes index" enabled will fail with EOPNOTSUPP if the lower filesystem
+"index" enabled will fail with EOPNOTSUPP if the lower filesystem
does not support NFS export, lower filesystem does not have a valid UUID or
if the upper filesystem does not support extended attributes.
-For "metadata only copy up" feature there is no verification mechanism at
+For the "metacopy" feature, there is no verification mechanism at
mount time. So if same upper is mounted with different set of lower, mount
probably will succeed but expect the unexpected later on. So don't do it.
It is quite a common practice to copy overlay layers to a different
directory tree on the same or different underlying filesystem, and even
-to a different machine. With the "inodes index" feature, trying to mount
+to a different machine. With the "index" feature, trying to mount
the copied layers will fail the verification of the lower root file handle.
+Nesting overlayfs mounts
+------------------------
+
+It is possible to use a lower directory that is stored on an overlayfs
+mount. For regular files this does not need any special care. However, files
+that have overlayfs attributes, such as whiteouts or "overlay.*" xattrs, will
+be interpreted by the underlying overlayfs mount and stripped out. In order to
+allow the second overlayfs mount to see the attributes, they must be escaped.
+
+Overlayfs specific xattrs are escaped by using a special prefix of
+"overlay.overlay.". So, a file with a "trusted.overlay.overlay.metacopy" xattr
+in the lower dir will be exposed as a regular file with a
+"trusted.overlay.metacopy" xattr in the overlayfs mount. This can be nested by
+repeating the prefix multiple times, as each instance only removes one prefix.
+
+A lower dir with a regular whiteout will always be handled by the overlayfs
+mount. To support storing an effective whiteout file in an overlayfs mount,
+an alternative form of whiteout is supported. This form is a regular, zero-size
+file with the "overlay.whiteout" xattr set, inside a directory with the
+"overlay.opaque" xattr set to "x" (see `whiteouts and opaque directories`_).
+These alternative whiteouts are never created by overlayfs, but can be used by
+userspace tools (like containers) that generate lower layers.
+These alternative whiteouts can be escaped using the standard xattr escape
+mechanism in order to properly nest to any depth.
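+
+As an illustration (path illustrative; presence of the xattr is what
+matters, the value shown is arbitrary), a file prepared like this on
+the filesystem backing a lower overlayfs mount::
+
+    setfattr -n trusted.overlay.overlay.whiteout -v y lower/foo
+
+would be exposed through that overlayfs mount with the
+"trusted.overlay.whiteout" xattr, i.e. with one instance of the escape
+prefix removed, so a second overlayfs mount stacked on top can treat it
+as a whiteout (given the "x" opaque marking described above).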
Non-standard behavior
---------------------
@@ -454,20 +598,21 @@ filesystem.
This is the list of cases that overlayfs doesn't currently handle:
-a) POSIX mandates updating st_atime for reads. This is currently not
-done in the case when the file resides on a lower layer.
+ a) POSIX mandates updating st_atime for reads. This is currently not
+ done in the case when the file resides on a lower layer.
-b) If a file residing on a lower layer is opened for read-only and then
-memory mapped with MAP_SHARED, then subsequent changes to the file are not
-reflected in the memory mapping.
+ b) If a file residing on a lower layer is opened for read-only and then
+ memory mapped with MAP_SHARED, then subsequent changes to the file are not
+ reflected in the memory mapping.
-c) If a file residing on a lower layer is being executed, then opening that
-file for write or truncating the file will not be denied with ETXTBSY.
+ c) If a file residing on a lower layer is being executed, then opening that
+ file for write or truncating the file will not be denied with ETXTBSY.
The following options allow overlayfs to act more like a standards
compliant filesystem:
-1) "redirect_dir"
+redirect_dir
+````````````
Enabled with the mount option or module option: "redirect_dir=on" or with
the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.
@@ -475,7 +620,8 @@ the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.
If this feature is disabled, then rename(2) on a lower or merged directory
will fail with EXDEV ("Invalid cross-device link").
-2) "inode index"
+index
+`````
Enabled with the mount option or module option "index=on" or with the
kernel config option CONFIG_OVERLAY_FS_INDEX=y.
@@ -484,7 +630,8 @@ If this feature is disabled and a file with multiple hard links is copied
up, then this will "break" the link. Changes will not be propagated to
other names referring to the same inode.
-3) "xino"
+xino
+````
Enabled with the mount option "xino=auto" or "xino=on", with the module
option "xino_auto=on" or with the kernel config option
@@ -511,7 +658,7 @@ a crash or deadlock.
Offline changes, when the overlay is not mounted, are allowed to the
upper tree. Offline changes to the lower tree are only allowed if the
-"metadata only copy up", "inode index", "xino" and "redirect_dir" features
+"metacopy", "index", "xino" and "redirect_dir" features
have not been used. If the lower tree is modified and any of these
features has been used, the behavior of the overlay is undefined,
though it will not result in a crash or deadlock.
@@ -551,12 +698,13 @@ directory inode.
When encoding a file handle from an overlay filesystem object, the
following rules apply:
-1. For a non-upper object, encode a lower file handle from lower inode
-2. For an indexed object, encode a lower file handle from copy_up origin
-3. For a pure-upper object and for an existing non-indexed upper object,
- encode an upper file handle from upper inode
+ 1. For a non-upper object, encode a lower file handle from lower inode
+ 2. For an indexed object, encode a lower file handle from copy_up origin
+ 3. For a pure-upper object and for an existing non-indexed upper object,
+ encode an upper file handle from upper inode
The encoded overlay file handle includes:
+
- Header including path type information (e.g. lower/upper)
- UUID of the underlying filesystem
- Underlying filesystem encoding of underlying inode
@@ -566,15 +714,15 @@ are stored in extended attribute "trusted.overlay.origin".
When decoding an overlay file handle, the following steps are followed:
-1. Find underlying layer by UUID and path type information.
-2. Decode the underlying filesystem file handle to underlying dentry.
-3. For a lower file handle, lookup the handle in index directory by name.
-4. If a whiteout is found in index, return ESTALE. This represents an
- overlay object that was deleted after its file handle was encoded.
-5. For a non-directory, instantiate a disconnected overlay dentry from the
- decoded underlying dentry, the path type and index inode, if found.
-6. For a directory, use the connected underlying decoded dentry, path type
- and index, to lookup a connected overlay dentry.
+ 1. Find underlying layer by UUID and path type information.
+ 2. Decode the underlying filesystem file handle to underlying dentry.
+ 3. For a lower file handle, lookup the handle in index directory by name.
+ 4. If a whiteout is found in index, return ESTALE. This represents an
+ overlay object that was deleted after its file handle was encoded.
+ 5. For a non-directory, instantiate a disconnected overlay dentry from the
+ decoded underlying dentry, the path type and index inode, if found.
+ 6. For a directory, use the connected underlying decoded dentry, path type
+ and index, to lookup a connected overlay dentry.
Decoding a non-directory file handle may return a disconnected dentry.
copy_up of that disconnected dentry will create an upper index entry with
@@ -610,6 +758,31 @@ can be useful in case the underlying disk is copied and the UUID of this copy
is changed. This is only applicable if all lower/upper/work directories are on
the same filesystem, otherwise it will fallback to normal behaviour.
+
+UUID and fsid
+-------------
+
+The UUID of the overlayfs instance itself and the fsid reported by statfs(2)
+are controlled by the "uuid" mount option, which supports these values:
+
+- "null":
+ UUID of overlayfs is null. fsid is taken from the uppermost filesystem.
+- "off":
+ UUID of overlayfs is null. fsid is taken from the uppermost filesystem.
+ UUID of underlying layers is ignored.
+- "on":
+ UUID of overlayfs is generated and used to report a unique fsid.
+ UUID is stored in xattr "trusted.overlay.uuid", making overlayfs fsid
+ unique and persistent. This option requires an overlayfs with an upper
+ filesystem that supports xattrs.
+- "auto": (default)
+ UUID is taken from xattr "trusted.overlay.uuid" if it exists.
+ Upgrade to "uuid=on" on the first mount of a new overlay filesystem that
+ meets the prerequisites.
+ Downgrade to "uuid=null" for existing overlay filesystems that were never
+ mounted with "uuid=on".
+
+
Volatile mount
--------------
@@ -621,20 +794,20 @@ without significant effort.
The advantage of mounting with the "volatile" option is that all forms of
sync calls to the upper filesystem are omitted.
-In order to avoid a giving a false sense of safety, the syncfs (and fsync)
+In order to avoid giving a false sense of safety, the syncfs (and fsync)
semantics of volatile mounts are slightly different than that of the rest of
VFS. If any writeback error occurs on the upperdir's filesystem after a
volatile mount takes place, all sync functions will return an error. Once this
condition is reached, the filesystem will not recover, and every subsequent sync
-call will return an error, even if the upperdir has not experience a new error
+call will return an error, even if the upperdir has not experienced a new error
since the last sync call.
When overlay is mounted with "volatile" option, the directory
"$workdir/work/incompat/volatile" is created. During next mount, overlay
checks for this directory and refuses to mount if present. This is a strong
-indicator that user should throw away upper and work directories and create
-fresh one. In very limited cases where the user knows that the system has
-not crashed and contents of upperdir are intact, The "volatile" directory
+indicator that the user should discard upper and work directories and create
+fresh ones. In very limited cases where the user knows that the system has
+not crashed and contents of upperdir are intact, the "volatile" directory
can be removed.
@@ -652,9 +825,9 @@ Testsuite
There's a testsuite originally developed by David Howells and currently
maintained by Amir Goldstein at:
- https://github.com/amir73il/unionmount-testsuite.git
+https://github.com/amir73il/unionmount-testsuite.git
-Run as root:
+Run as root::
# cd unionmount-testsuite
# ./run --ov --verify
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 2b2df6aa5432..9ced1135608e 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -531,7 +531,7 @@ this retry process in the next article.
Automount points are locations in the filesystem where an attempt to
lookup a name can trigger changes to how that lookup should be
handled, in particular by mounting a filesystem there. These are
-covered in greater detail in autofs.txt in the Linux documentation
+covered in greater detail in autofs.rst in the Linux documentation
tree, but a few notes specifically related to path lookup are in order
here.
diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt
index 1aa7ce099f6f..d2cf2852e1f8 100644
--- a/Documentation/filesystems/path-lookup.txt
+++ b/Documentation/filesystems/path-lookup.txt
@@ -379,4 +379,4 @@ Papers and other documentation on dcache locking
2. http://lse.sourceforge.net/locking/dcache/dcache.html
-3. path-lookup.md in this directory.
+3. path-lookup.rst in this directory.
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index d2d684ae7798..3397937ed838 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -177,7 +177,7 @@ settles down a bit.
**mandatory**
s_export_op is now required for exporting a filesystem.
-isofs, ext2, ext3, resierfs, fat
+isofs, ext2, ext3, fat
can be used as examples of very different filesystems.
---
@@ -211,7 +211,7 @@ test and set for you.
e.g.::
inode = iget_locked(sb, ino);
- if (inode->i_state & I_NEW) {
+ if (inode_state_read_once(inode) & I_NEW) {
err = read_inode_from_disk(inode);
if (err < 0) {
iget_failed(inode);
@@ -313,7 +313,7 @@ done.
**mandatory**
-block truncatation on error exit from ->write_begin, and ->direct_IO
+block truncation on error exit from ->write_begin, and ->direct_IO
moved from generic methods (block_write_begin, cont_write_begin,
nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at
ext2_write_failed and callers for an example.
@@ -340,8 +340,8 @@ of those. Caller makes sure async writeback cannot be running for the inode whil
->drop_inode() returns int now; it's called on final iput() with
inode->i_lock held and it returns true if filesystems wants the inode to be
-dropped. As before, generic_drop_inode() is still the default and it's been
-updated appropriately. generic_delete_inode() is also alive and it consists
+dropped. As before, inode_generic_drop() is still the default and it's been
+updated appropriately. inode_just_drop() is also alive and it consists
simply of return 1. Note that all actual eviction work is done by caller after
->drop_inode() returns.
@@ -470,7 +470,7 @@ has been taken to VFS and filesystems need to provide a non-NULL
**mandatory**
If you implement your own ->llseek() you must handle SEEK_HOLE and
-SEEK_DATA. You can hanle this by returning -EINVAL, but it would be nicer to
+SEEK_DATA. You can handle this by returning -EINVAL, but it would be nicer to
support it in some way. The generic handler assumes that the entire file is
data and there is a virtual hole at the end of the file. So if the provided
offset is less than i_size and SEEK_DATA is specified, return the same offset.
@@ -517,7 +517,7 @@ The witch is dead! Well, 2/3 of it, anyway. ->d_revalidate() and
->create() doesn't take ``struct nameidata *``; unlike the previous
two, it gets "is it an O_EXCL or equivalent?" boolean argument. Note that
-local filesystems can ignore tha argument - they are guaranteed that the
+local filesystems can ignore this argument - they are guaranteed that the
object doesn't exist. It's remote/distributed ones that might care...
---
@@ -537,7 +537,7 @@ vfs_readdir() is gone; switch to iterate_dir() instead
**mandatory**
-->readdir() is gone now; switch to ->iterate()
+->readdir() is gone now; switch to ->iterate_shared()
**mandatory**
@@ -693,24 +693,19 @@ parallel now.
---
-**recommended**
+**mandatory**
-->iterate_shared() is added; it's a parallel variant of ->iterate().
+->iterate_shared() is added.
Exclusion on struct file level is still provided (as well as that
between it and lseek on the same struct file), but if your directory
has been opened several times, you can get these called in parallel.
Exclusion between that method and all directory-modifying ones is
still provided, of course.
-Often enough ->iterate() can serve as ->iterate_shared() without any
-changes - it is a read-only operation, after all. If you have any
-per-inode or per-dentry in-core data structures modified by ->iterate(),
-you might need something to serialize the access to them. If you
-do dcache pre-seeding, you'll need to switch to d_alloc_parallel() for
-that; look for in-tree examples.
-
-Old method is only used if the new one is absent; eventually it will
-be removed. Switch while you still can; the old one won't stay.
+If you have any per-inode or per-dentry in-core data structures modified
+by ->iterate_shared(), you might need something to serialize the access
+to them. If you do dcache pre-seeding, you'll need to switch to
+d_alloc_parallel() for that; look for in-tree examples.
---
@@ -863,7 +858,7 @@ be misspelled d_alloc_anon().
**mandatory**
-[should've been added in 2016] stale comment in finish_open() nonwithstanding,
+[should've been added in 2016] stale comment in finish_open() notwithstanding,
failure exits in ->atomic_open() instances should *NOT* fput() the file,
no matter what. Everything is handled by the caller.
@@ -930,9 +925,9 @@ should be done by looking at FMODE_LSEEK in file->f_mode.
filldir_t (readdir callbacks) calling conventions have changed. Instead of
returning 0 or -E... it returns bool now. false means "no more" (as -E... used
to) and true - "keep going" (as 0 in old calling conventions). Rationale:
-callers never looked at specific -E... values anyway. ->iterate() and
-->iterate_shared() instance require no changes at all, all filldir_t ones in
-the tree converted.
+callers never looked at specific -E... values anyway. ->iterate_shared()
+instances require no changes at all, all filldir_t ones in the tree
+converted.
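+
+A minimal sketch of the new convention (the names are hypothetical)::
+
+	static bool my_filldir(struct dir_context *ctx, const char *name,
+			       int namelen, loff_t offset, u64 ino,
+			       unsigned int d_type)
+	{
+		if (my_buffer_full(ctx))	/* hypothetical condition */
+			return false;		/* "no more", as -E... used to mean */
+		/* ... emit the entry ... */
+		return true;			/* "keep going", as 0 used to mean */
+	}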
---
@@ -943,3 +938,399 @@ file pointer instead of struct dentry pointer. d_tmpfile() is similarly
changed to simplify callers. The passed file is in a non-open state and on
success must be opened before returning (e.g. by calling
finish_open_simple()).
+
+---
+
+**mandatory**
+
+Calling convention for ->huge_fault has changed. It now takes a page
+order instead of an enum page_entry_size, and it may be called without the
+mmap_lock held. All in-tree users have been audited and do not seem to
+depend on the mmap_lock being held, but out of tree users should verify
+for themselves. If they do need it, they can return VM_FAULT_RETRY to
+be called with the mmap_lock held.
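+
+A sketch of an instance under the new convention (the PTE-sized handler is
+an assumption of this example)::
+
+	static vm_fault_t myfs_huge_fault(struct vm_fault *vmf, unsigned int order)
+	{
+		/* only handle base-page faults, fall back for larger orders */
+		if (order != 0)
+			return VM_FAULT_FALLBACK;
+		return myfs_fault(vmf);		/* hypothetical */
+	}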
+
+---
+
+**mandatory**
+
+The order of opening block devices and matching or creating superblocks has
+changed.
+
+The old logic opened block devices first and then tried to find a
+suitable superblock to reuse based on the block device pointer.
+
+The new logic first tries to find a suitable superblock based on the device
+number, and opens the block device afterwards.
+
+Since opening block devices cannot happen under s_umount because of lock
+ordering requirements, s_umount is now dropped while opening block devices and
+reacquired before calling fill_super().
+
+In the old logic concurrent mounters would find the superblock on the list of
+superblocks for the filesystem type. Since the first opener of the block device
+would hold s_umount, they would wait until the superblock was either born or
+discarded due to initialization failure.
+
+Since the new logic drops s_umount, concurrent mounters could grab s_umount and
+would spin. Instead they are now made to wait using an explicit wait-wake
+mechanism without having to hold s_umount.
+
+---
+
+**mandatory**
+
+The holder of a block device is now the superblock.
+
+The holder of a block device used to be the file_system_type which wasn't
+particularly useful. It wasn't possible to go from block device to owning
+superblock without matching on the device pointer stored in the superblock.
+This mechanism would only work for a single device so the block layer couldn't
+find the owning superblock of any additional devices.
+
+In the old mechanism reusing or creating a superblock for a racing mount(2) and
+umount(2) relied on the file_system_type as the holder. This was severely
+underdocumented, however:
+
+(1) Any concurrent mounter that managed to grab an active reference on an
+ existing superblock was made to wait until the superblock either became
+ ready or until the superblock was removed from the list of superblocks of
+ the filesystem type. If the superblock is ready the caller would simple
+ reuse it.
+
+(2) If the mounter came after deactivate_locked_super() but before
+ the superblock had been removed from the list of superblocks of the
+ filesystem type the mounter would wait until the superblock was shut down,
+ reuse the block device and allocate a new superblock.
+
+(3) If the mounter came after deactivate_locked_super() and after
+ the superblock had been removed from the list of superblocks of the
+ filesystem type the mounter would reuse the block device and allocate a new
+ superblock (the bd_holder pointer may still be set to the filesystem type).
+
+Because the holder of the block device was the file_system_type, any concurrent
+mounter could open the block devices of any superblock of the same
+file_system_type without risking seeing EBUSY because the block device was
+still in use by another superblock.
+
+Making the superblock the owner of the block device changes this as the holder
+is now a unique superblock and thus block devices associated with it cannot be
+reused by concurrent mounters. So a concurrent mounter in (2) could suddenly
+see EBUSY when trying to open a block device whose holder was a different
+superblock.
+
+The new logic thus waits until the superblock and the devices are shut down in
+->kill_sb(). Removal of the superblock from the list of superblocks of the
+filesystem type is now moved to a later point when the devices are closed:
+
+(1) Any concurrent mounter managing to grab an active reference on an existing
+ superblock is made to wait until the superblock is either ready or until
+ the superblock and all devices are shut down in ->kill_sb(). If the
+ superblock is ready the caller will simply reuse it.
+
+(2) If the mounter comes after deactivate_locked_super() but before
+ the superblock has been removed from the list of superblocks of the
+ filesystem type the mounter is made to wait until the superblock and the
+ devices are shut down in ->kill_sb() and the superblock is removed from the
+ list of superblocks of the filesystem type. The mounter will allocate a new
+ superblock and grab ownership of the block device (the bd_holder pointer of
+ the block device will be set to the newly allocated superblock).
+
+(3) This case is now collapsed into (2) as the superblock is left on the list
+ of superblocks of the filesystem type until all devices are shut down in
+ ->kill_sb(). In other words, if the superblock isn't on the list of
+ superblocks of the filesystem type anymore, then it has given up ownership of
+ all associated block devices (the bd_holder pointer is NULL).
+
+As this is a VFS level change it has no practical consequences for filesystems
+other than that all of them must use one of the provided kill_litter_super(),
+kill_anon_super(), or kill_block_super() helpers.
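+
+For a filesystem on a block device this typically looks like::
+
+	static struct file_system_type myfs_fs_type = {
+		.owner		= THIS_MODULE,
+		.name		= "myfs",
+		.mount		= myfs_mount,
+		/* closes all block devices held by the superblock */
+		.kill_sb	= kill_block_super,
+	};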
+
+---
+
+**mandatory**
+
+Lock ordering has been changed so that s_umount ranks above open_mutex again.
+All places where s_umount was taken under open_mutex have been fixed up.
+
+---
+
+**mandatory**
+
+export_operations ->encode_fh() no longer has a default implementation to
+encode FILEID_INO32_GEN* file handles.
+Filesystems that used the default implementation may use the generic helper
+generic_encode_ino32_fh() explicitly.
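+
+A sketch for a filesystem that relied on the removed default (the
+fh_to_dentry helper is hypothetical)::
+
+	static const struct export_operations myfs_export_ops = {
+		/* keeps the old FILEID_INO32_GEN* encoding */
+		.encode_fh	= generic_encode_ino32_fh,
+		.fh_to_dentry	= myfs_fh_to_dentry,
+	};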
+
+---
+
+**mandatory**
+
+If ->rename() update of .. on cross-directory move needs an exclusion with
+directory modifications, do *not* lock the subdirectory in question in your
+->rename() - it's done by the caller now [that item should've been added in
+28eceeda130f "fs: Lock moved directories"].
+
+---
+
+**mandatory**
+
+On same-directory ->rename() the (tautological) update of .. is not protected
+by any locks; just don't do it if the old parent is the same as the new one.
+We really can't lock two subdirectories in same-directory rename - not without
+deadlocks.
+
+---
+
+**mandatory**
+
+lock_rename() and lock_rename_child() may fail in cross-directory case, if
+their arguments do not have a common ancestor. In that case ERR_PTR(-EXDEV)
+is returned, with no locks taken. In-tree users updated; out-of-tree ones
+would need to do so.
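+
+I.e., callers now need something along these lines::
+
+	trap = lock_rename(d1, d2);
+	if (IS_ERR(trap))
+		return PTR_ERR(trap);	/* -EXDEV, no locks are held */
+	/* ... do the work ... */
+	unlock_rename(d1, d2);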
+
+---
+
+**mandatory**
+
+The list of children anchored in the parent dentry got turned into an hlist now.
+Field names got changed (->d_children/->d_sib instead of ->d_subdirs/->d_child
+for anchor/entries resp.), so any affected places will be immediately caught
+by the compiler.
+
+---
+
+**mandatory**
+
+->d_delete() instances are now called for dentries with ->d_lock held
+and refcount equal to 0. They are not permitted to drop/regain ->d_lock.
+None of the in-tree instances did anything of that sort. Make sure yours do not...
+
+---
+
+**mandatory**
+
+->d_prune() instances are now called without ->d_lock held on the parent.
+->d_lock on dentry itself is still held; if you need per-parent exclusions (none
+of the in-tree instances did), use your own spinlock.
+
+->d_iput() and ->d_release() are called with victim dentry still in the
+list of parent's children. It is still unhashed, marked killed, etc., just not
+removed from parent's ->d_children yet.
+
+Anyone iterating through the list of children needs to be aware of the
+half-killed dentries that might be seen there; taking ->d_lock on those will
+see them negative, unhashed and with negative refcount, which means that most
+of the in-kernel users would've done the right thing anyway without any adjustment.
+
+---
+
+**recommended**
+
+Block device freezing and thawing have been moved to holder operations.
+
+Before this change, get_active_super() would only be able to find the
+superblock of the main block device, i.e., the one stored in sb->s_bdev. Block
+device freezing now works for any block device owned by a given superblock, not
+just the main block device. The get_active_super() helper and bd_fsfreeze_sb
+pointer are gone.
+
+---
+
+**mandatory**
+
+set_blocksize() takes an opened struct file instead of a struct block_device now
+and it *must* be opened exclusively.
+
+---
+
+**mandatory**
+
+->d_revalidate() gets two extra arguments - inode of parent directory and
+name our dentry is expected to have. Both are stable (dir is pinned in
+non-RCU case and will stay around during the call in RCU case, and name
+is guaranteed to stay unchanging). Your instance doesn't have to use
+either, but it often helps to avoid a lot of painful boilerplate.
+Note that while name->name is stable and NUL-terminated, it may (and
+often will) have name->name[name->len] equal to '/' rather than '\0' -
+in the normal case it points into the pathname being looked up.
+NOTE: if you need something like full path from the root of filesystem,
+you are still on your own - this assists with simple cases, but it's not
+magic.
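+
+A sketch of an instance using the new arguments (myfs names are
+hypothetical)::
+
+	static int myfs_d_revalidate(struct inode *dir, const struct qstr *name,
+				     struct dentry *dentry, unsigned int flags)
+	{
+		if (flags & LOOKUP_RCU)
+			return -ECHILD;	/* opt out of RCU mode if needed */
+		/* compare against name->name / name->len, not ->d_name */
+		return 1;
+	}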
+
+---
+
+**recommended**
+
+kern_path_locked() and user_path_locked() no longer return a negative
+dentry so this doesn't need to be checked. If the name cannot be found,
+ERR_PTR(-ENOENT) is returned.
+
+---
+
+**recommended**
+
+lookup_one_qstr_excl() is changed to return errors in more cases, so
+these conditions don't require explicit checks:
+
+ - if LOOKUP_CREATE is NOT given, then the dentry won't be negative,
+ ERR_PTR(-ENOENT) is returned instead
+ - if LOOKUP_EXCL IS given, then the dentry won't be positive,
+ ERR_PTR(-EEXIST) is returned instead
+
+LOOKUP_EXCL now means "target must not exist". It can be combined with
+LOOKUP_CREATE or LOOKUP_RENAME_TARGET.
+
+---
+
+**mandatory**
+
+invalidate_inodes() is gone; use evict_inodes() instead.
+
+---
+
+**mandatory**
+
+->mkdir() now returns a dentry. If the created inode is found to
+already be in cache and have a dentry (often IS_ROOT()), it will need to
+be spliced into the given name in place of the given dentry. That dentry
+now needs to be returned. If the original dentry is used, NULL should
+be returned. Any error should be returned with ERR_PTR().
+
+In general, filesystems which use d_instantiate_new() to install the new
+inode can safely return NULL. Filesystems which may not have an I_NEW inode
+should use d_drop(); d_splice_alias() and return the result of the latter.
+
+If a positive dentry cannot be returned for some reason, in-kernel
+clients such as cachefiles, nfsd, smb/server may not perform ideally but
+will fail-safe.
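+
+A sketch of the d_instantiate_new() case (the inode allocation helper is
+hypothetical)::
+
+	static struct dentry *myfs_mkdir(struct mnt_idmap *idmap,
+					 struct inode *dir,
+					 struct dentry *dentry, umode_t mode)
+	{
+		struct inode *inode = myfs_new_dir_inode(dir, mode);
+
+		if (IS_ERR(inode))
+			return ERR_CAST(inode);
+		d_instantiate_new(dentry, inode);
+		return NULL;	/* the original dentry was used */
+	}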
+
+---
+
+**mandatory**
+
+lookup_one(), lookup_one_unlocked(), lookup_one_positive_unlocked() now
+take a qstr instead of a name and len. These, not the "one_len"
+versions, should be used whenever accessing a filesystem from outside
+that filesystem, through a mount point - which will have a mnt_idmap.
+
+---
+
+**mandatory**
+
+Functions try_lookup_one_len(), lookup_one_len(),
+lookup_one_len_unlocked() and lookup_positive_unlocked() have been
+renamed to try_lookup_noperm(), lookup_noperm(),
+lookup_noperm_unlocked(), lookup_noperm_positive_unlocked(). They now
+take a qstr instead of separate name and length. QSTR() can be used
+when strlen() is needed for the length.
+
+These functions no longer do any permission checking - they previously
+checked that the caller has 'X' permission on the parent. They must
+ONLY be used internally by a filesystem on itself when it knows that
+permissions are irrelevant or in a context where permission checks have
+already been performed, such as after vfs_path_parent_lookup().
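+
+E.g. a lookup that used to be lookup_one_len(name, parent, strlen(name))
+becomes (sketch)::
+
+	struct dentry *child;
+
+	/* QSTR() computes the length with strlen() */
+	child = lookup_noperm(&QSTR(name), parent);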
+
+---
+
+**mandatory**
+
+d_hash_and_lookup() is no longer exported or available outside the VFS.
+Use try_lookup_noperm() instead. This adds name validation and takes
+arguments in the opposite order but is otherwise identical.
+
+Using try_lookup_noperm() will require linux/namei.h to be included.
+
+---
+
+**mandatory**
+
+Calling conventions for ->d_automount() have changed; we should *not* grab
+an extra reference to the new mount - it should be returned with refcount 1.
+
+---
+
+**mandatory**
+
+collect_mounts()/drop_collected_mounts()/iterate_mounts() are gone now.
+Replacement is collect_paths()/drop_collected_paths(), with no special
+iterator needed. Instead of a cloned mount tree, the new interface returns
+an array of struct path, one for each mount collect_mounts() would've
+created. These struct path point to locations in the caller's namespace
+that would be roots of the cloned mounts.
+
+---
+
+**mandatory**
+
+If your filesystem sets the default dentry_operations, use set_default_d_op()
+rather than manually setting sb->s_d_op.
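+
+I.e., something like the following in your fill_super (a sketch)::
+
+	static int myfs_fill_super(struct super_block *sb, struct fs_context *fc)
+	{
+		/* instead of the old sb->s_d_op = &myfs_dentry_ops; */
+		set_default_d_op(sb, &myfs_dentry_ops);
+		return 0;
+	}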
+
+---
+
+**mandatory**
+
+d_set_d_op() is no longer exported (or public, for that matter); _if_
+your filesystem really needed that, make use of d_splice_alias_ops()
+to have them set. Better yet, think hard whether you need different
+->d_op for different dentries - if not, just use set_default_d_op()
+at mount time and be done with that. Currently procfs is the only
+thing that really needs ->d_op varying between dentries.
+
+---
+
+**highly recommended**
+
+The file operations mmap() callback is deprecated in favour of
+mmap_prepare(). This passes a pointer to a vm_area_desc to the callback
+rather than a VMA, as the VMA at this stage is not yet valid.
+
+The vm_area_desc provides the minimum required information for a filesystem
+to initialise state upon memory mapping of a file-backed region, and output
+parameters for the filesystem to set this state.
+
+In nearly all cases, this is all that is required for a filesystem. However, if
+a filesystem needs to perform an operation such as pre-population of page tables,
+then that action can be specified in the vm_area_desc->action field, which can
+be configured using the mmap_action_*() helpers.
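+
+A minimal sketch of a conversion (myfs names are hypothetical)::
+
+	static int myfs_mmap_prepare(struct vm_area_desc *desc)
+	{
+		/* the decisions the old ->mmap() made, before the VMA exists */
+		desc->vm_flags |= VM_DONTEXPAND;
+		desc->vm_ops = &myfs_vm_ops;
+		return 0;
+	}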
+
+---
+
+**mandatory**
+
+Several functions are renamed:
+
+- kern_path_locked -> start_removing_path
+- kern_path_create -> start_creating_path
+- user_path_create -> start_creating_user_path
+- user_path_locked_at -> start_removing_user_path_at
+- done_path_create -> end_creating_path
+
+---
+
+**mandatory**
+
+Calling conventions for vfs_parse_fs_string() have changed; it does *not*
+take length anymore (value ? strlen(value) : 0 is used). If you want
+a different length, use
+
+ vfs_parse_fs_qstr(fc, key, &QSTR_LEN(value, len))
+
+instead.
+
+---
+
+**mandatory**
+
+vfs_mkdir() now returns a dentry - the one returned by ->mkdir(). If
+that dentry is different from the dentry passed in, including if it is
+an IS_ERR() dentry pointer, the original dentry is dput().
+
+When vfs_mkdir() returns an error - and so has both dput() the original
+dentry and failed to provide a replacement - it also unlocks the parent.
+Consequently the return value from vfs_mkdir() can be passed to
+end_creating() and the parent will be unlocked precisely when necessary.
+
+---
+
+**mandatory**
+
+kill_litter_super() is gone; convert to DCACHE_PERSISTENT use (as all
+in-tree filesystems have done).
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 7897a7dafcbc..8256e857e2d7 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -48,6 +48,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
3.11 /proc/<pid>/patch_state - Livepatch patch operation state
3.12 /proc/<pid>/arch_status - Task architecture specific information
3.13 /proc/<pid>/fd - List of symlinks to open files
+ 3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status.
4 Configuring procfs
4.1 Mount options
@@ -60,19 +61,6 @@ Preface
0.1 Introduction/Credits
------------------------
-This documentation is part of a soon (or so we hope) to be released book on
-the SuSE Linux distribution. As there is no complete documentation for the
-/proc file system and we've used many freely available sources to write these
-chapters, it seems only fair to give the work back to the Linux community.
-This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm
-afraid it's still far from complete, but we hope it will be useful. As far as
-we know, it is the first 'all-in-one' document about the /proc file system. It
-is focused on the Intel x86 hardware, so if you are looking for PPC, ARM,
-SPARC, AXP, etc., features, you probably won't find what you are looking for.
-It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But
-additions and patches are welcome and will be added to this document if you
-mail them to Bodo.
-
We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of
other people for help compiling this documentation. We'd also like to extend a
special thank you to Andi Kleen for documentation, which we relied on heavily
@@ -80,17 +68,9 @@ to create this document, as well as the additional information he provided.
Thanks to everybody else who contributed source or docs to the Linux kernel
and helped create a great piece of software... :)
-If you have any comments, corrections or additions, please don't hesitate to
-contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this
-document.
-
The latest version of this document is available online at
https://www.kernel.org/doc/html/latest/filesystems/proc.html
-If the above direction does not works for you, you could try the kernel
-mailing list at linux-kernel@vger.kernel.org and/or try to reach me at
-comandante@zaralinux.com.
-
0.2 Legal Stuff
---------------
@@ -127,6 +107,16 @@ process running on the system, which is named after the process ID (PID).
The link 'self' points to the process reading the file system. Each process
subdirectory has the entries listed in Table 1-1.
+A process can read its own information from /proc/PID/* with no extra
+permissions. When reading /proc/PID/* information for other processes, the
+reading process is required to have either the CAP_SYS_PTRACE capability with
+PTRACE_MODE_READ access permissions, or, alternatively, the CAP_PERFMON
+capability. This applies to all read-only information like `maps`, `environ`,
+`pagemap`, etc. The only exception is the `mem` file, which, due to its
+read-write nature, requires the CAP_SYS_PTRACE capability with the more
+elevated PTRACE_MODE_ATTACH permissions; the CAP_PERFMON capability does not
+grant access to /proc/PID/mem for other processes.
+
Note that an open file descriptor to /proc/<pid> or to any of its
contained files or subdirectories does not prevent <pid> being reused
for some other process in the event that <pid> exits. Operations on
@@ -280,8 +270,9 @@ It's slow but very precise.
HugetlbPages size of hugetlb memory portions
CoreDumping process's memory is currently being dumped
(killing the process may lead to a corrupted core)
- THP_enabled process is allowed to use THP (returns 0 when
- PR_SET_THP_DISABLE is set on the process
+ THP_enabled process is allowed to use THP (returns 0 when
+ PR_SET_THP_DISABLE is set on the process to disable
+ THP completely, not just partially)
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
@@ -443,6 +434,15 @@ is not associated with a file:
or if empty, the mapping is anonymous.
+Starting with the 6.11 kernel, /proc/PID/maps provides an alternative
+ioctl()-based API that gives the ability to flexibly and efficiently query and
+filter individual VMAs. This interface is binary and is meant for more
+efficient and easy programmatic use. `struct procmap_query`, defined in the
+linux/fs.h UAPI header, serves as an input/output argument to the
+`PROCMAP_QUERY` ioctl() command. See the comments in the linux/fs.h UAPI
+header for details on query semantics, supported flags, data returned, and
+general API usage information.
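+
+For example, a minimal query for the VMA covering (or, failing that,
+following) a given address might look like::
+
+    struct procmap_query q = {};
+
+    q.size = sizeof(q);
+    q.query_addr = (__u64)(uintptr_t)addr;
+    /* match the covering VMA, or the next one after query_addr */
+    q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
+    if (ioctl(maps_fd, PROCMAP_QUERY, &q) == 0)    /* fd of /proc/<pid>/maps */
+        printf("%llx-%llx\n", (unsigned long long)q.vma_start,
+               (unsigned long long)q.vma_end);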
+
The /proc/PID/smaps is an extension based on maps, showing the memory
consumption for each of the process's mappings. For each mapping (aka Virtual
Memory Area, or VMA) there is a series of lines such as the following::
@@ -461,6 +461,7 @@ Memory Area, or VMA) there is a series of lines such as the following::
Private_Dirty: 0 kB
Referenced: 892 kB
Anonymous: 0 kB
+ KSM: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
@@ -474,14 +475,15 @@ Memory Area, or VMA) there is a series of lines such as the following::
THPeligible: 0
VmFlags: rd ex mr mw me dw
-The first of these lines shows the same information as is displayed for the
-mapping in /proc/PID/maps. Following lines show the size of the mapping
-(size); the size of each page allocated when backing a VMA (KernelPageSize),
-which is usually the same as the size in the page table entries; the page size
-used by the MMU when backing a VMA (in most cases, the same as KernelPageSize);
-the amount of the mapping that is currently resident in RAM (RSS); the
-process' proportional share of this mapping (PSS); and the number of clean and
-dirty shared and private pages in the mapping.
+The first of these lines shows the same information as is displayed for
+the mapping in /proc/PID/maps. Following lines show the size of the
+mapping (size); the size of each page allocated when backing a VMA
+(KernelPageSize), which is usually the same as the size in the page table
+entries; the page size used by the MMU when backing a VMA (in most cases,
+the same as KernelPageSize); the amount of the mapping that is currently
+resident in RAM (RSS); the process's proportional share of this mapping
+(PSS); and the number of clean and dirty shared and private pages in the
+mapping.
The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing it.
@@ -490,9 +492,25 @@ process, its PSS will be 1500. "Pss_Dirty" is the portion of PSS which
consists of dirty pages. ("Pss_Clean" is not included, but it can be
calculated by subtracting "Pss_Dirty" from "Pss".)
-Note that even a page which is part of a MAP_SHARED mapping, but has only
-a single pte mapped, i.e. is currently used by only one process, is accounted
-as private and not as shared.
+Traditionally, a page is accounted as "private" if it is mapped exactly once,
+and a page is accounted as "shared" when mapped multiple times, even when
+mapped in the same process multiple times. Note that this accounting is
+independent of MAP_SHARED.
+
+In some kernel configurations, the semantics of pages part of a larger
+allocation (e.g., THP) can differ: a page is accounted as "private" if all
+pages part of the corresponding large allocation are *certainly* mapped in the
+same process, even if the page is mapped multiple times in that process. A
+page is accounted as "shared" if any page of the larger allocation
+is *maybe* mapped in a different process. In some cases, a large allocation
+might be treated as "maybe mapped by multiple processes" even though this
+is no longer the case.
+
+Some kernel configurations do not track the precise number of times a page part
+of a larger allocation is mapped. In this case, when calculating the PSS, the
+average number of mappings per page in this larger allocation might be used
+as an approximation for the number of mappings of a page. The PSS calculation
+will be imprecise in this case.
"Referenced" indicates the amount of memory currently marked as referenced or
accessed.
@@ -501,18 +519,21 @@ accessed.
a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
+"KSM" reports how many of the pages are KSM pages. Note that KSM-placed zeropages
+are not included, only actual KSM pages.
+
"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
The memory isn't freed immediately with madvise(). It's freed in memory
pressure if the memory is clean. Please note that the printed value might
be lower than the real value due to optimizations used in the current
implementation. If this is not desirable please file a bug report.
-"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
+"AnonHugePages" shows the amount of memory backed by transparent hugepage.
-"ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
+"ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by
huge pages.
-"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by
+"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by
hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
@@ -524,15 +545,15 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.
-"THPeligible" indicates whether the mapping is eligible for allocating THP
-pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
-It just shows the current status.
+"THPeligible" indicates whether the mapping is eligible for allocating
+naturally aligned THP pages of any currently enabled size. 1 if true, 0
+otherwise.
"VmFlags" field deserves a separate description. This member represents the
kernel flags associated with the particular virtual memory area in two letter
encoded manner. The codes are the following:
- == =======================================
+ == =============================================================
rd readable
wr writeable
ex executable
@@ -543,7 +564,6 @@ encoded manner. The codes are the following:
ms may share
gd stack segment growns down
pf pure PFN range
- dw disabled write to the mapped file
lo pages are locked in memory
io memory mapped I/O area
sr sequential read advise provided
@@ -561,12 +581,18 @@ encoded manner. The codes are the following:
mm mixed map area
hg huge page advise flag
nh no huge page advise flag
- mg mergable advise flag
+ mg mergeable advise flag
bt arm64 BTI guarded page
mt arm64 MTE allocation tags are enabled
um userfaultfd missing tracking
uw userfaultfd wr-protect tracking
- == =======================================
+ ui userfaultfd minor fault
+ ss shadow/guarded control stack page
+ sl sealed
+ lf lock on fault pages
+ dp always lazily freeable mapping
+ gu maybe contains guard regions (if not set, definitely doesn't)
+ == =============================================================
Note that there is no guarantee that every flag and associated mnemonic will
be present in all further kernel releases. Things get changed, the flags may
@@ -669,6 +695,11 @@ Where:
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
size, in KB, that is backing the mapping up.
+Note that some kernel configurations do not track the precise number of times
+a page part of a larger allocation (e.g., THP) is mapped. In these
+configurations, "mapmax" might corresponds to the average number of mappings
+per page in such a larger allocation instead.
+
1.2 Kernel data
---------------
@@ -683,10 +714,17 @@ files are there, and which are missing.
============ ===============================================================
File Content
============ ===============================================================
+ allocinfo Memory allocations profiling information
apm Advanced power management info
+ bootconfig Kernel command line obtained from boot config,
+ and, if there were kernel parameters from the
+ boot loader, a "# Parameters from bootloader:"
+ line followed by a line containing those
+ parameters prefixed by "# ". (5.5)
buddyinfo Kernel memory allocator information (see text) (2.5)
bus Directory containing bus specific information
- cmdline Kernel command line
+ cmdline Kernel command line, both from bootloader and embedded
+ in the kernel image
cpuinfo Info about the CPU
devices Available devices (block and character)
dma Used DMS channels
@@ -942,6 +980,48 @@ also be allocatable although a lot of filesystem metadata may have to be
reclaimed to achieve this.
+allocinfo
+~~~~~~~~~
+
+Provides information about memory allocations at all locations in the code
+base. Each allocation in the code is identified by its source file, line
+number, module (if it originates from a loadable module) and the function calling
+the allocation. The number of bytes allocated and number of calls at each
+location are reported. The first line indicates the version of the file, the
+second line is the header listing fields in the file.
+If the file version is 2.0 or higher then each line may contain additional
+<key>:<value> pairs representing extra information about the call site.
+For example, if the counters are not accurate, the line will be appended with
+an "accurate:no" pair.
+
+Supported markers in v2:
+accurate:no
+
+ Absolute values of the counters in this line are not accurate
+ because of the failure to allocate memory to track some of the
+ allocations made at this location. Deltas in these counters are
+ accurate, therefore counters can be used to track allocation size
+ and count changes.
+
+Example output.
+
+::
+
+ > tail -n +3 /proc/allocinfo | sort -rn
+ 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext
+ 56373248 4737 mm/slub.c:2259 func:alloc_slab_page
+ 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded
+ 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash
+ 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
+ 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio
+ 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node
+ 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
+ 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
+ 3940352 962 mm/memory.c:4214 func:alloc_anon_folio
+ 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
+ ...
+
+
meminfo
~~~~~~~
@@ -1007,6 +1087,8 @@ Example output. You may not have all of these fields.
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
+ Unaccepted: 0 kB
+ Balloon: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
@@ -1079,9 +1161,15 @@ Dirty
Writeback
Memory which is actively being written back to the disk
AnonPages
- Non-file backed pages mapped into userspace page tables
+ Non-file backed pages mapped into userspace page tables. Note that
+ some kernel configurations might consider all pages part of a
+ larger allocation (e.g., THP) as "mapped", as soon as a single
+ page is mapped.
Mapped
- files which have been mmaped, such as libraries
+ files which have been mmapped, such as libraries. Note that some
+ kernel configurations might consider all pages part of a larger
+ allocation (e.g., THP) as "mapped", as soon as a single page is
+ mapped.
Shmem
Total memory used by shared memory (shmem) and tmpfs
KReclaimable
@@ -1099,15 +1187,17 @@ KernelStack
PageTables
Memory consumed by userspace page tables
SecPageTables
- Memory consumed by secondary page tables, this currently
- currently includes KVM mmu allocations on x86 and arm64.
+ Memory consumed by secondary page tables; this currently includes
+ KVM mmu and IOMMU allocations on x86 and arm64.
NFS_Unstable
- Always zero. Previous counted pages which had been written to
+ Always zero. Previously counted pages which had been written to
the server, but has not been committed to stable storage.
Bounce
- Memory used for block device "bounce buffers"
+ Always zero. Previously memory used for block device
+ "bounce buffers".
WritebackTmp
- Memory used by FUSE for temporary writeback buffers
+ Always zero. Previously memory used by FUSE for temporary
+ writeback buffers.
CommitLimit
Based on the overcommit ratio ('vm.overcommit_ratio'),
this is the total amount of memory currently available to
@@ -1175,6 +1265,10 @@ CmaTotal
Memory reserved for the Contiguous Memory Allocator (CMA)
CmaFree
Free remaining memory in the CMA reserves
+Unaccepted
+ Memory that has not been accepted by the guest
+Balloon
+ Memory returned to Host by VM Balloon Drivers
HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb
See Documentation/admin-guide/mm/hugetlbpage.rst.
DirectMap4k, DirectMap2M, DirectMap1G
@@ -1888,8 +1982,8 @@ For more information on mount propagation see:
These files provide a method to access a task's comm value. It also allows for
a task to set its own or one of its thread siblings comm value. The comm value
is limited in size compared to the cmdline value, so writing anything longer
-then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
-comm value.
+than the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
+terminator) will result in a truncated comm value.
3.7 /proc/<pid>/task/<tid>/children - Information about task children
@@ -2066,6 +2160,20 @@ DMA Buffer files
where 'size' is the size of the DMA buffer in bytes. 'count' is the file count of
the DMA buffer file. 'exp_name' is the name of the DMA buffer exporter.
+VFIO Device files
+~~~~~~~~~~~~~~~~~
+
+::
+
+ pos: 0
+ flags: 02000002
+ mnt_id: 17
+ ino: 5122
+ vfio-device-syspath: /sys/devices/pci0000:e0/0000:e0:01.1/0000:e1:00.0/0000:e2:05.0/0000:e8:00.0
+
+where 'vfio-device-syspath' is the sysfs path corresponding to the VFIO device
+file.
+
3.9 /proc/<pid>/map_files - Information about memory mapped files
---------------------------------------------------------------------
This directory contains symbolic links which represent memory mapped files
@@ -2181,6 +2289,74 @@ The number of open files for the process is stored in 'size' member
of stat() output for /proc/<pid>/fd for fast access.
-------------------------------------------------------
+3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
+----------------------------------------------------------------------
+When CONFIG_KSM is enabled, each process has this file, which displays
+information about its KSM merging status.
+
+Example
+~~~~~~~
+
+::
+
+ / # cat /proc/self/ksm_stat
+ ksm_rmap_items 0
+ ksm_zero_pages 0
+ ksm_merging_pages 0
+ ksm_process_profit 0
+ ksm_merge_any: no
+ ksm_mergeable: no
+
+Description
+~~~~~~~~~~~
+
+ksm_rmap_items
+^^^^^^^^^^^^^^
+
+The number of ksm_rmap_item structures in use. The structure
+ksm_rmap_item stores the reverse mapping information for virtual
+addresses. KSM will generate a ksm_rmap_item for each ksm-scanned page of
+the process.
+
+ksm_zero_pages
+^^^^^^^^^^^^^^
+
+When /sys/kernel/mm/ksm/use_zero_pages is enabled, it represents how many
+empty pages are merged with kernel zero pages by KSM.
+
+ksm_merging_pages
+^^^^^^^^^^^^^^^^^
+
+It represents how many pages of this process are involved in KSM merging
+(not including ksm_zero_pages). It is the same as what
+/proc/<pid>/ksm_merging_pages shows.
+
+ksm_process_profit
+^^^^^^^^^^^^^^^^^^
+
+The profit that KSM brings (saved bytes). KSM can save memory by merging
+identical pages, but it can also consume additional memory, because it needs
+to generate a number of rmap_items to save each scanned page's brief rmap
+information. Some of these pages may be merged, but some may turn out not to
+be mergeable even after being checked several times; such pages amount to
+unprofitable memory consumption.
+
+ksm_merge_any
+^^^^^^^^^^^^^
+
+It specifies whether the process's mm was added by prctl() into the
+candidate list of KSM, i.e. whether KSM scanning is fully enabled at the
+process level.
+
+ksm_mergeable
+^^^^^^^^^^^^^
+
+It specifies whether any VMAs of the process's mm are currently
+applicable to KSM.
+
+More information about KSM can be found in
+Documentation/admin-guide/mm/ksm.rst.
+
Chapter 4: Configuring procfs
=============================
@@ -2194,6 +2370,7 @@ The following mount options are supported:
hidepid= Set /proc/<pid>/ access mode.
gid= Set the group authorized to learn processes information.
subset= Show only the specified subset of procfs.
+ pidns= Specify the pid namespace used by this procfs.
========= ========================================================
hidepid=off or hidepid=0 means classic mode - everybody may access all
@@ -2210,7 +2387,7 @@ arguments are now protected against local eavesdroppers.
hidepid=invisible or hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be
fully invisible to other users. It doesn't mean that it hides a fact whether a
process with a specific pid value exists (it can be learned by other means, e.g.
-by "kill -0 $PID"), but it hides process' uid and gid, which may be learned by
+by "kill -0 $PID"), but it hides process's uid and gid, which may be learned by
stat()'ing /proc/<pid>/ otherwise. It greatly complicates an intruder's task of
gathering information about running processes, whether some daemon runs with
elevated privileges, whether other user runs some sensitive program, whether
@@ -2226,10 +2403,17 @@ information about processes information, just add identd to this group.
subset=pid hides all top level files and directories in the procfs that
are not related to tasks.
+pidns= specifies a pid namespace (either as a string path to something like
+`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
+will be used by the procfs instance when translating pids. By default, procfs
+will use the calling process's active pid namespace. Note that the pid
+namespace of an existing procfs instance cannot be modified (attempting to do
+so will give an `-EBUSY` error).
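+
+For example, with the new mount API (error handling omitted; assumes a libc
+that wraps fsopen(2) and friends)::
+
+    int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+
+    fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/proc/1/ns/pid", 0);
+    fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+    int mfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+    move_mount(mfd, "", AT_FDCWD, "/mnt/proc-init", MOVE_MOUNT_F_EMPTY_PATH);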
+
Chapter 5: Filesystem behavior
==============================
-Originally, before the advent of pid namepsace, procfs was a global file
+Originally, before the advent of pid namespace, procfs was a global file
system. It means that there was only one procfs instance in the system.
When pid namespace was added, a separate procfs instance was mounted in
diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt
new file mode 100644
index 000000000000..9a7eb96df300
--- /dev/null
+++ b/Documentation/filesystems/propagate_umount.txt
@@ -0,0 +1,484 @@
+ Notes on propagate_umount()
+
+Umount propagation starts with a set of mounts we are already going to
+take out. Ideally, we would like to add all downstream cognates to
+that set - anything with the same mountpoint as one of the removed
+mounts and with a parent that would receive events from the parent of that
+mount. However, there are some constraints the resulting set must
+satisfy.
+
+It is convenient to define several properties of sets of mounts:
+
+1) A set S of mounts is non-shifting if for any mount X belonging
+to S all subtrees mounted strictly inside of X (i.e. not overmounting
+the root of X) contain only elements of S.
+
+2) A set S is non-revealing if all locked mounts that belong to S have
+parents that also belong to S.
+
+3) A set S is closed if it contains all children of its elements.
+
+The set of mounts taken out by umount(2) must be non-shifting and
+non-revealing; the first constraint is what allows us to reparent
+any remaining mounts and the second is what prevents the exposure
+of any concealed mountpoints.
+
+propagate_umount() takes the original set as an argument and tries to
+extend that set. The original set is a full subtree and its root is
+unlocked; what matters is that it's closed and non-revealing.
+The resulting set may not be closed; there might still be mounts outside
+of that set, but only on top of stacks of root-overmounting elements
+of the set. They can be reparented to the place where the bottom of
+the stack is attached to a mount that will survive. NOTE: doing that
+will violate a constraint on having no more than one mount with
+the same parent/mountpoint pair; however, the caller (umount_tree())
+will immediately remedy that - it may keep unmounted element attached
+to parent, but only if the parent itself is unmounted. Since all
+conflicts created by reparenting have common parent *not* in the
+set and one side of the conflict (bottom of the stack of overmounts)
+is in the set, it will be resolved. However, we rely upon umount_tree()
+doing that pretty much immediately after the call of propagate_umount().
+
+The algorithm is based on two statements:
+ 1) for any set S, there is a maximal non-shifting subset of S
+and it can be calculated in O(#S) time.
+ 2) for any non-shifting set S, there is a maximal non-revealing
+subset of S. That subset is also non-shifting and it can be calculated
+in O(#S) time.
+
+ Finding candidates.
+
+We are given a closed set U and we want to find all mounts that have
+the same mountpoint as some mount m in U *and* whose parent receives
+propagation from the parent of the same mount m. Naive implementation
+would be
+ S = {}
+ for each m in U
+ add m to S
+ p = parent(m)
+ for each q in Propagation(p) - {p}
+ child = look_up(q, mountpoint(m))
+ if child
+ add child to S
+but that can lead to excessive work - there might be propagation among the
+subtrees of U, in which case we'd end up examining the same candidates
+many times. Since propagation is transitive, the same will happen to
+everything downstream of that candidate and it's not hard to construct
+cases where the approach above leads to the time quadratic by the actual
+number of candidates.
+
+Note that if we run into a candidate we'd already seen, it must've been
+added on an earlier iteration of the outer loop - all additions made
+during one iteration of the outer loop have different parents. So
+if we find a child already added to the set, we know that everything
+in Propagation(parent(child)) with the same mountpoint has been already
+added.
+ S = {}
+ for each m in U
+ if m in S
+ continue
+ add m to S
+ p = parent(m)
+ q = propagation_next(p, p)
+ while q
+ child = look_up(q, mountpoint(m))
+ if child
+ if child in S
+ q = skip_them(q, p)
+ continue
+ add child to S
+ q = propagation_next(q, p)
+where
+skip_them(q, p)
+ keep walking Propagation(p) from q until we find something
+ not in Propagation(q)
+
+would get rid of that problem, but we need a sane implementation of
+skip_them(). That's not hard to do - split propagation_next() into
+"down into mnt_slave_list" and "forward-and-up" parts, with the
+skip_them() being "repeat the forward-and-up part until we get NULL
+or something that isn't a peer of the one we are skipping".
+
+Note that there can be no absolute roots among the extra candidates -
+they all come from mount lookups. Absolute root among the original
+set is _currently_ impossible, but it might be worth protecting
+against.
+
+ Maximal non-shifting subsets.
+
+Let's call a mount m in a set S forbidden in that set if there is a
+subtree mounted strictly inside m and containing mounts that do not
+belong to S.
+
+The set is non-shifting when none of its elements are forbidden in it.
+
+If mount m is forbidden in a set S, it is forbidden in any subset S' it
+belongs to. In other words, it can't belong to any of the non-shifting
+subsets of S. If we had a way to find a forbidden mount or show that
+there's none, we could use it to find the maximal non-shifting subset
+simply by finding and removing them until none remain.
+
+Suppose mount m is forbidden in S; then any mounts forbidden in S - {m}
+must have been forbidden in S itself. Indeed, since m has descendants
+that do not belong to S, any subtree that fits into S will fit into
+S - {m} as well.
+
+So in principle we could go through elements of S, checking if they
+are forbidden in S and removing the ones that are. Removals will
+not invalidate the checks done for earlier mounts - if they were not
+forbidden at the time we checked, they won't become forbidden later.
+It's too costly to be practical, but there is a similar approach that
+is linear by size of S.
+
+Let's say that mount x in a set S is forbidden by mount y, if
+ * both x and y belong to S.
+ * there is a chain of mounts starting at x and leaving S
+ immediately after passing through y, with the first
+ mountpoint strictly inside x.
+Note 1: x may be equal to y - that's the case when something not
+belonging to S is mounted strictly inside x.
+Note 2: if y does not belong to S, it can't forbid anything in S.
+Note 3: if y has no children outside of S, it can't forbid anything in S.
+
+It's easy to show that mount x is forbidden in S if and only if x is
+forbidden in S by some mount y. And it's easy to find all mounts in S
+forbidden by a given mount.
+
+Consider the following operation:
+ Trim(S, m) = S - {x : x is forbidden by m in S}
+
+Note that if m does not belong to S or has no children outside of S we
+are guaranteed that Trim(S, m) is equal to S.
+
+The following is true: if x is forbidden by y in Trim(S, m), it was
+already forbidden by y in S.
+
+Proof: Suppose x is forbidden by y in Trim(S, m). Then there is a
+chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1}
+is the first element that doesn't belong to Trim(S, m) and the
+mountpoint of x_1 is strictly inside x. If mount r belongs to S, it must
+have been removed by Trim(S, m), i.e. it was forbidden in S by m.
+Then there was a mount chain from r to some child of m that stayed in
+S all the way until m, but that's impossible since x belongs to Trim(S, m)
+and prepending (x_0, ..., x_k) to that chain demonstrates that x is also
+forbidden in S by m, and thus can't belong to Trim(S, m).
+Therefore r can not belong to S and our chain demonstrates that
+x is forbidden by y in S. QED.
+
+Corollary: no mount is forbidden by m in Trim(S, m). Indeed, any
+such mount would have been forbidden by m in S and thus would have been
+in the part of S removed in Trim(S, m).
+
+Corollary: no mount is forbidden by m in Trim(Trim(S, m), n). Indeed,
+any such would have to have been forbidden by m in Trim(S, m), which
+is impossible.
+
+Corollary: after
+ S = Trim(S, x_1)
+ S = Trim(S, x_2)
+ ...
+ S = Trim(S, x_k)
+no mount remaining in S will be forbidden by either of x_1,...,x_k.
+
+The following will reduce S to its maximal non-shifting subset:
+ visited = {}
+ while S contains elements not belonging to visited
+ let m be an arbitrary such element of S
+ S = Trim(S, m)
+ add m to visited
+
+S never grows, so the number of elements of S not belonging to visited
+decreases at least by one on each iteration. When the loop terminates,
+all mounts remaining in S belong to visited. It's easy to see that at
+the beginning of each iteration no mount remaining in S will be forbidden
+by any element of visited. In other words, no mount remaining in S will
+be forbidden, i.e. final value of S will be non-shifting. It will be
+the maximal non-shifting subset, since we were removing only forbidden
+elements.
+
+ There are two difficulties in implementing the above in linear
+time, both due to the fact that Trim() might need to remove more than one
+element. A naive implementation of Trim() is vulnerable to running into a
+long chain of mounts, each mounted on top of parent's root. Nothing in
+that chain is forbidden, so nothing gets removed from it. We need to
+recognize such chains and avoid walking them again on subsequent calls of
+Trim(), otherwise we will end up with worst-case time being quadratic by
+the number of elements in S. Another difficulty is in implementing the
+outer loop - we need to iterate through all elements of a shrinking set.
+That would be trivial if we never removed more than one element at a time
+(linked list, with list_for_each_entry_safe for iterator), but we may
+need to remove more than one entry, possibly including the ones we have
+already visited.
+
+	Let's start with a naive algorithm for Trim():
+
+Trim_one(m)
+ found = false
+ for each n in children(m)
+ if n not in S
+ found = true
+ if (mountpoint(n) != root(m))
+ remove m from S
+ break
+ if found
+ Trim_ancestors(m)
+
+Trim_ancestors(m)
+ for (; parent(m) in S; m = parent(m)) {
+ if (mountpoint(m) != root(parent(m)))
+ remove parent(m) from S
+ }
+
+If m belongs to S, Trim_one(m) will replace S with Trim(S, m).
+Proof:
+ Consider the chains excluding elements from Trim(S, m). The last
+two elements in such chain are m and some child of m that does not belong
+to S. If m has no such children, Trim(S, m) is equal to S.
+ m itself is removed if and only if the chain has exactly two
+elements, i.e. when the last element does not overmount the root of m.
+In other words, that happens when m has a child not in S that does not
+overmount the root of m.
+ All other elements to remove will be ancestors of m, such that
+the entire descent chain from them to m is contained in S. Let
+(x_0, x_1, ..., x_k = m) be the longest such chain. x_i needs to be
+removed if and only if x_{i+1} does not overmount its root. It's easy
+to see that Trim_ancestors(m) will iterate through that chain from
+x_k to x_1 and that it will remove exactly the elements that need to be
+removed.
+
+ Note that if the loop in Trim_ancestors() walks into an already
+visited element, we are guaranteed that remaining iterations will see
+only elements that had already been visited and remove none of them.
+That's the weakness that makes it vulnerable to long chains of full
+overmounts.
+
+ It's easy to deal with, if we can afford setting marks on
+elements of S; we would mark all elements already visited by
+Trim_ancestors() and have it bail out as soon as it sees an already
+marked element.
+
+ The problems with iterating through the set can be dealt with in
+several ways, depending upon the representation we choose for our set.
+One useful observation is that we are given a closed subset in S - the
+original set passed to propagate_umount(). Its elements can neither
+forbid anything nor be forbidden by anything - all their descendants
+belong to S, so they can not occur anywhere in any excluding chain.
+In other words, the elements of that subset will remain in S until
+the end and Trim_one(m) is a no-op for all m from that subset.
+
+ That suggests keeping S as a disjoint union of a closed set U
+('will be unmounted, no matter what') and the set of all elements of
+S that do not belong to U. That set ('candidates') is all we need
+to iterate through. Let's represent it as a subset in a cyclic list,
+consisting of all list elements that are marked as candidates (initially -
+all of them). Then we could have Trim_ancestors() only remove the mark,
+leaving the elements on the list. Then Trim_one() would never remove
+anything other than its argument from the containing list, allowing
+us to use list_for_each_entry_safe() as the iterator.
+
+ Assuming that representation we get the following:
+
+ list_for_each_entry_safe(m, ..., Candidates, ...)
+ Trim_one(m)
+where
+Trim_one(m)
+ if (m is not marked as a candidate)
+ strip the "seen by Trim_ancestors" mark from m
+ remove m from the Candidates list
+ return
+
+ remove_this = false
+ found = false
+ for each n in children(m)
+ if n not in S
+ found = true
+ if (mountpoint(n) != root(m))
+ remove_this = true
+ break
+ if found
+ Trim_ancestors(m)
+ if remove_this
+ strip the "seen by Trim_ancestors" mark from m
+ strip the "candidate" mark from m
+		remove m from the Candidates list
+
+Trim_ancestors(m)
+	for (p = parent(m); p is marked as candidate; m = p, p = parent(p)) {
+ if m is marked as seen by Trim_ancestors
+ return
+ mark m as seen by Trim_ancestors
+ if (mountpoint(m) != root(p))
+ strip the "candidate" mark from p
+ }
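+
+	A C-flavoured sketch of the loop above, assuming the
+representation proposed at the end of this document (T_UMOUNT_CANDIDATE
+for "is a candidate", T_MARKED for "seen by Trim_ancestors", both kept
+in ->mnt_t_flags); an illustration only, not the actual fs/pnode.c code:
+
+	static void trim_ancestors(struct mount *m)
+	{
+		struct mount *p;
+
+		for (p = m->mnt_parent; p->mnt_t_flags & T_UMOUNT_CANDIDATE;
+		     m = p, p = p->mnt_parent) {
+			if (m->mnt_t_flags & T_MARKED)
+				return;
+			m->mnt_t_flags |= T_MARKED;
+			/* m does not overmount the root of its parent,
+			 * so p is forbidden - drop it from the candidates */
+			if (m->mnt_mountpoint != p->mnt.mnt_root)
+				p->mnt_t_flags &= ~T_UMOUNT_CANDIDATE;
+		}
+	}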
+
+ Terminating condition in the loop in Trim_ancestors() is correct,
+since that loop will never run into p belonging to U - p is always
+an ancestor of argument of Trim_one() and since U is closed, the argument
+of Trim_one() would also have to belong to U. But Trim_one() is never
+called for elements of U. In other words, p belongs to S if and only
+if it belongs to candidates.
+
+ Time complexity:
+* we get no more than O(#S) calls of Trim_one()
+* the loop over children in Trim_one() never looks at the same child
+twice through all the calls.
+* iterations of that loop for children in S are no more than O(#S)
+in the worst case
+* at most two children that are not elements of S are considered per
+call of Trim_one().
+* the loop in Trim_ancestors() sets its mark once per iteration and
+no element of S has it set more than once.
+
+ In the end we may have some elements excluded from S by
+Trim_ancestors() still stuck on the list. We could do a separate
+loop removing them from the list (also no worse than O(#S) time),
+but it's easier to leave that until the next phase - there we will
+iterate through the candidates anyway.
+
+ The caller has already removed all elements of U from their parents'
+lists of children, which means that checking if child belongs to S is
+equivalent to checking if it's marked as a candidate; we'll never see
+the elements of U in the loop over children in Trim_one().
+
+ What's more, if we see that children(m) is empty and m is not
+locked, we can immediately move m into the committed subset (remove
+from the parent's list of children, etc.). That's one fewer mount we'll
+have to look into when we check the list of children of its parent *and*
+when we get to building the non-revealing subset.
+
+ Maximal non-revealing subsets
+
+If S is not a non-revealing subset, there is a locked element x in S
+such that parent of x is not in S.
+
+Obviously, no non-revealing subset of S may contain x. Removing such
+elements one by one will obviously end with the maximal non-revealing
+subset (possibly empty one). Note that removal of an element will
+require removal of all its locked children, etc.
+
+If the set had been non-shifting, it will remain non-shifting after
+such removals.
+Proof: suppose S was non-shifting, x is a locked element of S, parent of x
+is not in S and S - {x} is not non-shifting. Then there is an element m
+in S - {x} and a subtree mounted strictly inside m, such that m contains
+an element not in S - {x}. Since S is non-shifting, everything in
+that subtree must belong to S. But that means that this subtree must
+contain x somewhere *and* that the parent of x either belongs to that subtree
+or is equal to m. Either way it must belong to S. Contradiction.
+
+// same representation as for finding maximal non-shifting subsets:
+// S is a disjoint union of a non-revealing set U (the ones we are committed
+// to unmount) and a set of candidates, represented as a subset of list
+// elements that have "is a candidate" mark on them.
+// Elements of U are removed from their parents' lists of children.
+// In the end candidates becomes empty and maximal non-revealing non-shifting
+// subset of S is now in U
+ while (Candidates list is non-empty)
+ handle_locked(first(Candidates))
+
+handle_locked(m)
+ if m is not marked as a candidate
+ strip the "seen by Trim_ancestors" mark from m
+ remove m from the list
+ return
+ cutoff = m
+ for (p = m; p in candidates; p = parent(p)) {
+ strip the "seen by Trim_ancestors" mark from p
+ strip the "candidate" mark from p
+ remove p from the Candidates list
+ if (!locked(p))
+ cutoff = parent(p)
+ }
+ if p in U
+ cutoff = p
+ while m != cutoff
+ remove m from children(parent(m))
+ add m to U
+ m = parent(m)
+
+Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S.
+* If it contains some elements of U, let x_k be the last one of those.
+Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing.
+* otherwise if all its elements are locked, then none of {x_0, ..., x_n}
+may be elements of a non-revealing subset of S.
+* otherwise let x_k be the first unlocked element of the chain. Then none
+of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of
+S and union of U and {x_k, ..., x_n} is non-revealing.
+
+handle_locked(m) finds which of these cases applies and adjusts Candidates
+and U accordingly. U remains non-revealing, union of Candidates and
+U still contains any non-revealing subset of S and after the call of
+handle_locked(m) m is guaranteed not to be on the Candidates list. So having
+it called for each element of S would suffice to empty Candidates,
+leaving U the maximal non-revealing subset of S.
+
+However, handle_locked(m) is a no-op when m belongs to U, so it's enough
+to have it called for elements of Candidates list until none remain.
+
+Time complexity: number of calls of handle_locked() is limited by
+#Candidates, each iteration of the first loop in handle_locked() removes
+an element from the list, so their total number of executions is also
+limited by #Candidates; number of iterations in the second loop is no
+greater than the number of iterations of the first loop.
+
+
+ Reparenting
+
+After we'd calculated the final set, we still need to deal with
+reparenting - if an element of the final set has a child not in it,
+we need to reparent such child.
+
+Such children can only be root-overmounting (otherwise the set wouldn't
+be non-shifting) and their parents can not belong to the original set,
+since the original is guaranteed to be closed.
+
+
+ Putting all of that together
+
+The plan is to
+ * find all candidates
+ * trim down to maximal non-shifting subset
+ * trim down to maximal non-revealing subset
+ * reparent anything that needs to be reparented
+ * return the resulting set to the caller
+
+For the 2nd and 3rd steps we want to separate the set into a growing
+non-revealing subset, initially containing the original set ("U" in
+terms of the pseudocode above) and everything we are still not sure about
+("candidates"). It means that for the output of the 1st step we'd like
+the extra candidates separated from the stuff already in the original set.
+For the 4th step we would like the additions to U separate from the
+original set.
+
+So let's go for
+ * original set ("set"). Linkage via mnt_list
+ * undecided candidates ("candidates"). Subset of a list,
+consisting of all its elements marked with a new flag (T_UMOUNT_CANDIDATE).
+Initially all elements of the list will be marked that way; in the
+end the list will become empty and no mounts will remain marked with
+that flag.
+	* Reuse T_MARKED for "has already been seen by trim_ancestors()".
+ * anything in U that hadn't been in the original set - elements of
+candidates will gradually be either discarded or moved there. In other
+words, it's the candidates we have already decided to unmount. Its role
+is reasonably close to the old "to_umount", so let's use that name.
+Linkage via mnt_list.
+
+For gather_candidates() we'll need to maintain both candidates (S -
+set) and intersection of S with set. Use T_UMOUNT_CANDIDATE for
+all elements we encounter, putting the ones not already in the original
+set into the list of candidates. When we are done, strip that flag from
+all elements of the original set. That gives a cheap way to check
+if an element belongs to S (in gather_candidates) and to candidates
+itself (at later stages). Call that predicate is_candidate(); it would
+be m->mnt_t_flags & T_UMOUNT_CANDIDATE.
+
+All elements of the original set are marked with MNT_UMOUNT and we'll
+need the same for elements added when joining the contents of to_umount
+to set in the end. Let's set MNT_UMOUNT at the time we add an element
+to to_umount; that's close to what the old 'umount_one' is doing, so
+let's keep that name. It also gives us another predicate we need -
+"belongs to union of set and to_umount"; will_be_unmounted() for now.
+
+Removals from the candidates list should strip both T_MARKED and
+T_UMOUNT_CANDIDATE; call it remove_from_candidates_list().
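+
+	As a sketch, the predicates and the helper described above might
+look like this (flag and helper names as proposed; the linkage details
+of the candidates list are assumed):
+
+	static inline bool is_candidate(struct mount *m)
+	{
+		return m->mnt_t_flags & T_UMOUNT_CANDIDATE;
+	}
+
+	static inline bool will_be_unmounted(struct mount *m)
+	{
+		/* set both for the original set and for to_umount */
+		return m->mnt.mnt_flags & MNT_UMOUNT;
+	}
+
+	static void remove_from_candidates_list(struct mount *m)
+	{
+		m->mnt_t_flags &= ~(T_MARKED | T_UMOUNT_CANDIDATE);
+		list_del_init(&m->mnt_list);
+	}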
diff --git a/Documentation/filesystems/qnx6.rst b/Documentation/filesystems/qnx6.rst
index 523b798f04e7..560f3d470422 100644
--- a/Documentation/filesystems/qnx6.rst
+++ b/Documentation/filesystems/qnx6.rst
@@ -135,7 +135,7 @@ inode.
Character and block special devices do not exist in QNX as those files
are handled by the QNX kernel/drivers and created in /dev independent of the
-underlaying filesystem.
+underlying filesystem.
Long filenames
--------------
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.rst b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
index 447f767c6462..a9d271e171c3 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.rst
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
@@ -290,11 +290,11 @@ Why cpio rather than tar?
This decision was made back in December, 2001. The discussion started here:
- http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html
+- https://lore.kernel.org/lkml/a03cke$640$1@cesium.transmeta.com/
And spawned a second thread (specifically on tar vs cpio), starting here:
- http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html
+- https://lore.kernel.org/lkml/3C25A06D.7030408@zytor.com/
The quick and dirty summary version (which is no substitute for reading
the above threads) is:
@@ -310,12 +310,12 @@ the above threads) is:
either way about the archive format, and there are alternative tools,
such as:
- http://freecode.com/projects/afio
+ https://linux.die.net/man/1/afio
2) The cpio archive format chosen by the kernel is simpler and cleaner (and
thus easier to create and parse) than any of the (literally dozens of)
various tar archive formats. The complete initramfs archive format is
- explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
+ explained in buffer-format.rst, created in usr/gen_init_cpio.c, and
extracted in init/initramfs.c. All three together come to less than 26k
total of human-readable text.
@@ -331,12 +331,12 @@ the above threads) is:
5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be
supported on the kernel side"):
- http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html
+ - https://lore.kernel.org/lkml/Pine.GSO.4.21.0112222109050.21702-100000@weyl.math.psu.edu/
explained his reasoning:
- - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
- - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
+ - https://lore.kernel.org/lkml/Pine.GSO.4.21.0112222240530.21702-100000@weyl.math.psu.edu/
+ - https://lore.kernel.org/lkml/Pine.GSO.4.21.0112230849550.23300-100000@weyl.math.psu.edu/
and, most importantly, designed and implemented the initramfs code.
diff --git a/Documentation/filesystems/relay.rst b/Documentation/filesystems/relay.rst
index 04ad083cfe62..301ff4c6e6c6 100644
--- a/Documentation/filesystems/relay.rst
+++ b/Documentation/filesystems/relay.rst
@@ -32,7 +32,7 @@ functions in the relay interface code - please see that for details.
Semantics
=========
-Each relay channel has one buffer per CPU, each buffer has one or more
+Each relay channel has one buffer per CPU; each buffer has one or more
sub-buffers. Messages are written to the first sub-buffer until it is
too full to contain a new message, in which case it is written to
the next (if available). Messages are never split across sub-buffers.
@@ -40,7 +40,7 @@ At this point, userspace can be notified so it empties the first
sub-buffer, while the kernel continues writing to the next.
When notified that a sub-buffer is full, the kernel knows how many
-bytes of it are padding i.e. unused space occurring because a complete
+bytes of it are padding, i.e., unused space occurring because a complete
message couldn't fit into a sub-buffer. Userspace can use this
knowledge to copy only valid data.
@@ -71,7 +71,7 @@ klog and relay-apps example code
================================
The relay interface itself is ready to use, but to make things easier,
-a couple simple utility functions and a set of examples are provided.
+a couple of simple utility functions and a set of examples are provided.
The relay-apps example tarball, available on the relay sourceforge
site, contains a set of self-contained examples, each consisting of a
@@ -91,7 +91,7 @@ registered will data actually be logged (see the klog and kleak
examples for details).
It is of course possible to use the relay interface from scratch,
-i.e. without using any of the relay-apps example code or klog, but
+i.e., without using any of the relay-apps example code or klog, but
you'll have to implement communication between userspace and kernel,
allowing both to convey the state of buffers (full, empty, amount of
padding). The read() interface both removes padding and internally
@@ -119,7 +119,7 @@ mmap() results in channel buffer being mapped into the caller's
must map the entire file, which is NRBUF * SUBBUFSIZE.
read() read the contents of a channel buffer. The bytes read are
- 'consumed' by the reader, i.e. they won't be available
+ 'consumed' by the reader, i.e., they won't be available
again to subsequent reads. If the channel is being used
in no-overwrite mode (the default), it can be read at any
time even if there's an active kernel writer. If the
@@ -138,7 +138,7 @@ poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
notified when sub-buffer boundaries are crossed.
close() decrements the channel buffer's refcount. When the refcount
- reaches 0, i.e. when no process or kernel client has the
+ reaches 0, i.e., when no process or kernel client has the
buffer open, the channel buffer is freed.
=========== ============================================================
@@ -149,7 +149,7 @@ host filesystem must be mounted. For example::
.. Note::
- the host filesystem doesn't need to be mounted for kernel
+ The host filesystem doesn't need to be mounted for kernel
clients to create or use channels - it only needs to be
mounted when user space applications need access to the buffer
data.
@@ -301,16 +301,6 @@ user-defined data with a channel, and is immediately available
(including in create_buf_file()) via chan->private_data or
buf->chan->private_data.
-Buffer-only channels
---------------------
-
-These channels have no files associated and can be created with
-relay_open(NULL, NULL, ...). Such channels are useful in scenarios such
-as when doing early tracing in the kernel, before the VFS is up. In these
-cases, one may open a buffer-only channel and then call
-relay_late_setup_files() when the kernel is ready to handle files,
-to expose the buffered data to the userspace.
-
Channel 'modes'
---------------
@@ -325,7 +315,7 @@ section, as it pertains mainly to mmap() implementations.
In 'overwrite' mode, also known as 'flight recorder' mode, writes
continuously cycle around the buffer and will never fail, but will
unconditionally overwrite old data regardless of whether it's actually
-been consumed. In no-overwrite mode, writes will fail, i.e. data will
+been consumed. In no-overwrite mode, writes will fail, i.e., data will
be lost, if the number of unconsumed sub-buffers equals the total
number of sub-buffers in the channel. It should be clear that if
there is no consumer or if the consumer can't consume sub-buffers fast
@@ -344,7 +334,7 @@ initialize the next sub-buffer if appropriate 2) finalize the previous
sub-buffer if appropriate and 3) return a boolean value indicating
whether or not to actually move on to the next sub-buffer.
-To implement 'no-overwrite' mode, the userspace client would provide
+To implement 'no-overwrite' mode, the userspace client provides
an implementation of the subbuf_start() callback something like the
following::
@@ -364,9 +354,9 @@ following::
return 1;
}
-If the current buffer is full, i.e. all sub-buffers remain unconsumed,
+If the current buffer is full, i.e., all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
-occur yet, i.e. until the consumer has had a chance to read the
+occur yet, i.e., until the consumer has had a chance to read the
current set of ready sub-buffers. For the relay_buf_full() function
to make sense, the consumer is responsible for notifying the relay
interface when sub-buffers have been consumed via
@@ -400,7 +390,7 @@ consulted.
The default subbuf_start() implementation, used if the client doesn't
define any callbacks, or doesn't define the subbuf_start() callback,
-implements the simplest possible 'no-overwrite' mode, i.e. it does
+implements the simplest possible 'no-overwrite' mode, i.e., it does
nothing but return 0.
Header information can be reserved at the beginning of each sub-buffer
@@ -467,7 +457,7 @@ rather than open and close a new channel for each use. relay_reset()
can be used for this purpose - it resets a channel to its initial
state without reallocating channel buffer memory or destroying
existing mappings. It should however only be called when it's safe to
-do so, i.e. when the channel isn't currently being written to.
+do so, i.e., when the channel isn't currently being written to.
Finally, there are a couple of utility callbacks that can be used for
different purposes. buf_mapped() is called whenever a channel buffer
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
new file mode 100644
index 000000000000..8c8ce678148a
--- /dev/null
+++ b/Documentation/filesystems/resctrl.rst
@@ -0,0 +1,1932 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=====================================================
+User Interface for Resource Control feature (resctrl)
+=====================================================
+
+:Copyright: |copy| 2016 Intel Corporation
+:Authors: - Fenghua Yu <fenghua.yu@intel.com>
+ - Tony Luck <tony.luck@intel.com>
+ - Vikas Shivappa <vikas.shivappa@intel.com>
+
+
+Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
+AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).
+
+This feature is enabled by the CONFIG_X86_CPU_RESCTRL Kconfig option and
+advertised via the x86 /proc/cpuinfo flag bits:
+
+=============================================================== ================================
+RDT (Resource Director Technology) Allocation "rdt_a"
+CAT (Cache Allocation Technology) "cat_l3", "cat_l2"
+CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2"
+CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc"
+MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
+MBA (Memory Bandwidth Allocation) "mba"
+SMBA (Slow Memory Bandwidth Allocation) ""
+BMEC (Bandwidth Monitoring Event Configuration) ""
+ABMC (Assignable Bandwidth Monitoring Counters) ""
+SDCIAE (Smart Data Cache Injection Allocation Enforcement) ""
+=============================================================== ================================
+
+Historically, new features were made visible by default in /proc/cpuinfo. This
+resulted in the feature flags becoming hard to parse by humans. Adding a new
+flag to /proc/cpuinfo should be avoided if user space can obtain information
+about the feature from resctrl's info directory.
+
+To use the feature, mount the file system::
+
+ # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
+
+mount options are:
+
+"cdp":
+ Enable code/data prioritization in L3 cache allocations.
+"cdpl2":
+ Enable code/data prioritization in L2 cache allocations.
+"mba_MBps":
+	Enable the MBA Software Controller (mba_sc) to specify MBA
+	bandwidth in MiBps.
+"debug":
+ Make debug files accessible. Available debug files are annotated with
+ "Available only with debug option".
+
+L2 and L3 CDP are controlled separately.
+
+RDT features are orthogonal. A particular system may support only
+monitoring, only control, or both monitoring and control. Cache
+pseudo-locking is a unique way of using cache control to "pin" or
+"lock" data in the cache. Details can be found in
+"Cache Pseudo-Locking".
+
+
+The mount succeeds if either allocation or monitoring is present, but
+only those files and directories supported by the system will be created.
+For more details on the behavior of the interface during monitoring
+and allocation, see the "Resource alloc and monitor groups" section.
+
+Info directory
+==============
+
+The 'info' directory contains information about the enabled
+resources. Each resource has its own subdirectory. The subdirectory
+names reflect the resource names.
+
+Most of the files in the resource's subdirectory are read-only, and
+describe properties of the resource. Resources that support global
+configuration options also include writable files that can be used
+to modify those settings.
+
+The cache resource (L3/L2) subdirectory contains the following files
+related to allocation:
+
+"num_closids":
+ The number of CLOSIDs which are valid for this
+ resource. The kernel uses the smallest number of
+	CLOSIDs of all enabled resources as the limit.
+"cbm_mask":
+ The bitmask which is valid for this resource.
+ This mask is equivalent to 100%.
+"min_cbm_bits":
+ The minimum number of consecutive bits which
+ must be set when writing a mask.
+
+"shareable_bits":
+	Bitmask of resource regions that are shareable with other
+	executing entities (e.g. I/O). Applies to all instances of this
+	resource. Users can use this when setting up exclusive cache
+	partitions. Note that some platforms support devices that have
+	their own settings for cache use, which can override these bits.
+
+	When "io_alloc" is enabled, a portion of each cache instance can
+	be configured for shared use between hardware and software.
+	Use "bit_usage" to see which portions of each cache instance
+	are configured for hardware use via the "io_alloc" feature,
+	because every cache instance can have its "io_alloc" bitmask
+	configured independently via "io_alloc_cbm".
+
+"bit_usage":
+ Annotated capacity bitmasks showing how all
+ instances of the resource are used. The legend is:
+
+ "0":
+ Corresponding region is unused. When the system's
+ resources have been allocated and a "0" is found
+ in "bit_usage" it is a sign that resources are
+ wasted.
+
+ "H":
+ Corresponding region is used by hardware only
+ but available for software use. If a resource
+ has bits set in "shareable_bits" or "io_alloc_cbm"
+ but not all of these bits appear in the resource
+ groups' schemata then the bits appearing in
+		"shareable_bits" or "io_alloc_cbm" but in no
+		resource group's schemata will be marked as "H".
+ "X":
+ Corresponding region is available for sharing and
+ used by hardware and software. These are the bits
+ that appear in "shareable_bits" or "io_alloc_cbm"
+ as well as a resource group's allocation.
+ "S":
+ Corresponding region is used by software
+ and available for sharing.
+ "E":
+ Corresponding region is used exclusively by
+ one resource group. No sharing allowed.
+ "P":
+ Corresponding region is pseudo-locked. No
+ sharing allowed.
+"sparse_masks":
+	Indicates whether non-contiguous 1s values in the CBM are supported.
+
+	"0":
+		Only contiguous 1s values in the CBM are supported.
+	"1":
+		Non-contiguous 1s values in the CBM are supported.
+
+"io_alloc":
+ "io_alloc" enables system software to configure the portion of
+	the cache allocated for I/O traffic. The file exists only if the
+	system supports this feature on some of its cache resources.
+
+ "disabled":
+ Resource supports "io_alloc" but the feature is disabled.
+ Portions of cache used for allocation of I/O traffic cannot
+ be configured.
+ "enabled":
+ Portions of cache used for allocation of I/O traffic
+ can be configured using "io_alloc_cbm".
+ "not supported":
+ Support not available for this resource.
+
+ The feature can be modified by writing to the interface, for example:
+
+ To enable::
+
+ # echo 1 > /sys/fs/resctrl/info/L3/io_alloc
+
+ To disable::
+
+ # echo 0 > /sys/fs/resctrl/info/L3/io_alloc
+
+ The underlying implementation may reduce resources available to
+	general (CPU) cache allocation; see the architecture-specific
+	notes below. The feature can be enabled or disabled depending on
+	usage requirements.
+
+	On AMD systems, the io_alloc feature is supported by the L3 Smart
+ Data Cache Injection Allocation Enforcement (SDCIAE). The CLOSID for
+ io_alloc is the highest CLOSID supported by the resource. When
+ io_alloc is enabled, the highest CLOSID is dedicated to io_alloc and
+ no longer available for general (CPU) cache allocation. When CDP is
+ enabled, io_alloc routes I/O traffic using the highest CLOSID allocated
+ for the instruction cache (CDP_CODE), making this CLOSID no longer
+ available for general (CPU) cache allocation for both the CDP_CODE
+ and CDP_DATA resources.
+
+"io_alloc_cbm":
+ Capacity bitmasks that describe the portions of cache instances to
+	which I/O traffic from supported I/O devices is routed when "io_alloc"
+ is enabled.
+
+ CBMs are displayed in the following format:
+
+ <cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+ Example::
+
+ # cat /sys/fs/resctrl/info/L3/io_alloc_cbm
+ 0=ffff;1=ffff
+
+ CBMs can be configured by writing to the interface.
+
+ Example::
+
+ # echo 1=ff > /sys/fs/resctrl/info/L3/io_alloc_cbm
+ # cat /sys/fs/resctrl/info/L3/io_alloc_cbm
+ 0=ffff;1=00ff
+
+ # echo "0=ff;1=f" > /sys/fs/resctrl/info/L3/io_alloc_cbm
+ # cat /sys/fs/resctrl/info/L3/io_alloc_cbm
+ 0=00ff;1=000f
+
+ When CDP is enabled "io_alloc_cbm" associated with the CDP_DATA and CDP_CODE
+ resources may reflect the same values. For example, values read from and
+ written to /sys/fs/resctrl/info/L3DATA/io_alloc_cbm may be reflected by
+ /sys/fs/resctrl/info/L3CODE/io_alloc_cbm and vice versa.
+
+Memory bandwidth(MB) subdirectory contains the following files
+with respect to allocation:
+
+"min_bandwidth":
+ The minimum memory bandwidth percentage which
+ user can request.
+
+"bandwidth_gran":
+ The granularity in which the memory bandwidth
+ percentage is allocated. The allocated
+ b/w percentage is rounded off to the next
+ control step available on the hardware. The
+ available bandwidth control steps are:
+ min_bandwidth + N * bandwidth_gran.
+
+"delay_linear":
+	Indicates if the delay scale is linear or
+	non-linear. This field is purely informational.
+
+"thread_throttle_mode":
+ Indicator on Intel systems of how tasks running on threads
+ of a physical core are throttled in cases where they
+ request different memory bandwidth percentages:
+
+ "max":
+ the smallest percentage is applied
+ to all threads
+ "per-thread":
+ bandwidth percentages are directly applied to
+ the threads running on the core
+
+If RDT monitoring is available there will be an "L3_MON" directory
+with the following files:
+
+"num_rmids":
+ The number of RMIDs available. This is the
+ upper bound for how many "CTRL_MON" + "MON"
+ groups can be created.
+
+"mon_features":
+ Lists the monitoring events if
+ monitoring is enabled for the resource.
+ Example::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_local_bytes
+
+ If the system supports Bandwidth Monitoring Event
+ Configuration (BMEC), then the bandwidth events will
+ be configurable. The output will be::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_total_bytes_config
+ mbm_local_bytes
+ mbm_local_bytes_config
+
+"mbm_total_bytes_config", "mbm_local_bytes_config":
+ Read/write files containing the configuration for the mbm_total_bytes
+ and mbm_local_bytes events, respectively, when the Bandwidth
+ Monitoring Event Configuration (BMEC) feature is supported.
+ The event configuration settings are domain specific and affect
+ all the CPUs in the domain. When either event configuration is
+ changed, the bandwidth counters for all RMIDs of both events
+ (mbm_total_bytes as well as mbm_local_bytes) are cleared for that
+ domain. The next read for every RMID will report "Unavailable"
+ and subsequent reads will report the valid value.
+
+	The following event types are supported:
+
+ ==== ========================================================
+ Bits Description
+ ==== ========================================================
+ 6 Dirty Victims from the QOS domain to all types of memory
+ 5 Reads to slow memory in the non-local NUMA domain
+ 4 Reads to slow memory in the local NUMA domain
+ 3 Non-temporal writes to non-local NUMA domain
+ 2 Non-temporal writes to local NUMA domain
+ 1 Reads to memory in the non-local NUMA domain
+ 0 Reads to memory in the local NUMA domain
+ ==== ========================================================
+
+ By default, the mbm_total_bytes configuration is set to 0x7f to count
+ all the event types and the mbm_local_bytes configuration is set to
+ 0x15 to count all the local memory events.
+
+ Examples:
+
+	* To view the current configuration:
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+ 0=0x7f;1=0x7f;2=0x7f;3=0x7f
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+ 0=0x15;1=0x15;3=0x15;4=0x15
+
+ * To change the mbm_total_bytes to count only reads on domain 0,
+	  the bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
+ (in hexadecimal 0x33):
+ ::
+
+ # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+ 0=0x33;1=0x7f;2=0x7f;3=0x7f
+
+	* To change the mbm_local_bytes to count all the slow memory reads on
+	  domains 0 and 1, the bits 4 and 5 need to be set, which is 110000b
+ in binary (in hexadecimal 0x30):
+ ::
+
+ # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+ 0=0x30;1=0x30;3=0x15;4=0x15
+
+"mbm_assign_mode":
+	The supported counter assignment modes. The mode enclosed in brackets
+	is the one currently enabled. The MBM events associated with counters
+	may reset when "mbm_assign_mode" is changed.
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ [mbm_event]
+ default
+
+ "mbm_event":
+
+ mbm_event mode allows users to assign a hardware counter to an RMID, event
+ pair and monitor the bandwidth usage as long as it is assigned. The hardware
+ continues to track the assigned counter until it is explicitly unassigned by
+ the user. Each event within a resctrl group can be assigned independently.
+
+ In this mode, a monitoring event can only accumulate data while it is backed
+ by a hardware counter. Use "mbm_L3_assignments" found in each CTRL_MON and MON
+ group to specify which of the events should have a counter assigned. The number
+ of counters available is described in the "num_mbm_cntrs" file. Changing the
+ mode may cause all counters on the resource to reset.
+
+ Moving to mbm_event counter assignment mode requires users to assign the counters
+ to the events. Otherwise, the MBM event counters will return 'Unassigned' when read.
+
+ The mode is beneficial for AMD platforms that support more CTRL_MON
+ and MON groups than available hardware counters. By default, this
+ feature is enabled on AMD platforms with the ABMC (Assignable Bandwidth
+ Monitoring Counters) capability, ensuring counters remain assigned even
+ when the corresponding RMID is not actively used by any processor.
+
+ "default":
+
+	In default mode, resctrl assumes there is a hardware counter for each
+	event within every CTRL_MON and MON group. On AMD platforms the
+	hardware may re-allocate counters between reads, which can reset MBM
+	events and produce misleading values, or report "Unavailable" if no
+	counter is backing the event; using mbm_event mode, if supported, is
+	therefore recommended.
+
+ * To enable "mbm_event" counter assignment mode:
+ ::
+
+ # echo "mbm_event" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+
+ * To enable "default" monitoring mode:
+ ::
+
+ # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+
+"num_mbm_cntrs":
+ The maximum number of counters (total of available and assigned counters) in
+ each domain when the system supports mbm_event mode.
+
+	For example, on a system with a maximum of 32 memory bandwidth monitoring
+ counters in each of its L3 domains:
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
+ 0=32;1=32
+
+"available_mbm_cntrs":
+ The number of counters available for assignment in each domain when mbm_event
+ mode is enabled on the system.
+
+ For example, on a system with 30 available [hardware] assignable counters
+ in each of its L3 domains:
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
+ 0=30;1=30
+
+"event_configs":
+ Directory that exists when "mbm_event" counter assignment mode is supported.
+ Contains a sub-directory for each MBM event that can be assigned to a counter.
+
+ Two MBM events are supported by default: mbm_local_bytes and mbm_total_bytes.
+ Each MBM event's sub-directory contains a file named "event_filter" that is
+ used to view and modify which memory transactions the MBM event is configured
+ with. The file is accessible only when "mbm_event" counter assignment mode is
+ enabled.
+
+ List of memory transaction types supported:
+
+ ========================== ========================================================
+ Name Description
+ ========================== ========================================================
+ dirty_victim_writes_all Dirty Victims from the QOS domain to all types of memory
+ remote_reads_slow_memory Reads to slow memory in the non-local NUMA domain
+ local_reads_slow_memory Reads to slow memory in the local NUMA domain
+ remote_non_temporal_writes Non-temporal writes to non-local NUMA domain
+ local_non_temporal_writes Non-temporal writes to local NUMA domain
+ remote_reads Reads to memory in the non-local NUMA domain
+ local_reads Reads to memory in the local NUMA domain
+ ========================== ========================================================
+
+ For example::
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+ local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes,
+ local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+ local_reads,local_non_temporal_writes,local_reads_slow_memory
+
+	Modify the event configuration by writing to the "event_filter" file
+	within the "event_configs" directory. The read/write "event_filter"
+	file reflects which memory transactions are counted by the event.
+
+ For example::
+
+	 # echo "local_reads, local_non_temporal_writes" > \
+	   /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+ local_reads,local_non_temporal_writes
+
+"mbm_assign_on_mkdir":
+ Exists when "mbm_event" counter assignment mode is supported. Accessible
+ only when "mbm_event" counter assignment mode is enabled.
+
+ Determines if a counter will automatically be assigned to an RMID, MBM event
+	pair when its associated monitor group is created via mkdir. Enabled by
+	default at boot and when switching from "default" mode to "mbm_event"
+	counter assignment mode. Users can disable this capability by writing
+	to the interface.
+
+ "0":
+ Auto assignment is disabled.
+ "1":
+ Auto assignment is enabled.
+
+ Example::
+
+ # echo 0 > /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir
+ 0
+
+"max_threshold_occupancy":
+	Read/write file that provides the largest value (in
+ bytes) at which a previously used LLC_occupancy
+ counter can be considered for re-use.
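+
+	Example (the threshold value, in bytes, is illustrative)::
+
+	 # echo 65536 > /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy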
+
+Finally, in the top level of the "info" directory there is a file
+named "last_cmd_status". This is reset with every "command" issued
+via the file system (making new directories or writing to any of the
+control files). If the command was successful, it will read as "ok".
+If the command failed, it will provide more information than can be
+conveyed in the error returns from file operations. E.g.
+::
+
+ # echo L3:0=f7 > schemata
+ bash: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ mask f7 has non-consecutive 1-bits
+
+Resource alloc and monitor groups
+=================================
+
+Resource groups are represented as directories in the resctrl file
+system. The default group is the root directory which, immediately
+after mounting, owns all the tasks and cpus in the system and can make
+full use of all resources.
+
+On a system with RDT control features additional directories can be
+created in the root directory that specify different amounts of each
+resource (see "schemata" below). The root and these additional top level
+directories are referred to as "CTRL_MON" groups below.
+
+On a system with RDT monitoring the root directory and other top level
+directories contain a directory named "mon_groups" in which additional
+directories can be created to monitor subsets of tasks in the CTRL_MON
+group that is their ancestor. These are called "MON" groups in the rest
+of this document.
+
+Removing a directory will move all tasks and cpus owned by the group it
+represents to the parent. Removing one of the created CTRL_MON groups
+will automatically remove all MON groups below it.
+
+Moving MON group directories to a new parent CTRL_MON group is supported
+for the purpose of changing the resource allocations of a MON group
+without impacting its monitoring data or assigned tasks. This operation
+is not allowed for MON groups which monitor CPUs. No other move
+operation is currently allowed other than simply renaming a CTRL_MON or
+MON group.
+
+All groups contain the following files:
+
+"tasks":
+ Reading this file shows the list of all tasks that belong to
+ this group. Writing a task id to the file will add a task to the
+ group. Multiple tasks can be added by separating the task ids
+ with commas. Tasks will be assigned sequentially. Multiple
+ failures are not supported. A single failure encountered while
+	attempting to assign a task will cause the operation to abort;
+	tasks already added before the failure will remain in the group.
+ Failures will be logged to /sys/fs/resctrl/info/last_cmd_status.
+
+ If the group is a CTRL_MON group the task is removed from
+ whichever previous CTRL_MON group owned the task and also from
+ any MON group that owned the task. If the group is a MON group,
+ then the task must already belong to the CTRL_MON parent of this
+ group. The task is removed from any previous MON group.
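+
+	Example (the task ids and the group name "grp1" are
+	illustrative)::
+
+	 # echo 1234,5678 > /sys/fs/resctrl/grp1/tasks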
+
+
+"cpus":
+ Reading this file shows a bitmask of the logical CPUs owned by
+ this group. Writing a mask to this file will add and remove
+ CPUs to/from this group. As with the tasks file a hierarchy is
+ maintained where MON groups may only include CPUs owned by the
+ parent CTRL_MON group.
+ When the resource group is in pseudo-locked mode this file will
+ only be readable, reflecting the CPUs associated with the
+ pseudo-locked region.
+
+
+"cpus_list":
+ Just like "cpus", only using ranges of CPUs instead of bitmasks.
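+
+	Example (hypothetical group "grp1"; the hex mask "f" and the
+	range "0-3" describe the same CPUs 0-3)::
+
+	 # echo f > /sys/fs/resctrl/grp1/cpus
+	 # cat /sys/fs/resctrl/grp1/cpus_list
+	 0-3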
+
+
+When control is enabled all CTRL_MON groups will also contain:
+
+"schemata":
+ A list of all the resources available to this group.
+ Each resource has its own line and format - see below for details.
+
+"size":
+ Mirrors the display of the "schemata" file to display the size in
+ bytes of each allocation instead of the bits representing the
+ allocation.
+
+"mode":
+ The "mode" of the resource group dictates the sharing of its
+ allocations. A "shareable" resource group allows sharing of its
+ allocations while an "exclusive" resource group does not. A
+ cache pseudo-locked region is created by first writing
+ "pseudo-locksetup" to the "mode" file before writing the cache
+ pseudo-locked region's schemata to the resource group's "schemata"
+ file. On successful pseudo-locked region creation the mode will
+ automatically change to "pseudo-locked".
+
+"ctrl_hw_id":
+ Available only with debug option. The identifier used by hardware
+ for the control group. On x86 this is the CLOSID.
+
+When monitoring is enabled all MON groups will also contain:
+
+"mon_data":
+ This contains a set of files organized by L3 domain and by
+ RDT event. E.g. on a system with two L3 domains there will
+ be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+ directories have one file per event (e.g. "llc_occupancy",
+ "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
+ files provide a read out of the current value of the event for
+ all tasks in the group. In CTRL_MON groups these files provide
+ the sum for all tasks in the CTRL_MON group and all tasks in
+ MON groups. Please see example section for more details on usage.
+ On systems with Sub-NUMA Cluster (SNC) enabled there are extra
+ directories for each node (located within the "mon_L3_XX" directory
+ for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+ where "YY" is the node number.
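+
+	Example (the occupancy value shown is illustrative)::
+
+	 # cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
+	 16234000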
+
+ When the 'mbm_event' counter assignment mode is enabled, reading
+ an MBM event of a MON group returns 'Unassigned' if no hardware
+ counter is assigned to it. For CTRL_MON groups, 'Unassigned' is
+ returned if the MBM event does not have an assigned counter in the
+ CTRL_MON group nor in any of its associated MON groups.
+
+"mon_hw_id":
+ Available only with debug option. The identifier used by hardware
+ for the monitor group. On x86 this is the RMID.
+
+When monitoring is enabled all MON groups may also contain:
+
+"mbm_L3_assignments":
+ Exists when "mbm_event" counter assignment mode is supported and lists the
+ counter assignment states of the group.
+
+ The assignment list is displayed in the following format:
+
+ <Event>:<Domain ID>=<Assignment state>;<Domain ID>=<Assignment state>
+
+ Event: A valid MBM event in the
+ /sys/fs/resctrl/info/L3_MON/event_configs directory.
+
+ Domain ID: A valid domain ID. When writing, '*' applies the changes
+ to all the domains.
+
+ Assignment states:
+
+ _ : No counter assigned.
+
+ e : Counter assigned exclusively.
+
+ Example:
+
+ To display the counter assignment states for the default group.
+ ::
+
+ # cd /sys/fs/resctrl
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
+ Assignments can be modified by writing to the interface.
+
+ Examples:
+
+ To unassign the counter associated with the mbm_total_bytes event on domain 0:
+ ::
+
+ # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=e
+ mbm_local_bytes:0=e;1=e
+
+ To unassign the counter associated with the mbm_total_bytes event on all the domains:
+ ::
+
+ # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=_
+ mbm_local_bytes:0=e;1=e
+
+ To assign a counter associated with the mbm_total_bytes event on all domains in
+ exclusive mode:
+ ::
+
+ # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
+When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
+
+"mba_MBps_event":
+ Reading this file shows which memory bandwidth event is used
+ as input to the software feedback loop that keeps memory bandwidth
+ below the value specified in the schemata file. Writing the
+ name of one of the supported memory bandwidth events found in
+ /sys/fs/resctrl/info/L3_MON/mon_features changes the input
+ event.
+
+Resource allocation rules
+-------------------------
+
+When a task is running the following rules define which resources are
+available to it:
+
+1) If the task is a member of a non-default group, then the schemata
+ for that group is used.
+
+2) Else if the task belongs to the default group, but is running on a
+ CPU that is assigned to some specific group, then the schemata for the
+ CPU's group is used.
+
+3) Otherwise the schemata for the default group is used.
+
+Resource monitoring rules
+-------------------------
+1) If a task is a member of a MON group, or non-default CTRL_MON group
+ then RDT events for the task will be reported in that group.
+
+2) If a task is a member of the default CTRL_MON group, but is running
+ on a CPU that is assigned to some specific group, then the RDT events
+ for the task will be reported in that group.
+
+3) Otherwise RDT events for the task will be reported in the root level
+ "mon_data" group.
+
+
+Notes on cache occupancy monitoring and control
+===============================================
+When moving a task from one group to another you should remember that
+this only affects *new* cache allocations by the task. E.g. you may have
+a task in a monitor group showing 3 MB of cache occupancy. If you move
+it to a new group and immediately check the occupancy of the old and new
+groups you will likely see that the old group is still showing 3 MB and
+the new group zero. When the task accesses locations still in cache from
+before the move, the h/w does not update any counters. On a busy system
+you will likely see the occupancy in the old group go down as cache lines
+are evicted and re-used while the occupancy in the new group rises as
+the task accesses memory and loads into the cache are counted based on
+membership in the new group.
+
+The same applies to cache allocation control. Moving a task to a group
+with a smaller cache partition will not evict any cache lines. The
+process may continue to use them from the old partition.
+
+Hardware uses a CLOSID (Class Of Service ID) and an RMID (Resource
+Monitoring ID) to identify a control group and a monitoring group
+respectively. Each resource group is mapped to these IDs based on the
+kind of group. The number of CLOSIDs and RMIDs is limited by the
+hardware, hence the creation of a "CTRL_MON" directory may fail if we
+run out of either CLOSID or RMID, and creation of a "MON" group may
+fail if we run out of RMIDs.
+
+max_threshold_occupancy - generic concepts
+------------------------------------------
+
+Note that an RMID, once freed, may not be immediately available for use
+as the RMID may still be tagged to cache lines of its previous user.
+Hence such RMIDs are placed on a limbo list and periodically checked to
+see whether the cache occupancy has gone down. If at some point the
+system has many limbo RMIDs that are not yet ready for use, the user
+may see an -EBUSY during mkdir.
+
+max_threshold_occupancy is a user configurable value to determine the
+occupancy at which an RMID can be freed.
+
+The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in
+bytes for a subset of RMIDs that are not immediately available for
+allocation. It can't be relied on to produce output every second; it may
+be necessary to attempt to create an empty monitor group to force an
+update. Output may only be produced if creation of a control or monitor
+group fails.
+
+Schemata files - general concepts
+---------------------------------
+Each line in the file describes one resource. The line starts with
+the name of the resource, followed by specific values to be applied
+in each of the instances of that resource on the system.
+
+Cache IDs
+---------
+On current generation systems there is one L3 cache per socket and L2
+caches are generally just shared by the hyperthreads on a core, but this
+isn't an architectural requirement. We could have multiple separate L3
+caches on a socket, and multiple cores could share an L2 cache. So
+instead of using "socket" or "core" to define the set of logical cpus
+sharing a resource we use a "Cache ID". At a given cache level this will
+be a unique number across the whole system (but it isn't guaranteed to
+be a contiguous sequence; there may be gaps). To find the ID for each
+logical CPU, look in /sys/devices/system/cpu/cpu*/cache/index*/id
+
+Cache Bit Masks (CBM)
+---------------------
+For cache resources we describe the portion of the cache that is available
+for allocation using a bitmask. The maximum value of the mask is defined
+by each cpu model (and may be different for different cache levels). It
+is found using CPUID, but is also provided in the "info" directory of
+the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware
+requires that these masks have all the '1' bits in a contiguous block. So
+0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
+and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks
+to see whether non-contiguous 1s values are supported. On a system
+with a 20-bit mask each bit represents 5% of the capacity of the cache.
+You could partition the cache into four equal parts with masks: 0x1f,
+0x3e0, 0x7c00, 0xf8000.
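+
+Continuing that example, a sketch of giving the first of those four
+parts to a hypothetical group "grp1" (the schemata syntax is described
+in the sections below)::
+
+  # echo "L3:0=1f" > /sys/fs/resctrl/grp1/schemata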
+
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
+"mon_sub_L3_YY" directories to get node local data.
+
+Memory bandwidth allocation is still performed at the L3 cache
+level. I.e. throttling controls are applied to all SNC nodes.
+
+L3 cache allocation bitmaps also apply to all SNC nodes. But note that
+the amount of L3 cache represented by each bit is divided by the number
+of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
+allocation masks each bit normally represents 10MB. With SNC mode enabled
+with two SNC nodes per L3 cache, each bit only represents 5MB.
+
+Memory bandwidth Allocation and monitoring
+==========================================
+
+For Memory bandwidth resource, by default the user controls the resource
+by indicating the percentage of total memory bandwidth.
+
+The minimum bandwidth percentage value for each cpu model is predefined
+and can be looked up through "info/MB/min_bandwidth". The bandwidth
+granularity that is allocated is also dependent on the cpu model and can
+be looked up at "info/MB/bandwidth_gran". The available bandwidth
+control steps are: min_bw + N * bw_gran. Intermediate values are rounded
+to the next control step available on the hardware.
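+
+For example, on hypothetical hardware with min_bandwidth = 10 and
+bandwidth_gran = 10, the valid control steps are 10, 20, ..., 100, so a
+request of 35 is rounded up to the next step, 40 (only the MB line of
+the schemata is shown)::
+
+  # echo "MB:0=35" > /sys/fs/resctrl/grp1/schemata
+  # cat /sys/fs/resctrl/grp1/schemata
+  MB:0=40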
+
+The bandwidth throttling is a core specific mechanism on some Intel
+SKUs. Using a high bandwidth and a low bandwidth setting on two threads
+sharing a core may result in both threads being throttled to use the
+low bandwidth (see "thread_throttle_mode").
+
+The fact that Memory bandwidth allocation (MBA) may be a core
+specific mechanism whereas memory bandwidth monitoring (MBM) is done at
+the package level may lead to confusion when users try to apply control
+via MBA and then monitor the bandwidth to see if the controls are
+effective. Such scenarios include:
+
+1. Users may *not* see an increase in actual bandwidth when percentage
+ values are increased:
+
+This can occur when aggregate L2 external bandwidth is more than L3
+external bandwidth. Consider an SKL SKU with 24 cores on a package where
+L2 external bandwidth is 10GBps (hence aggregate L2 external bandwidth is
+240GBps) and L3 external bandwidth is 100GBps. Now a workload with 20
+threads, each given 50% bandwidth and consuming 5GBps, consumes the max L3
+bandwidth of 100GBps even though the specified percentage value is only
+50%, well below 100%. Hence increasing the bandwidth percentage will not
+yield any more bandwidth. This is because although the L2 external
+bandwidth still has capacity, the L3 external bandwidth is fully used.
+Also note that this behavior depends on the number of cores the benchmark
+is run on.
+
+2. The same bandwidth percentage may mean different actual bandwidth
+ depending on the number of threads:
+
+For the same SKU as in #1, a 'single thread, with 10% bandwidth' and '4
+threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
+they have the same percentage bandwidth of 10%. This is simply because as
+threads start using more cores in an rdtgroup, the actual bandwidth may
+increase or vary although the user specified bandwidth percentage is the
+same.
+
+In order to mitigate this and make the interface more user friendly,
+resctrl added support for specifying the bandwidth in MiBps as well. The
+kernel underneath would use a software feedback mechanism or a "Software
+Controller (mba_sc)" which reads the actual bandwidth using MBM counters
+and adjusts the memory bandwidth percentages to ensure::
+
+ "actual bandwidth < user specified bandwidth".
+
+By default, the schemata would take the bandwidth percentage values
+whereas the user can switch to the "MBA software controller" mode using
+the mount option 'mba_MBps'. The schemata format is specified in the
+sections below.
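+
+A minimal sketch of switching to this mode (resctrl must be remounted
+for a mount option to take effect)::
+
+ # umount /sys/fs/resctrl
+ # mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl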
+
+L3 schemata file details (code and data prioritization disabled)
+----------------------------------------------------------------
+With CDP disabled the L3 schemata format is::
+
+ L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L3 schemata file details (CDP enabled via mount option to resctrl)
+------------------------------------------------------------------
+When CDP is enabled L3 control is split into two separate resources
+so you can specify independent masks for code and data like this::
+
+ L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+ L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L2 schemata file details
+------------------------
+CDP is supported at L2 using the 'cdpl2' mount option. The schemata
+format is either::
+
+ L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+or
+
+ L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+ L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
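+
+For example, a minimal sketch of enabling L2 CDP at mount time, after
+which the L2DATA/L2CODE form applies::
+
+ # umount /sys/fs/resctrl
+ # mount -t resctrl -o cdpl2 resctrl /sys/fs/resctrl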
+
+
+Memory bandwidth Allocation (default mode)
+------------------------------------------
+
+Memory bandwidth domain is L3 cache.
+::
+
+ MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
+Memory bandwidth Allocation specified in MiBps
+----------------------------------------------
+
+Memory bandwidth domain is L3 cache.
+::
+
+ MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
+
+Slow Memory Bandwidth Allocation (SMBA)
+---------------------------------------
+AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
+CXL.memory is the only supported "slow" memory device. With the
+support of SMBA, the hardware enables bandwidth allocation on
+the slow memory devices. If there are multiple such devices in
+the system, the throttling logic groups all the slow sources
+together and applies the limit on them as a whole.
+
+The presence of SMBA (with CXL.memory) is independent of the presence of
+slow memory devices. If there are no such devices on the system, then
+configuring SMBA will have no impact on the performance of the system.
+
+The bandwidth domain for slow memory is L3 cache. Its schemata file
+is formatted as:
+::
+
+ SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
+Reading/writing the schemata file
+---------------------------------
+Reading the schemata file will show the state of all resources
+on all domains. When writing you only need to specify those values
+which you wish to change. E.g.
+::
+
+ # cat schemata
+ L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
+ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+ # echo "L3DATA:2=3c0;" > schemata
+ # cat schemata
+ L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
+ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+
+Reading/writing the schemata file (on AMD systems)
+--------------------------------------------------
+Reading the schemata file will show the current bandwidth limit on all
+domains. The allocated resources are in multiples of one eighth GB/s,
+so e.g. a value of 16 corresponds to 16/8 = 2GB/s. When writing to the
+file, you need to specify the cache id for which you wish to configure
+the bandwidth limit.
+
+For example, to allocate a 2GB/s limit on cache id 1:
+
+::
+
+ # cat schemata
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "MB:1=16" > schemata
+ # cat schemata
+ MB:0=2048;1= 16;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Reading/writing the schemata file (on AMD systems) with SMBA feature
+--------------------------------------------------------------------
+Reading and writing the schemata file is the same as described in the
+section above.
+
+For example, to allocate an 8GB/s limit on cache id 1:
+
+::
+
+ # cat schemata
+ SMBA:0=2048;1=2048;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "SMBA:1=64" > schemata
+ # cat schemata
+ SMBA:0=2048;1= 64;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Cache Pseudo-Locking
+====================
+CAT enables a user to specify the amount of cache space that an
+application can fill. Cache pseudo-locking builds on the fact that a
+CPU can still read and write data pre-allocated outside its current
+allocated area on a cache hit. With cache pseudo-locking, data can be
+preloaded into a reserved portion of cache that no application can
+fill, and from that point on will only serve cache hits. The cache
+pseudo-locked memory is made accessible to user space where an
+application can map it into its virtual address space and thus have
+a region of memory with reduced average read latency.
+
+The creation of a cache pseudo-locked region is triggered by a request
+from the user that is accompanied by a schemata of the region to be
+pseudo-locked. The cache pseudo-locked region is created as follows:
+
+- Create a CAT allocation CLOSNEW with a CBM matching the schemata
+ from the user of the cache region that will contain the pseudo-locked
+ memory. This region must not overlap with any current CAT allocation/CLOS
+ on the system and no future overlap with this cache region is allowed
+ while the pseudo-locked region exists.
+- Create a contiguous region of memory of the same size as the cache
+ region.
+- Flush the cache, disable hardware prefetchers, disable preemption.
+- Make CLOSNEW the active CLOS and touch the allocated memory to load
+ it into the cache.
+- Set the previous CLOS as active.
+- At this point the closid CLOSNEW can be released - the cache
+ pseudo-locked region is protected as long as its CBM does not appear in
+ any CAT allocation. Even though the cache pseudo-locked region will from
+ this point on not appear in any CBM of any CLOS, an application running
+ with any CLOS will be able to access the memory in the pseudo-locked
+ region since the region continues to serve cache hits.
+- The contiguous region of memory loaded into the cache is exposed to
+ user-space as a character device.
+
+Cache pseudo-locking increases the probability that data will remain
+in the cache via carefully configuring the CAT feature and controlling
+application behavior. There is no guarantee that data is placed in
+cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
+“locked” data from cache. Power management C-states may shrink or
+power off cache. Deeper C-states will automatically be restricted on
+pseudo-locked region creation.
+
+It is required that an application using a pseudo-locked region runs
+with affinity to the cores (or a subset of the cores) associated
+with the cache on which the pseudo-locked region resides. A sanity check
+within the code will not allow an application to map pseudo-locked memory
+unless it runs with affinity to cores associated with the cache on which the
+pseudo-locked region resides. The sanity check is only done during the
+initial mmap() handling; there is no enforcement afterwards and the
+application itself needs to ensure it remains affine to the correct cores.
+
+Pseudo-locking is accomplished in two stages:
+
+1) During the first stage the system administrator allocates a portion
+ of cache that should be dedicated to pseudo-locking. At this time an
+ equivalent portion of memory is allocated, loaded into the allocated
+ cache portion, and exposed as a character device.
+2) During the second stage a user-space application maps (mmap()) the
+ pseudo-locked memory into its address space.
+
+Cache Pseudo-Locking Interface
+------------------------------
+A pseudo-locked region is created using the resctrl interface as follows:
+
+1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
+2) Change the new resource group's mode to "pseudo-locksetup" by writing
+ "pseudo-locksetup" to the "mode" file.
+3) Write the schemata of the pseudo-locked region to the "schemata" file. All
+ bits within the schemata should be "unused" according to the "bit_usage"
+ file.
+
+On successful pseudo-locked region creation the "mode" file will contain
+"pseudo-locked" and a new character device with the same name as the resource
+group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
+by user space in order to obtain access to the pseudo-locked memory region.
+
+An example of cache pseudo-locked region creation and usage can be found below.
+
+Cache Pseudo-Locking Debugging Interface
+----------------------------------------
+The pseudo-locking debugging interface is enabled by default (if
+CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
+
+There is no explicit way for the kernel to test if a provided memory
+location is present in the cache. The pseudo-locking debugging interface uses
+the tracing infrastructure to provide two ways to measure cache residency of
+the pseudo-locked region:
+
+1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
+ from these measurements are best visualized using a hist trigger (see
+ example below). In this test the pseudo-locked region is traversed at
+ a stride of 32 bytes while hardware prefetchers and preemption
+ are disabled. This also provides a substitute visualization of cache
+ hits and misses.
+2) Cache hit and miss measurements using model specific precision counters if
+ available. Depending on the levels of cache on the system the pseudo_lock_l2
+ and pseudo_lock_l3 tracepoints are available.
+
+When a pseudo-locked region is created a new debugfs directory is created for
+it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
+write-only file, pseudo_lock_measure, is present in this directory. The
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+
+1:
+ writing "1" to the pseudo_lock_measure file will trigger the latency
+ measurement captured in the pseudo_lock_mem_latency tracepoint. See
+ example below.
+2:
+ writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l2 tracepoint. See example below.
+3:
+ writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.
+
+Example of latency debugging interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example a pseudo-locked region named "newlock" was created. Here is
+how we can measure the latency in cycles of reading from this region and
+visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
+is set::
+
+ # :> /sys/kernel/tracing/trace
+ # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
+ # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+ # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+ # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+ # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist
+
+ # event histogram
+ #
+ # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
+ #
+
+ { latency: 456 } hitcount: 1
+ { latency: 50 } hitcount: 83
+ { latency: 36 } hitcount: 96
+ { latency: 44 } hitcount: 174
+ { latency: 48 } hitcount: 195
+ { latency: 46 } hitcount: 262
+ { latency: 42 } hitcount: 693
+ { latency: 40 } hitcount: 3204
+ { latency: 38 } hitcount: 3484
+
+ Totals:
+ Hits: 8192
+ Entries: 9
+ Dropped: 0
+
+Example of cache hits/misses debugging
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example a pseudo-locked region named "newlock" was created on the L2
+cache of a platform. Here is how we can obtain details of the cache hits
+and misses using the platform's precision counters.
+::
+
+ # :> /sys/kernel/tracing/trace
+ # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+ # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+ # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+ # cat /sys/kernel/tracing/trace
+
+ # tracer: nop
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / delay
+ # TASK-PID CPU# |||| TIMESTAMP FUNCTION
+ # | | | |||| | |
+ pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0
+
+
+Examples for RDT allocation usage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1) Example 1
+
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks, minimum b/w of 10% with a memory bandwidth
+granularity of 10%.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+ # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Similarly, tasks that are under the control of group "p0" may use a
+maximum memory b/w of 50% on socket0 and 50% on socket 1.
+Tasks in group "p1" may also use 50% memory b/w on both sockets.
+Note that unlike cache masks, memory b/w cannot specify whether these
+allocations can overlap or not. The allocation specifies the maximum
+b/w that the group may be able to use and the system admin can configure
+the b/w accordingly.
+
+If resctrl is using the software controller (mba_sc) then the user can
+enter the max b/w in MiBps rather than as a percentage.
+::
+
+ # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
+
+In the above example the tasks in "p1" and "p0" on socket 0 would use a max
+b/w of 1024MiBps whereas on socket 1 they would use 500MiBps.
+
+2) Example 2
+
+Again two sockets, but this time with a more realistic 20-bit mask.
+
+Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
+processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
+neighbors, each of the two real-time tasks exclusively occupies one quarter
+of L3 cache on socket 0.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
+ordinary tasks::
+
+ # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
+
+Next we make a resource group for our first real time task and give
+it access to the "top" 25% of the cache on socket 0.
+::
+
+ # mkdir p0
+ # echo "L3:0=f8000;1=fffff" > p0/schemata
+
+Finally we move our first real time task into this resource group. We
+also use taskset(1) to ensure the task always runs on a dedicated CPU
+on socket 0. Most uses of resource groups will also constrain which
+processors tasks run on.
+::
+
+ # echo 1234 > p0/tasks
+ # taskset -cp 1 1234
+
+Ditto for the second real time task (with the remaining 25% of cache)::
+
+ # mkdir p1
+ # echo "L3:0=7c00;1=fffff" > p1/schemata
+ # echo 5678 > p1/tasks
+ # taskset -cp 2 5678
+
+For the same 2 socket system with memory b/w resource and CAT L3 the
+schemata would look like this (assuming min_bandwidth is 10 and
+bandwidth_gran is 10):
+
+For our first real time task this would request 20% memory b/w on socket 0.
+::
+
+ # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
+
+For our second real time task this would request another 20% memory b/w
+on socket 0.
+::
+
+ # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
+
+3) Example 3
+
+A single socket system which has real-time tasks running on cores 4-7 and
+non real-time workload assigned to cores 0-3. The real-time tasks share text
+and data, so a per task association is not required and due to interaction
+with the kernel it's desired that the kernel on these cores shares L3 with
+the tasks.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
+cannot be used by ordinary tasks::
+
+ # echo "L3:0=3ff\nMB:0=50" > schemata
+
+Next we make a resource group for our real time cores and give it access
+to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
+socket 0.
+::
+
+ # mkdir p0
+ # echo "L3:0=ffc00\nMB:0=50" > p0/schemata
+
+Finally we move cores 4-7 over to the new group and make sure that the
+kernel and the tasks running there get 50% of the cache. They should
+also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
+siblings and only the real time threads are scheduled on the cores 4-7.
+::
+
+ # echo F0 > p0/cpus
+
+4) Example 4
+
+The resource groups in previous examples were all in the default "shareable"
+mode allowing sharing of their cache allocations. If one resource group
+configures a cache allocation then nothing prevents another resource group
+from overlapping with that allocation.
+
+In this example a new exclusive resource group will be created on a L2 CAT
+system with two L2 cache instances that can be configured with an 8-bit
+capacity bitmask. The new exclusive resource group will be configured to use
+25% of each cache instance.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+ # cd /sys/fs/resctrl
+
+First, we observe that the default group is configured to allocate to all L2
+cache::
+
+ # cat schemata
+ L2:0=ff;1=ff
+
+We could attempt to create the new resource group at this point, but it will
+fail because of the overlap with the schemata of the default group::
+
+ # mkdir p0
+ # echo 'L2:0=0x3;1=0x3' > p0/schemata
+ # cat p0/mode
+ shareable
+ # echo exclusive > p0/mode
+ -sh: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ schemata overlaps
+
+To ensure that there is no overlap with another resource group the default
+resource group's schemata has to change, making it possible for the new
+resource group to become exclusive.
+::
+
+ # echo 'L2:0=0xfc;1=0xfc' > schemata
+ # echo exclusive > p0/mode
+ # grep . p0/*
+ p0/cpus:0
+ p0/mode:exclusive
+ p0/schemata:L2:0=03;1=03
+ p0/size:L2:0=262144;1=262144
+
+A newly created resource group will not overlap with an exclusive resource
+group::
+
+ # mkdir p1
+ # grep . p1/*
+ p1/cpus:0
+ p1/mode:shareable
+ p1/schemata:L2:0=fc;1=fc
+ p1/size:L2:0=786432;1=786432
+
+The bit_usage will reflect how the cache is used::
+
+ # cat info/L2/bit_usage
+ 0=SSSSSSEE;1=SSSSSSEE
+
+A resource group cannot be forced to overlap with an exclusive resource group::
+
+ # echo 'L2:0=0x1;1=0x1' > p1/schemata
+ -sh: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ overlaps with exclusive group
+
+Example of Cache Pseudo-Locking
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Lock a portion of the L2 cache of cache id 1 using CBM 0x3. The
+pseudo-locked region is exposed at /dev/pseudo_lock/newlock and can be
+provided to an application as an argument to mmap().
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+ # cd /sys/fs/resctrl
+
+Ensure that there are bits available that can be pseudo-locked. Since only
+unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
+removed from the default resource group's schemata::
+
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSSSS
+ # echo 'L2:1=0xfc' > schemata
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSS00
+
+Create a new resource group that will be associated with the pseudo-locked
+region, indicate that it will be used for a pseudo-locked region, and
+configure the requested pseudo-locked region capacity bitmask::
+
+ # mkdir newlock
+ # echo pseudo-locksetup > newlock/mode
+ # echo 'L2:1=0x3' > newlock/schemata
+
+On success the resource group's mode will change to pseudo-locked, the
+bit_usage will reflect the pseudo-locked region, and the character device
+exposing the pseudo-locked region will exist::
+
+ # cat newlock/mode
+ pseudo-locked
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSSPP
+ # ls -l /dev/pseudo_lock/newlock
+ crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock
+
+::
+
+ /*
+ * Example code to access one page of pseudo-locked cache region
+ * from user space.
+ */
+ #define _GNU_SOURCE
+ #include <fcntl.h>
+ #include <sched.h>
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <unistd.h>
+ #include <sys/mman.h>
+
+ /*
+ * It is required that the application runs with affinity to only
+ * cores associated with the pseudo-locked region. Here the cpu
+ * is hardcoded for convenience of example.
+ */
+ static int cpuid = 2;
+
+ int main(int argc, char *argv[])
+ {
+ cpu_set_t cpuset;
+ long page_size;
+ void *mapping;
+ int dev_fd;
+ int ret;
+
+ page_size = sysconf(_SC_PAGESIZE);
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpuid, &cpuset);
+ ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
+ if (ret < 0) {
+ perror("sched_setaffinity");
+ exit(EXIT_FAILURE);
+ }
+
+ dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
+ if (dev_fd < 0) {
+ perror("open");
+ exit(EXIT_FAILURE);
+ }
+
+ mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+ dev_fd, 0);
+ if (mapping == MAP_FAILED) {
+ perror("mmap");
+ close(dev_fd);
+ exit(EXIT_FAILURE);
+ }
+
+ /* Application interacts with pseudo-locked memory @mapping */
+
+ ret = munmap(mapping, page_size);
+ if (ret < 0) {
+ perror("munmap");
+ close(dev_fd);
+ exit(EXIT_FAILURE);
+ }
+
+ close(dev_fd);
+ exit(EXIT_SUCCESS);
+ }
+
+Locking between applications
+----------------------------
+
+Certain operations on the resctrl filesystem, composed of read/writes
+to/from multiple files, must be atomic.
+
+As an example, the allocation of an exclusive reservation of L3 cache
+involves:
+
+ 1. Read the cbmmasks from each directory or the per-resource "bit_usage"
+ 2. Find a contiguous set of bits in the global CBM bitmask that is clear
+ in any of the directory cbmmasks
+ 3. Create a new directory
+ 4. Write the bits found in step 2 to the new directory's "schemata" file
+
+If two applications attempt to allocate space concurrently then they can
+end up allocating the same bits so the reservations are shared instead of
+exclusive.
+
+To coordinate atomic operations on the resctrlfs and to avoid the problem
+above, the following locking procedure is recommended:
+
+Locking is based on flock, which is available in libc and also as a shell
+script command.
+
+Write lock:
+
+ A) Take flock(LOCK_EX) on /sys/fs/resctrl
+ B) Read/write the directory structure.
+ C) Release the lock: flock(LOCK_UN)
+
+Read lock:
+
+ A) Take flock(LOCK_SH) on /sys/fs/resctrl
+ B) On success, read the directory structure.
+ C) Release the lock: flock(LOCK_UN)
+
+Example with bash::
+
+ # Atomically read directory structure
+ $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
+
+ # Read directory contents and create new subdirectory
+
+ $ cat create-dir.sh
+ find /sys/fs/resctrl/ > output.txt
+ mask=$(function-of output.txt)   # compute a free bitmask (pseudocode)
+ mkdir /sys/fs/resctrl/newres/
+ echo "$mask" > /sys/fs/resctrl/newres/schemata
+
+ $ flock /sys/fs/resctrl/ ./create-dir.sh
+
+Example with C::
+
+ /*
+ * Example code to take advisory locks
+ * before accessing resctrl filesystem
+ */
+ #include <sys/file.h>
+ #include <stdlib.h>
+
+ void resctrl_take_shared_lock(int fd)
+ {
+ int ret;
+
+ /* take shared lock on resctrl filesystem */
+ ret = flock(fd, LOCK_SH);
+ if (ret) {
+ perror("flock");
+ exit(-1);
+ }
+ }
+
+ void resctrl_take_exclusive_lock(int fd)
+ {
+ int ret;
+
+ /* take exclusive lock on resctrl filesystem */
+ ret = flock(fd, LOCK_EX);
+ if (ret) {
+ perror("flock");
+ exit(-1);
+ }
+ }
+
+ void resctrl_release_lock(int fd)
+ {
+ int ret;
+
+ /* release lock on resctrl filesystem */
+ ret = flock(fd, LOCK_UN);
+ if (ret) {
+ perror("flock");
+ exit(-1);
+ }
+ }
+
+ int main(void)
+ {
+ int fd;
+
+ fd = open("/sys/fs/resctrl", O_DIRECTORY);
+ if (fd == -1) {
+ perror("open");
+ exit(-1);
+ }
+ resctrl_take_shared_lock(fd);
+ /* code to read directory contents */
+ resctrl_release_lock(fd);
+
+ resctrl_take_exclusive_lock(fd);
+ /* code to read and write directory contents */
+ resctrl_release_lock(fd);
+
+ return 0;
+ }
+
+Examples for RDT Monitoring along with allocation usage
+=======================================================
+Reading monitored data
+----------------------
+Reading an event file (for example mon_data/mon_L3_00/llc_occupancy) would
+show the current snapshot of LLC occupancy of the corresponding MON
+group or CTRL_MON group.
+
+
+Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
+------------------------------------------------------------------------
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+ # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+ # echo 5678 > p1/tasks
+ # echo 5679 > p1/tasks
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Create monitor groups and assign a subset of tasks to each monitor group.
+::
+
+ # cd /sys/fs/resctrl/p1/mon_groups
+ # mkdir m11 m12
+ # echo 5678 > m11/tasks
+ # echo 5679 > m12/tasks
+
+Fetch the data (shown in bytes)
+::
+
+ # cat m11/mon_data/mon_L3_00/llc_occupancy
+ 16234000
+ # cat m11/mon_data/mon_L3_01/llc_occupancy
+ 14789000
+ # cat m12/mon_data/mon_L3_00/llc_occupancy
+ 16789000
+
+The parent ctrl_mon group shows the aggregated data.
+::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 31234000
+
+Example 2 (Monitor a task from its creation)
+--------------------------------------------
+On a two socket machine (one L3 cache per socket)::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+
+An RMID is allocated to the group once it is created and hence the <cmd>
+below is monitored from its creation.
+::
+
+ # echo $$ > /sys/fs/resctrl/p1/tasks
+ # <cmd>
+
+Fetch the data::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 31789000
+
+Example 3 (Monitor without CAT support or before creating CAT groups)
+---------------------------------------------------------------------
+
+Assume a system like HSW has only CQM and no CAT support. In this case
+resctrl will still mount but cannot create CTRL_MON directories. But the
+user can create different MON groups within the root group and thereby
+monitor all tasks, including kernel threads.
+
+This can also be used to profile jobs' cache size footprint before
+allocating them to different allocation groups.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir mon_groups/m01
+ # mkdir mon_groups/m02
+
+ # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
+ # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
+
+Monitor the groups separately and also get per domain data. From the
+output below it is apparent that the tasks are mostly doing work on
+domain (socket) 0.
+::
+
+ # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
+ 31234000
+ # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
+ 34555
+ # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
+ 31234000
+ # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
+ 32789
+
+
+Example 4 (Monitor real time tasks)
+-----------------------------------
+
+A single socket system which has real time tasks running on cores 4-7
+and non real time tasks on other cpus. We want to monitor the cache
+occupancy of the real time threads on these cores.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p1
+
+Move the cpus 4-7 over to p1::
+
+ # echo f0 > p1/cpus
+
+View the llc occupancy snapshot::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 11234000
+
+
+Examples on working with mbm_assign_mode
+========================================
+
+a. Check if MBM counter assignment mode is supported.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ [mbm_event]
+ default
+
+The "mbm_event" mode is detected and enabled.
+
+b. Check how many assignable counters are supported.
+::
+
+ # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
+ 0=32;1=32
+
+c. Check how many assignable counters are available for assignment in each domain.
+::
+
+ # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
+ 0=30;1=30
+
+d. To list the default group's assign states.
+::
+
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
+e. To unassign the counter associated with the mbm_total_bytes event on domain 0.
+::
+
+ # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=e
+ mbm_local_bytes:0=e;1=e
+
+f. To unassign the counter associated with the mbm_total_bytes event on all domains.
+::
+
+ # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=_
+ mbm_local_bytes:0=e;1=e
+
+g. To assign a counter associated with the mbm_total_bytes event on all domains in
+exclusive mode.
+::
+
+ # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
+h. Read the events mbm_total_bytes and mbm_local_bytes of the default group.
+Reading the events is unchanged by counter assignment.
+::
+
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
+ 779247936
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
+ 562324232
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
+ 212122123
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
+ 121212144
+
+i. Check the event configurations.
+::
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+ local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes,
+ local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+ local_reads,local_non_temporal_writes,local_reads_slow_memory
+
+j. Change the event configuration for mbm_local_bytes.
+::
+
+ # echo "local_reads, local_non_temporal_writes, local_reads_slow_memory, remote_reads" >
+ /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+ local_reads,local_non_temporal_writes,local_reads_slow_memory,remote_reads
+
+k. Now read the local events again. The first read may come back with "Unavailable"
+status. The subsequent read of mbm_local_bytes will display the current value.
+::
+
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
+ Unavailable
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
+ 2252323
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
+ Unavailable
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
+ 1566565
+
+l. Users have the option to go back to 'default' mbm_assign_mode if required. This can be
+done using the following command. Note that switching the mbm_assign_mode may reset all
+the MBM counters (and thus all MBM events) of all the resctrl groups.
+::
+
+ # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ mbm_event
+ [default]
+
+m. Unmount the resctrl filesystem.
+::
+
+ # umount /sys/fs/resctrl/
+
+Intel RDT Errata
+================
+
+Intel MBM Counters May Report System Memory Bandwidth Incorrectly
+-----------------------------------------------------------------
+
+Errata SKX99 for Skylake server and BDF102 for Broadwell server.
+
+Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
+according to the assigned Resource Monitor ID (RMID) for that logical
+core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
+metrics, may report incorrect system bandwidth for certain RMID values.
+
+Implication: Due to the errata, system memory bandwidth may not match
+what is reported.
+
+Workaround: MBM total and local readings are corrected according to the
+following correction factor table:
+
++---------------+---------------+---------------+-----------------+
+|core count |rmid count |rmid threshold |correction factor|
++---------------+---------------+---------------+-----------------+
+|1 |8 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|2 |16 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|3 |24 |15 |0.969650 |
++---------------+---------------+---------------+-----------------+
+|4 |32 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|6 |48 |31 |0.969650 |
++---------------+---------------+---------------+-----------------+
+|7 |56 |47 |1.142857 |
++---------------+---------------+---------------+-----------------+
+|8 |64 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|9 |72 |63 |1.185115 |
++---------------+---------------+---------------+-----------------+
+|10 |80 |63 |1.066553 |
++---------------+---------------+---------------+-----------------+
+|11 |88 |79 |1.454545 |
++---------------+---------------+---------------+-----------------+
+|12 |96 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|13 |104 |95 |1.230769 |
++---------------+---------------+---------------+-----------------+
+|14 |112 |95 |1.142857 |
++---------------+---------------+---------------+-----------------+
+|15 |120 |95 |1.066667 |
++---------------+---------------+---------------+-----------------+
+|16 |128 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|17 |136 |127 |1.254863 |
++---------------+---------------+---------------+-----------------+
+|18 |144 |127 |1.185255 |
++---------------+---------------+---------------+-----------------+
+|19 |152 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|20 |160 |127 |1.066667 |
++---------------+---------------+---------------+-----------------+
+|21 |168 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|22 |176 |159 |1.454334 |
++---------------+---------------+---------------+-----------------+
+|23 |184 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|24 |192 |127 |0.969744 |
++---------------+---------------+---------------+-----------------+
+|25 |200 |191 |1.280246 |
++---------------+---------------+---------------+-----------------+
+|26 |208 |191 |1.230921 |
++---------------+---------------+---------------+-----------------+
+|27 |216 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|28 |224 |191 |1.143118 |
++---------------+---------------+---------------+-----------------+
+
+If rmid > rmid threshold, MBM total and local values should be multiplied
+by the correction factor.
+
+See:
+
+1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
+http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
+
+2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
+http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf
+
+3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
+https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html
+
+for further information.
diff --git a/Documentation/filesystems/seq_file.rst b/Documentation/filesystems/seq_file.rst
index a6726082a7c2..1e1713d00010 100644
--- a/Documentation/filesystems/seq_file.rst
+++ b/Documentation/filesystems/seq_file.rst
@@ -130,7 +130,7 @@ called SEQ_START_TOKEN; it can be used if you wish to instruct your
show() function (described below) to print a header at the top of the
output. SEQ_START_TOKEN should only be used if the offset is zero,
however. SEQ_START_TOKEN has no special meaning to the core seq_file
-code. It is provided as a convenience for a start() funciton to
+code. It is provided as a convenience for a start() function to
communicate with the next() and show() functions.
The next function to implement is called, amazingly, next(); its job is to
@@ -217,7 +217,7 @@ between the calls to start() and stop(), so holding a lock during that time
is a reasonable thing to do. The seq_file code will also avoid taking any
other locks while the iterator is active.
-The iterater value returned by start() or next() is guaranteed to be
+The iterator value returned by start() or next() is guaranteed to be
passed to a subsequent next() or stop() call. This allows resources
such as locks that were taken to be reliably released. There is *no*
guarantee that the iterator will be passed to show(), though in practice
diff --git a/Documentation/filesystems/sharedsubtree.rst b/Documentation/filesystems/sharedsubtree.rst
index 1cf56489ed48..8b7dc9159083 100644
--- a/Documentation/filesystems/sharedsubtree.rst
+++ b/Documentation/filesystems/sharedsubtree.rst
@@ -31,965 +31,960 @@ and versioned filesystem.
-----------
Shared subtree provides four different flavors of mounts; struct vfsmount to be
-precise
+precise:
- a. shared mount
- b. slave mount
- c. private mount
- d. unbindable mount
+a) A **shared mount** can be replicated to as many mountpoints as desired,
+ and all the replicas continue to be exactly the same.
-2a) A shared mount can be replicated to as many mountpoints and all the
-replicas continue to be exactly same.
+ Here is an example:
- Here is an example:
+ Let's say /mnt has a mount that is shared::
- Let's say /mnt has a mount that is shared::
+ # mount --make-shared /mnt
- mount --make-shared /mnt
+ .. note::
+ mount(8) command now supports the --make-shared flag,
+ so the sample 'smount' program is no longer needed and has been
+ removed.
- Note: mount(8) command now supports the --make-shared flag,
- so the sample 'smount' program is no longer needed and has been
- removed.
+ ::
- ::
+ # mount --bind /mnt /tmp
- # mount --bind /mnt /tmp
+ The above command replicates the mount at /mnt to the mountpoint /tmp
+ and the contents of both the mounts remain identical.
- The above command replicates the mount at /mnt to the mountpoint /tmp
- and the contents of both the mounts remain identical.
+ ::
- ::
+ #ls /mnt
+ a b c
- #ls /mnt
- a b c
+ #ls /tmp
+ a b c
- #ls /tmp
- a b c
+ Now let's say we mount a device at /tmp/a::
- Now let's say we mount a device at /tmp/a::
+ # mount /dev/sd0 /tmp/a
- # mount /dev/sd0 /tmp/a
+ # ls /tmp/a
+ t1 t2 t3
- #ls /tmp/a
- t1 t2 t3
+ # ls /mnt/a
+ t1 t2 t3
- #ls /mnt/a
- t1 t2 t3
+ Note that the mount has propagated to the mount at /mnt as well.
- Note that the mount has propagated to the mount at /mnt as well.
+ And the same is true even when /dev/sd0 is mounted on /mnt/a. The
+ contents will be visible under /tmp/a too.
- And the same is true even when /dev/sd0 is mounted on /mnt/a. The
- contents will be visible under /tmp/a too.
+b) A **slave mount** is like a shared mount except that mount and umount events
+ only propagate towards it.
-2b) A slave mount is like a shared mount except that mount and umount events
- only propagate towards it.
+ All slave mounts have a master mount which is a shared mount.
- All slave mounts have a master mount which is a shared.
+ Here is an example:
- Here is an example:
+ Let's say /mnt has a mount which is shared::
- Let's say /mnt has a mount which is shared.
- # mount --make-shared /mnt
+ # mount --make-shared /mnt
- Let's bind mount /mnt to /tmp
- # mount --bind /mnt /tmp
+ Let's bind mount /mnt to /tmp::
- the new mount at /tmp becomes a shared mount and it is a replica of
- the mount at /mnt.
+ # mount --bind /mnt /tmp
- Now let's make the mount at /tmp; a slave of /mnt
- # mount --make-slave /tmp
+ The new mount at /tmp becomes a shared mount and it is a replica of
+ the mount at /mnt.
- let's mount /dev/sd0 on /mnt/a
- # mount /dev/sd0 /mnt/a
+ Now let's make the mount at /tmp a slave of /mnt::
- #ls /mnt/a
- t1 t2 t3
+ # mount --make-slave /tmp
- #ls /tmp/a
- t1 t2 t3
+ Let's mount /dev/sd0 on /mnt/a::
- Note the mount event has propagated to the mount at /tmp
+ # mount /dev/sd0 /mnt/a
- However let's see what happens if we mount something on the mount at /tmp
+ # ls /mnt/a
+ t1 t2 t3
- # mount /dev/sd1 /tmp/b
+ # ls /tmp/a
+ t1 t2 t3
- #ls /tmp/b
- s1 s2 s3
+ Note the mount event has propagated to the mount at /tmp.
- #ls /mnt/b
+ However let's see what happens if we mount something on the mount at
+ /tmp::
- Note how the mount event has not propagated to the mount at
- /mnt
+ # mount /dev/sd1 /tmp/b
+ # ls /tmp/b
+ s1 s2 s3
-2c) A private mount does not forward or receive propagation.
+ # ls /mnt/b
- This is the mount we are familiar with. Its the default type.
+ Note how the mount event has not propagated to the mount at
+ /mnt.
-2d) A unbindable mount is a unbindable private mount
+c) A **private mount** does not forward or receive propagation.
- let's say we have a mount at /mnt and we make it unbindable::
+ This is the mount we are familiar with. It's the default type.
- # mount --make-unbindable /mnt
- Let's try to bind mount this mount somewhere else::
+d) An **unbindable mount** is, as the name suggests, an unbindable private
+ mount.
- # mount --bind /mnt /tmp
- mount: wrong fs type, bad option, bad superblock on /mnt,
- or too many mounted file systems
+ Let's say we have a mount at /mnt and we make it unbindable::
- Binding a unbindable mount is a invalid operation.
+ # mount --make-unbindable /mnt
+
+ Let's try to bind mount this mount somewhere else::
+
+ # mount --bind /mnt /tmp
+ mount: wrong fs type, bad option, bad superblock on /mnt,
+ or too many mounted file systems
+
+ Binding an unbindable mount is an invalid operation.
3) Setting mount states
-----------------------
- The mount command (util-linux package) can be used to set mount
- states::
+The mount command (util-linux package) can be used to set mount
+states::
- mount --make-shared mountpoint
- mount --make-slave mountpoint
- mount --make-private mountpoint
- mount --make-unbindable mountpoint
+ mount --make-shared mountpoint
+ mount --make-slave mountpoint
+ mount --make-private mountpoint
+ mount --make-unbindable mountpoint
4) Use cases
------------
- A) A process wants to clone its own namespace, but still wants to
- access the CD that got mounted recently.
+A) A process wants to clone its own namespace, but still wants to
+ access the CD that got mounted recently.
- Solution:
+ Solution:
- The system administrator can make the mount at /cdrom shared::
+ The system administrator can make the mount at /cdrom shared::
- mount --bind /cdrom /cdrom
- mount --make-shared /cdrom
+ mount --bind /cdrom /cdrom
+ mount --make-shared /cdrom
- Now any process that clones off a new namespace will have a
- mount at /cdrom which is a replica of the same mount in the
- parent namespace.
+ Now any process that clones off a new namespace will have a
+ mount at /cdrom which is a replica of the same mount in the
+ parent namespace.
- So when a CD is inserted and mounted at /cdrom that mount gets
- propagated to the other mount at /cdrom in all the other clone
- namespaces.
+ So when a CD is inserted and mounted at /cdrom that mount gets
+ propagated to the other mount at /cdrom in all the other clone
+ namespaces.
- B) A process wants its mounts invisible to any other process, but
- still be able to see the other system mounts.
+B) A process wants its mounts invisible to any other process, but
+ still be able to see the other system mounts.
- Solution:
+ Solution:
- To begin with, the administrator can mark the entire mount tree
- as shareable::
+ To begin with, the administrator can mark the entire mount tree
+ as shareable::
- mount --make-rshared /
+ mount --make-rshared /
- A new process can clone off a new namespace. And mark some part
- of its namespace as slave::
+ A new process can clone off a new namespace. And mark some part
+ of its namespace as slave::
- mount --make-rslave /myprivatetree
+ mount --make-rslave /myprivatetree
- Hence forth any mounts within the /myprivatetree done by the
- process will not show up in any other namespace. However mounts
- done in the parent namespace under /myprivatetree still shows
- up in the process's namespace.
+ Henceforth any mounts within the /myprivatetree done by the
+ process will not show up in any other namespace. However mounts
+ done in the parent namespace under /myprivatetree still show
+ up in the process's namespace.
- Apart from the above semantics this feature provides the
- building blocks to solve the following problems:
+Apart from the above semantics this feature provides the
+building blocks to solve the following problems:
- C) Per-user namespace
+C) Per-user namespace
- The above semantics allows a way to share mounts across
- namespaces. But namespaces are associated with processes. If
- namespaces are made first class objects with user API to
- associate/disassociate a namespace with userid, then each user
- could have his/her own namespace and tailor it to his/her
- requirements. This needs to be supported in PAM.
+ The above semantics provide a way to share mounts across
+ namespaces. But namespaces are associated with processes. If
+ namespaces are made first class objects with user API to
+ associate/disassociate a namespace with userid, then each user
+ could have his/her own namespace and tailor it to his/her
+ requirements. This needs to be supported in PAM.
- D) Versioned files
+D) Versioned files
- If the entire mount tree is visible at multiple locations, then
- an underlying versioning file system can return different
- versions of the file depending on the path used to access that
- file.
+ If the entire mount tree is visible at multiple locations, then
+ an underlying versioning file system can return different
+ versions of the file depending on the path used to access that
+ file.
- An example is::
+ An example is::
- mount --make-shared /
- mount --rbind / /view/v1
- mount --rbind / /view/v2
- mount --rbind / /view/v3
- mount --rbind / /view/v4
+ mount --make-shared /
+ mount --rbind / /view/v1
+ mount --rbind / /view/v2
+ mount --rbind / /view/v3
+ mount --rbind / /view/v4
- and if /usr has a versioning filesystem mounted, then that
- mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
- /view/v4/usr too
+ and if /usr has a versioning filesystem mounted, then that
+ mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
+ /view/v4/usr too
- A user can request v3 version of the file /usr/fs/namespace.c
- by accessing /view/v3/usr/fs/namespace.c . The underlying
- versioning filesystem can then decipher that v3 version of the
- filesystem is being requested and return the corresponding
- inode.
+ A user can request the v3 version of the file /usr/fs/namespace.c
+ by accessing /view/v3/usr/fs/namespace.c. The underlying
+ versioning filesystem can then decipher that the v3 version of the
+ filesystem is being requested and return the corresponding
+ inode.
5) Detailed semantics
---------------------
- The section below explains the detailed semantics of
- bind, rbind, move, mount, umount and clone-namespace operations.
-
- Note: the word 'vfsmount' and the noun 'mount' have been used
- to mean the same thing, throughout this document.
+The section below explains the detailed semantics of
+bind, rbind, move, mount, umount and clone-namespace operations.
-5a) Mount states
+.. Note::
+ the word 'vfsmount' and the noun 'mount' have been used
+ to mean the same thing, throughout this document.
- A given mount can be in one of the following states
+a) Mount states
- 1) shared
- 2) slave
- 3) shared and slave
- 4) private
- 5) unbindable
+ A **propagation event** is defined as event generated on a vfsmount
+ that leads to mount or unmount actions in other vfsmounts.
- A 'propagation event' is defined as event generated on a vfsmount
- that leads to mount or unmount actions in other vfsmounts.
+ A **peer group** is defined as a group of vfsmounts that propagate
+ events to each other.
- A 'peer group' is defined as a group of vfsmounts that propagate
- events to each other.
+ A given mount can be in one of the following states:
- (1) Shared mounts
+ (1) Shared mounts
- A 'shared mount' is defined as a vfsmount that belongs to a
- 'peer group'.
+ A **shared mount** is defined as a vfsmount that belongs to a
+ peer group.
- For example::
+ For example::
- mount --make-shared /mnt
- mount --bind /mnt /tmp
+ mount --make-shared /mnt
+ mount --bind /mnt /tmp
- The mount at /mnt and that at /tmp are both shared and belong
- to the same peer group. Anything mounted or unmounted under
- /mnt or /tmp reflect in all the other mounts of its peer
- group.
+ The mount at /mnt and that at /tmp are both shared and belong
+ to the same peer group. Anything mounted or unmounted under
+ /mnt or /tmp reflect in all the other mounts of its peer
+ group.
- (2) Slave mounts
+ (2) Slave mounts
- A 'slave mount' is defined as a vfsmount that receives
- propagation events and does not forward propagation events.
+ A **slave mount** is defined as a vfsmount that receives
+ propagation events and does not forward propagation events.
- A slave mount as the name implies has a master mount from which
- mount/unmount events are received. Events do not propagate from
- the slave mount to the master. Only a shared mount can be made
- a slave by executing the following command::
+ A slave mount as the name implies has a master mount from which
+ mount/unmount events are received. Events do not propagate from
+ the slave mount to the master. Only a shared mount can be made
+ a slave by executing the following command::
- mount --make-slave mount
+ mount --make-slave mount
- A shared mount that is made as a slave is no more shared unless
- modified to become shared.
+ A shared mount that is made a slave is no longer shared unless
+ modified to become shared again.
- (3) Shared and Slave
+ (3) Shared and Slave
- A vfsmount can be both shared as well as slave. This state
- indicates that the mount is a slave of some vfsmount, and
- has its own peer group too. This vfsmount receives propagation
- events from its master vfsmount, and also forwards propagation
- events to its 'peer group' and to its slave vfsmounts.
+ A vfsmount can be both **shared** as well as **slave**. This state
+ indicates that the mount is a slave of some vfsmount, and
+ has its own peer group too. This vfsmount receives propagation
+ events from its master vfsmount, and also forwards propagation
+ events to its 'peer group' and to its slave vfsmounts.
- Strictly speaking, the vfsmount is shared having its own
- peer group, and this peer-group is a slave of some other
- peer group.
+ Strictly speaking, the vfsmount is shared having its own
+ peer group, and this peer-group is a slave of some other
+ peer group.
- Only a slave vfsmount can be made as 'shared and slave' by
- either executing the following command::
+ Only a slave vfsmount can be made 'shared and slave' by
+ either executing the following command::
- mount --make-shared mount
+ mount --make-shared mount
- or by moving the slave vfsmount under a shared vfsmount.
+ or by moving the slave vfsmount under a shared vfsmount.
- (4) Private mount
+ (4) Private mount
- A 'private mount' is defined as vfsmount that does not
- receive or forward any propagation events.
+ A **private mount** is defined as a vfsmount that does not
+ receive or forward any propagation events.
- (5) Unbindable mount
+ (5) Unbindable mount
- A 'unbindable mount' is defined as vfsmount that does not
- receive or forward any propagation events and cannot
- be bind mounted.
+ An **unbindable mount** is defined as a vfsmount that does not
+ receive or forward any propagation events and cannot
+ be bind mounted.
- State diagram:
+ State diagram:
- The state diagram below explains the state transition of a mount,
- in response to various commands::
+ The state diagram below explains the state transition of a mount,
+ in response to various commands::
- -----------------------------------------------------------------------
- | |make-shared | make-slave | make-private |make-unbindab|
- --------------|------------|--------------|--------------|-------------|
- |shared |shared |*slave/private| private | unbindable |
- | | | | | |
- |-------------|------------|--------------|--------------|-------------|
- |slave |shared | **slave | private | unbindable |
- | |and slave | | | |
- |-------------|------------|--------------|--------------|-------------|
- |shared |shared | slave | private | unbindable |
- |and slave |and slave | | | |
- |-------------|------------|--------------|--------------|-------------|
- |private |shared | **private | private | unbindable |
- |-------------|------------|--------------|--------------|-------------|
- |unbindable |shared |**unbindable | private | unbindable |
- ------------------------------------------------------------------------
+ -----------------------------------------------------------------------
+ | |make-shared | make-slave | make-private |make-unbindab|
+ --------------|------------|--------------|--------------|-------------|
+ |shared |shared |*slave/private| private | unbindable |
+ | | | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |slave |shared | **slave | private | unbindable |
+ | |and slave | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |shared |shared | slave | private | unbindable |
+ |and slave |and slave | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |private |shared | **private | private | unbindable |
+ |-------------|------------|--------------|--------------|-------------|
+ |unbindable |shared |**unbindable | private | unbindable |
+ ------------------------------------------------------------------------
- * if the shared mount is the only mount in its peer group, making it
- slave, makes it private automatically. Note that there is no master to
- which it can be slaved to.
+ * if the shared mount is the only mount in its peer group, making it
+ a slave automatically makes it private, since there is no master
+ to which it can be slaved.
- ** slaving a non-shared mount has no effect on the mount.
+ ** slaving a non-shared mount has no effect on the mount.
- Apart from the commands listed below, the 'move' operation also changes
- the state of a mount depending on type of the destination mount. Its
- explained in section 5d.
+ Apart from the commands listed below, the 'move' operation also changes
+ the state of a mount depending on the type of the destination mount. It
+ is explained in section 5d.
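+
+ As an illustration, the transitions in the table can be walked through
+ with a command sequence like the following (a sketch; /mnt is a
+ hypothetical mountpoint)::
+
+    mount --bind /mnt /mnt          # create a mount to experiment on
+    mount --make-private /mnt       # -> private
+    mount --make-shared /mnt        # private -> shared
+    mount --make-slave /mnt         # only member of its peer group, so
+                                    # shared -> private (see * above)
+    mount --make-unbindable /mnt    # private -> unbindable
+    mount --make-shared /mnt        # unbindable -> shared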
-5b) Bind semantics
+b) Bind semantics
- Consider the following command::
+ Consider the following command::
- mount --bind A/a B/b
+ mount --bind A/a B/b
- where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
- is the destination mount and 'b' is the dentry in the destination mount.
+ where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
+ is the destination mount and 'b' is the dentry in the destination mount.
- The outcome depends on the type of mount of 'A' and 'B'. The table
- below contains quick reference::
+ The outcome depends on the type of mount of 'A' and 'B'. The table
+ below contains a quick reference::
- --------------------------------------------------------------------------
- | BIND MOUNT OPERATION |
- |************************************************************************|
- |source(A)->| shared | private | slave | unbindable |
- | dest(B) | | | | |
- | | | | | | |
- | v | | | | |
- |************************************************************************|
- | shared | shared | shared | shared & slave | invalid |
- | | | | | |
- |non-shared| shared | private | slave | invalid |
- **************************************************************************
+ --------------------------------------------------------------------------
+ | BIND MOUNT OPERATION |
+ |************************************************************************|
+ |source(A)->| shared | private | slave | unbindable |
+ | dest(B) | | | | |
+ | | | | | | |
+ | v | | | | |
+ |************************************************************************|
+ | shared | shared | shared | shared & slave | invalid |
+ | | | | | |
+ |non-shared| shared | private | slave | invalid |
+ **************************************************************************
- Details:
+ Details:
- 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
- which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
- mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
- are created and mounted at the dentry 'b' on all mounts where 'B'
- propagates to. A new propagation tree containing 'C1',..,'Cn' is
- created. This propagation tree is identical to the propagation tree of
- 'B'. And finally the peer-group of 'C' is merged with the peer group
- of 'A'.
+ 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C',
+ which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
+ are created and mounted at the dentry 'b' on all mounts that 'B'
+ propagates to. A new propagation tree containing 'C1',..,'Cn' is
+ created. This propagation tree is identical to the propagation tree of
+ 'B'. And finally the peer-group of 'C' is merged with the peer group
+ of 'A'.
- 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
- which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
- mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
- are created and mounted at the dentry 'b' on all mounts where 'B'
- propagates to. A new propagation tree is set containing all new mounts
- 'C', 'C1', .., 'Cn' with exactly the same configuration as the
- propagation tree for 'B'.
+ 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C',
+ which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
+ are created and mounted at the dentry 'b' on all mounts that 'B'
+ propagates to. A new propagation tree is set up containing all new mounts
+ 'C', 'C1', .., 'Cn' with exactly the same configuration as the
+ propagation tree for 'B'.
- 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
- mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
- 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
- 'C3' ... are created and mounted at the dentry 'b' on all mounts where
- 'B' propagates to. A new propagation tree containing the new mounts
- 'C','C1',.. 'Cn' is created. This propagation tree is identical to the
- propagation tree for 'B'. And finally the mount 'C' and its peer group
- is made the slave of mount 'Z'. In other words, mount 'C' is in the
- state 'slave and shared'.
-
- 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
- invalid operation.
-
- 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
- unbindable) mount. A new mount 'C' which is clone of 'A', is created.
- Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
-
- 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
- which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
- mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
- peer-group of 'A'.
-
- 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
- new mount 'C' which is a clone of 'A' is created. Its root dentry is
- 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
- slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
- 'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But
- mount/unmount on 'A' do not propagate anywhere else. Similarly
- mount/unmount on 'C' do not propagate anywhere else.
-
- 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
- invalid operation. A unbindable mount cannot be bind mounted.
-
-5c) Rbind semantics
-
- rbind is same as bind. Bind replicates the specified mount. Rbind
- replicates all the mounts in the tree belonging to the specified mount.
- Rbind mount is bind mount applied to all the mounts in the tree.
-
- If the source tree that is rbind has some unbindable mounts,
- then the subtree under the unbindable mount is pruned in the new
- location.
-
- eg:
-
- let's say we have the following mount tree::
-
- A
- / \
- B C
- / \ / \
- D E F G
-
- Let's say all the mount except the mount C in the tree are
- of a type other than unbindable.
-
- If this tree is rbound to say Z
-
- We will have the following tree at the new location::
-
- Z
- |
- A'
- /
- B' Note how the tree under C is pruned
- / \ in the new location.
- D' E'
-
-
-
-5d) Move semantics
-
- Consider the following command
-
- mount --move A B/b
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
+ mount 'C', which is a clone of 'A', is created. Its root dentry is 'a'.
+ 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
+ 'C3' ... are created and mounted at the dentry 'b' on all mounts that
+ 'B' propagates to. A new propagation tree containing the new mounts
+ 'C','C1',.. 'Cn' is created. This propagation tree is identical to the
+ propagation tree for 'B'. And finally mount 'C' and its peer group
+ are made slaves of mount 'Z'. In other words, mount 'C' is in the
+ state 'slave and shared'.
+
+ 4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
+ invalid operation.
+
+ 5. 'A' is a private mount and 'B' is a non-shared (private, slave or
+ unbindable) mount. A new mount 'C', which is a clone of 'A', is created.
+ Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
+
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
+ which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
+ peer-group of 'A'.
+
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
+ new mount 'C' which is a clone of 'A' is created. Its root dentry is
+ 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
+ slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
+ 'Z'. All mount/unmount events on 'Z' propagate to 'A' and 'C'. But
+ mount/unmount events on 'A' do not propagate anywhere else. Similarly
+ mount/unmount events on 'C' do not propagate anywhere else.
+
+ 8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is an
+ invalid operation. An unbindable mount cannot be bind mounted.
+
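+ For instance, case 1 above can be reproduced with a sequence like the
+ following (a sketch; all paths are hypothetical)::
+
+    mount --bind /A /A;  mount --make-shared /A   # 'A' is shared
+    mount --bind /B /B;  mount --make-shared /B   # 'B' is shared
+    mount --bind /A/a /B/b   # clone 'C' appears at 'b' on 'B' and on
+                             # every peer of 'B'; 'C' joins the peer
+                             # group of 'A'
+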
+c) Rbind semantics
+
+ rbind is the same as bind, except that bind replicates only the
+ specified mount, while rbind replicates all the mounts in the tree
+ belonging to the specified mount. In other words, rbind is the bind
+ operation applied to every mount in the tree.
+
+ If the source tree being rbound contains some unbindable mounts,
+ then the subtree under each unbindable mount is pruned in the new
+ location.
+
+ For example, let's say we have the following mount tree::
+
+ A
+ / \
+ B C
+ / \ / \
+ D E F G
+
+ Let's say all the mounts except mount C in the tree are
+ of a type other than unbindable.
+
+ If this tree is rbound to, say, Z, we will have the following tree at
+ the new location::
+
+ Z
+ |
+ A'
+ /
+ B' Note how the tree under C is pruned
+ / \ in the new location.
+ D' E'
+
+
+
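+ In command form, the pruning above can be sketched as follows
+ (hypothetical paths, with the tree rooted at /A)::
+
+    mount --make-unbindable /A/C   # 'C' will not be replicated
+    mount --rbind /A /Z            # the copy at Z omits everything
+                                   # under 'C'
+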
+d) Move semantics
+
+ Consider the following command::
+
+ mount --move A B/b
- where 'A' is the source mount, 'B' is the destination mount and 'b' is
- the dentry in the destination mount.
+ where 'A' is the source mount, 'B' is the destination mount and 'b' is
+ the dentry in the destination mount.
- The outcome depends on the type of the mount of 'A' and 'B'. The table
- below is a quick reference::
+ The outcome depends on the type of the mount of 'A' and 'B'. The table
+ below is a quick reference::
- ---------------------------------------------------------------------------
- | MOVE MOUNT OPERATION |
- |**************************************************************************
- | source(A)->| shared | private | slave | unbindable |
- | dest(B) | | | | |
- | | | | | | |
- | v | | | | |
- |**************************************************************************
- | shared | shared | shared |shared and slave| invalid |
- | | | | | |
- |non-shared| shared | private | slave | unbindable |
- ***************************************************************************
+ ---------------------------------------------------------------------------
+ | MOVE MOUNT OPERATION |
+ |**************************************************************************
+ | source(A)->| shared | private | slave | unbindable |
+ | dest(B) | | | | |
+ | | | | | | |
+ | v | | | | |
+ |**************************************************************************
+ | shared | shared | shared |shared and slave| invalid |
+ | | | | | |
+ |non-shared| shared | private | slave | unbindable |
+ ***************************************************************************
- .. Note:: moving a mount residing under a shared mount is invalid.
+ .. Note:: moving a mount residing under a shared mount is invalid.
- Details follow:
+ Details follow:
- 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
- mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
- are created and mounted at dentry 'b' on all mounts that receive
- propagation from mount 'B'. A new propagation tree is created in the
- exact same configuration as that of 'B'. This new propagation tree
- contains all the new mounts 'A1', 'A2'... 'An'. And this new
- propagation tree is appended to the already existing propagation tree
- of 'A'.
+ 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
+ are created and mounted at dentry 'b' on all mounts that receive
+ propagation from mount 'B'. A new propagation tree is created in the
+ exact same configuration as that of 'B'. This new propagation tree
+ contains all the new mounts 'A1', 'A2'... 'An'. And this new
+ propagation tree is appended to the already existing propagation tree
+ of 'A'.
- 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
- mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
- are created and mounted at dentry 'b' on all mounts that receive
- propagation from mount 'B'. The mount 'A' becomes a shared mount and a
- propagation tree is created which is identical to that of
- 'B'. This new propagation tree contains all the new mounts 'A1',
- 'A2'... 'An'.
+ 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'... 'An'
+ are created and mounted at dentry 'b' on all mounts that receive
+ propagation from mount 'B'. The mount 'A' becomes a shared mount and a
+ propagation tree is created which is identical to that of
+ 'B'. This new propagation tree contains all the new mounts 'A1',
+ 'A2'... 'An'.
- 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
- mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
- 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
- receive propagation from mount 'B'. A new propagation tree is created
- in the exact same configuration as that of 'B'. This new propagation
- tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
- propagation tree is appended to the already existing propagation tree of
- 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
- becomes 'shared'.
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
+ mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
+ 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
+ receive propagation from mount 'B'. A new propagation tree is created
+ in the exact same configuration as that of 'B'. This new propagation
+ tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
+ propagation tree is appended to the already existing propagation tree of
+ 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
+ becomes 'shared'.
- 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
- is invalid. Because mounting anything on the shared mount 'B' can
- create new mounts that get mounted on the mounts that receive
- propagation from 'B'. And since the mount 'A' is unbindable, cloning
- it to mount at other mountpoints is not possible.
+ 4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
+ is invalid, because mounting anything on the shared mount 'B' can
+ create new mounts that get mounted on the mounts that receive
+ propagation from 'B'. And since the mount 'A' is unbindable, cloning
+ it to mount at other mountpoints is not possible.
- 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
- unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
+ 5. 'A' is a private mount and 'B' is a non-shared (private, slave or
+ unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
- 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
- is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
- shared mount.
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
+ is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
+ shared mount.
- 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
- The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
- continues to be a slave mount of mount 'Z'.
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
+ The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
+ continues to be a slave mount of mount 'Z'.
- 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
- 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
- unbindable mount.
+ 8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
+ 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
+ unbindable mount.
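+
+ As an illustration of case 1 (hypothetical paths; note that the parent
+ of the source mount must not itself be shared, per the note above)::
+
+    mount --bind /A /A;  mount --make-shared /A   # 'A' is shared
+    mount --bind /B /B;  mount --make-shared /B   # 'B' is shared
+    mount --move /A /B/b   # 'A' appears at 'b' on 'B'; replicas of 'A'
+                           # appear at 'b' on every peer of 'B'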
-5e) Mount semantics
+e) Mount semantics
- Consider the following command::
+ Consider the following command::
- mount device B/b
+ mount device B/b
- 'B' is the destination mount and 'b' is the dentry in the destination
- mount.
+ 'B' is the destination mount and 'b' is the dentry in the destination
+ mount.
- The above operation is the same as bind operation with the exception
- that the source mount is always a private mount.
+ The above operation is the same as the bind operation, with the
+ exception that the source mount is always a private mount.
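+
+ For example, if 'B' is a shared mount (hypothetical device and path)::
+
+    mount /dev/sda2 /B/b   # the new mount also appears at 'b' on every
+                           # mount that 'B' propagates to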
-5f) Unmount semantics
+f) Unmount semantics
- Consider the following command::
+ Consider the following command::
- umount A
+ umount A
- where 'A' is a mount mounted on mount 'B' at dentry 'b'.
+ where 'A' is a mount mounted on mount 'B' at dentry 'b'.
- If mount 'B' is shared, then all most-recently-mounted mounts at dentry
- 'b' on mounts that receive propagation from mount 'B' and does not have
- sub-mounts within them are unmounted.
+ If mount 'B' is shared, then all most-recently-mounted mounts at dentry
+ 'b' on mounts that receive propagation from mount 'B' and do not have
+ sub-mounts within them are unmounted.
- Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
- each other.
+ Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
+ each other.
- let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
- 'B1', 'B2' and 'B3' respectively.
+ Let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mounts
+ 'B1', 'B2' and 'B3' respectively.
- let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
- mount 'B1', 'B2' and 'B3' respectively.
+ Let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
+ mounts 'B1', 'B2' and 'B3' respectively.
- if 'C1' is unmounted, all the mounts that are most-recently-mounted on
- 'B1' and on the mounts that 'B1' propagates-to are unmounted.
+ If 'C1' is unmounted, all the mounts that are most-recently-mounted on
+ 'B1' and on the mounts that 'B1' propagates-to are unmounted.
- 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
- on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
+ 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
+ on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
- So all 'C1', 'C2' and 'C3' should be unmounted.
+ So all 'C1', 'C2' and 'C3' should be unmounted.
- If any of 'C2' or 'C3' has some child mounts, then that mount is not
- unmounted, but all other mounts are unmounted. However if 'C1' is told
- to be unmounted and 'C1' has some sub-mounts, the umount operation is
- failed entirely.
+ If any of 'C2' or 'C3' has some child mounts, then that mount is not
+ unmounted, but all other mounts are unmounted. However, if 'C1' is told
+ to be unmounted and 'C1' has some sub-mounts, the umount operation
+ fails entirely.
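+
+ In command form, the scenario above boils down to (hypothetical paths)::
+
+    umount /B1/b   # unmounts 'C1'; propagation also unmounts 'C2' and
+                   # 'C3' on the peers 'B2' and 'B3'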
-5g) Clone Namespace
+g) Clone Namespace
- A cloned namespace contains all the mounts as that of the parent
- namespace.
+ A cloned namespace contains the same mounts as its parent
+ namespace.
- Let's say 'A' and 'B' are the corresponding mounts in the parent and the
- child namespace.
+ Let's say 'A' and 'B' are the corresponding mounts in the parent and the
+ child namespace.
- If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
- each other.
+ If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
+ each other.
- If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
- 'Z'.
+ If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
+ 'Z'.
- If 'A' is a private mount, then 'B' is a private mount too.
+ If 'A' is a private mount, then 'B' is a private mount too.
- If 'A' is unbindable mount, then 'B' is a unbindable mount too.
+ If 'A' is an unbindable mount, then 'B' is an unbindable mount too.
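+
+ This can be observed with util-linux's unshare(1). A sketch, assuming
+ /mnt is a mount point (note that unshare makes the cloned namespace's
+ mounts private by default, so propagation must explicitly be left
+ unchanged)::
+
+    mount --make-shared /mnt     # 'A' is shared in the parent
+    unshare --mount --propagation unchanged sh
+    # the copy of /mnt in the child ('B') is also shared, and mounts
+    # made under it propagate back to the parent
+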
6) Quiz
-------
- A. What is the result of the following command sequence?
+A. What is the result of the following command sequence?
- ::
+ ::
- mount --bind /mnt /mnt
- mount --make-shared /mnt
- mount --bind /mnt /tmp
- mount --move /tmp /mnt/1
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt /tmp
+ mount --move /tmp /mnt/1
- what should be the contents of /mnt /mnt/1 /mnt/1/1 should be?
- Should they all be identical? or should /mnt and /mnt/1 be
- identical only?
+ What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
+ Should they all be identical? Or should /mnt and /mnt/1 be
+ identical only?
- B. What is the result of the following command sequence?
+B. What is the result of the following command sequence?
- ::
+ ::
- mount --make-rshared /
- mkdir -p /v/1
- mount --rbind / /v/1
+ mount --make-rshared /
+ mkdir -p /v/1
+ mount --rbind / /v/1
- what should be the content of /v/1/v/1 be?
+ What should the content of /v/1/v/1 be?
- C. What is the result of the following command sequence?
+C. What is the result of the following command sequence?
- ::
+ ::
- mount --bind /mnt /mnt
- mount --make-shared /mnt
- mkdir -p /mnt/1/2/3 /mnt/1/test
- mount --bind /mnt/1 /tmp
- mount --make-slave /mnt
- mount --make-shared /mnt
- mount --bind /mnt/1/2 /tmp1
- mount --make-slave /mnt
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mkdir -p /mnt/1/2/3 /mnt/1/test
+ mount --bind /mnt/1 /tmp
+ mount --make-slave /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt/1/2 /tmp1
+ mount --make-slave /mnt
- At this point we have the first mount at /tmp and
- its root dentry is 1. Let's call this mount 'A'
- And then we have a second mount at /tmp1 with root
- dentry 2. Let's call this mount 'B'
- Next we have a third mount at /mnt with root dentry
- mnt. Let's call this mount 'C'
+ At this point we have the first mount at /tmp, and
+ its root dentry is 1. Let's call this mount 'A'.
+ Then we have a second mount at /tmp1 with root
+ dentry 2. Let's call this mount 'B'.
+ Next we have a third mount at /mnt with root dentry
+ mnt. Let's call this mount 'C'.
- 'B' is the slave of 'A' and 'C' is a slave of 'B'
- A -> B -> C
+ 'B' is the slave of 'A' and 'C' is a slave of 'B'
+ A -> B -> C
- at this point if we execute the following command
+ at this point if we execute the following command::
- mount --bind /bin /tmp/test
+ mount --bind /bin /tmp/test
- The mount is attempted on 'A'
+ The mount is attempted on 'A'
- will the mount propagate to 'B' and 'C' ?
+ Will the mount propagate to 'B' and 'C'?
- what would be the contents of
- /mnt/1/test be?
+ What would the contents of
+ /mnt/1/test be?
7) FAQ
------
- Q1. Why is bind mount needed? How is it different from symbolic links?
- symbolic links can get stale if the destination mount gets
- unmounted or moved. Bind mounts continue to exist even if the
- other mount is unmounted or moved.
+1. Why is a bind mount needed? How is it different from symbolic links?
- Q2. Why can't the shared subtree be implemented using exportfs?
+ Symbolic links can get stale if the destination mount gets
+ unmounted or moved. Bind mounts continue to exist even if the
+ other mount is unmounted or moved.
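+
+ A quick sketch of the difference, assuming the hypothetical paths and
+ link targets exist::
+
+    ln -s /mnt/disk/data /data-link
+    mount --bind /mnt/disk/data /data-bind
+    mount --move /mnt/disk /elsewhere   # /data-link now dangles, while
+                                        # /data-bind keeps working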
- exportfs is a heavyweight way of accomplishing part of what
- shared subtree can do. I cannot imagine a way to implement the
- semantics of slave mount using exportfs?
+2. Why can't the shared subtree be implemented using exportfs?
- Q3 Why is unbindable mount needed?
+ exportfs is a heavyweight way of accomplishing part of what
+ shared subtree can do. I cannot imagine a way to implement the
+ semantics of slave mount using exportfs.
- Let's say we want to replicate the mount tree at multiple
- locations within the same subtree.
+3. Why is an unbindable mount needed?
- if one rbind mounts a tree within the same subtree 'n' times
- the number of mounts created is an exponential function of 'n'.
- Having unbindable mount can help prune the unneeded bind
- mounts. Here is an example.
+ Let's say we want to replicate the mount tree at multiple
+ locations within the same subtree.
- step 1:
- let's say the root tree has just two directories with
- one vfsmount::
+ If one rbind mounts a tree within the same subtree 'n' times,
+ the number of mounts created grows faster than exponentially in 'n'.
+ Having unbindable mounts can help prune the unneeded bind
+ mounts. Here is an example.
- root
- / \
- tmp usr
+ step 1:
+ let's say the root tree has just two directories with
+ one vfsmount::
- And we want to replicate the tree at multiple
- mountpoints under /root/tmp
+ root
+ / \
+ tmp usr
- step 2:
- ::
+ And we want to replicate the tree at multiple
+ mountpoints under /root/tmp
+ step 2:
+ ::
- mount --make-shared /root
- mkdir -p /tmp/m1
+ mount --make-shared /root
- mount --rbind /root /tmp/m1
+ mkdir -p /tmp/m1
- the new tree now looks like this::
+ mount --rbind /root /tmp/m1
- root
- / \
- tmp usr
- /
- m1
- / \
- tmp usr
- /
- m1
+ the new tree now looks like this::
- it has two vfsmounts
+ root
+ / \
+ tmp usr
+ /
+ m1
+ / \
+ tmp usr
+ /
+ m1
- step 3:
- ::
+ it has two vfsmounts
- mkdir -p /tmp/m2
- mount --rbind /root /tmp/m2
+ step 3:
+ ::
- the new tree now looks like this::
+ mkdir -p /tmp/m2
+ mount --rbind /root /tmp/m2
- root
- / \
- tmp usr
- / \
- m1 m2
- / \ / \
- tmp usr tmp usr
- / \ /
- m1 m2 m1
- / \ / \
- tmp usr tmp usr
- / / \
- m1 m1 m2
- / \
- tmp usr
- / \
- m1 m2
+ the new tree now looks like this::
- it has 6 vfsmounts
+ root
+ / \
+ tmp usr
+ / \
+ m1 m2
+ / \ / \
+ tmp usr tmp usr
+ / \ /
+ m1 m2 m1
+ / \ / \
+ tmp usr tmp usr
+ / / \
+ m1 m1 m2
+ / \
+ tmp usr
+ / \
+ m1 m2
- step 4:
- ::
- mkdir -p /tmp/m3
- mount --rbind /root /tmp/m3
+ it has 6 vfsmounts
- I won't draw the tree..but it has 24 vfsmounts
+ step 4:
+ ::
+ mkdir -p /tmp/m3
+ mount --rbind /root /tmp/m3
- at step i the number of vfsmounts is V[i] = i*V[i-1].
- This is an exponential function. And this tree has way more
- mounts than what we really needed in the first place.
+ I won't draw the tree, but it has 24 vfsmounts.
- One could use a series of umount at each step to prune
- out the unneeded mounts. But there is a better solution.
- Unclonable mounts come in handy here.
- step 1:
- let's say the root tree has just two directories with
- one vfsmount::
+ At step i the number of vfsmounts is V[i] = i*V[i-1], i.e.
+ V[i] = i!, which grows even faster than an exponential function.
+ And this tree has way more mounts than we really needed in the
+ first place.
- root
- / \
- tmp usr
+ One could use a series of umounts at each step to prune
+ out the unneeded mounts. But there is a better solution:
+ unbindable mounts come in handy here.
- How do we set up the same tree at multiple locations under
- /root/tmp
+ step 1:
+ let's say the root tree has just two directories with
+ one vfsmount::
- step 2:
- ::
+ root
+ / \
+ tmp usr
+ How do we set up the same tree at multiple locations under
+ /root/tmp?
- mount --bind /root/tmp /root/tmp
+ step 2:
+ ::
- mount --make-rshared /root
- mount --make-unbindable /root/tmp
- mkdir -p /tmp/m1
+ mount --bind /root/tmp /root/tmp
- mount --rbind /root /tmp/m1
+ mount --make-rshared /root
+ mount --make-unbindable /root/tmp
- the new tree now looks like this::
+ mkdir -p /tmp/m1
- root
- / \
- tmp usr
- /
- m1
- / \
- tmp usr
+ mount --rbind /root /tmp/m1
- step 3:
- ::
+ the new tree now looks like this::
- mkdir -p /tmp/m2
- mount --rbind /root /tmp/m2
+ root
+ / \
+ tmp usr
+ /
+ m1
+ / \
+ tmp usr
- the new tree now looks like this::
+ step 3:
+ ::
- root
- / \
- tmp usr
- / \
- m1 m2
- / \ / \
- tmp usr tmp usr
+ mkdir -p /tmp/m2
+ mount --rbind /root /tmp/m2
- step 4:
- ::
+ the new tree now looks like this::
- mkdir -p /tmp/m3
- mount --rbind /root /tmp/m3
+ root
+ / \
+ tmp usr
+ / \
+ m1 m2
+ / \ / \
+ tmp usr tmp usr
- the new tree now looks like this::
+ step 4:
+ ::
- root
- / \
- tmp usr
- / \ \
- m1 m2 m3
- / \ / \ / \
- tmp usr tmp usr tmp usr
+ mkdir -p /tmp/m3
+ mount --rbind /root /tmp/m3
+
+ the new tree now looks like this::
+
+ root
+ / \
+ tmp usr
+ / \ \
+ m1 m2 m3
+ / \ / \ / \
+ tmp usr tmp usr tmp usr
8) Implementation
-----------------
-8A) Datastructure
+A) Datastructure
+
+ Several new fields are introduced to struct vfsmount:
+
+ ->mnt_share
+ Links together all the mounts to/from which this vfsmount
+ sends/receives propagation events.
- 4 new fields are introduced to struct vfsmount:
+ ->mnt_slave_list
+ Links all the mounts to which this vfsmount propagates.
- * ->mnt_share
- * ->mnt_slave_list
- * ->mnt_slave
- * ->mnt_master
+ ->mnt_slave
+ Links together all the slaves that its master vfsmount
+ propagates to.
- ->mnt_share
- links together all the mount to/from which this vfsmount
- send/receives propagation events.
+ ->mnt_master
+ Points to the master vfsmount from which this vfsmount
+ receives propagation.
- ->mnt_slave_list
- links all the mounts to which this vfsmount propagates
- to.
+ ->mnt_flags
+ Takes two more flags to indicate the propagation status of
+ the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
+ vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
+ replicated.
- ->mnt_slave
- links together all the slaves that its master vfsmount
- propagates to.
+ All the shared vfsmounts in a peer group form a cyclic list through
+ ->mnt_share.
- ->mnt_master
- points to the master vfsmount from which this vfsmount
- receives propagation.
+ All vfsmounts with the same ->mnt_master form a cyclic list anchored
+ in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
- ->mnt_flags
- takes two more flags to indicate the propagation status of
- the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
- vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
- replicated.
+ ->mnt_master can point to arbitrary (and possibly different) members
+ of the master peer group. To find all immediate slaves of a peer group
+ you need to go through _all_ ->mnt_slave_list of its members.
+ Conceptually it's just a single set - distribution among the
+ individual lists does not affect propagation or the way the propagation
+ tree is modified by operations.
- All the shared vfsmounts in a peer group form a cyclic list through
- ->mnt_share.
+ All vfsmounts in a peer group have the same ->mnt_master. If it is
+ non-NULL, they form a contiguous (ordered) segment of slave list.
- All vfsmounts with the same ->mnt_master form on a cyclic list anchored
- in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
+ An example propagation tree is shown in the figure below.
- ->mnt_master can point to arbitrary (and possibly different) members
- of master peer group. To find all immediate slaves of a peer group
- you need to go through _all_ ->mnt_slave_list of its members.
- Conceptually it's just a single set - distribution among the
- individual lists does not affect propagation or the way propagation
- tree is modified by operations.
+ .. note::
+ Though it looks like a forest, if we consider all the shared
+ mounts as a conceptual entity called 'pnode', it becomes a tree.
- All vfsmounts in a peer group have the same ->mnt_master. If it is
- non-NULL, they form a contiguous (ordered) segment of slave list.
+ ::
- A example propagation tree looks as shown in the figure below.
- [ NOTE: Though it looks like a forest, if we consider all the shared
- mounts as a conceptual entity called 'pnode', it becomes a tree]::
+ A <--> B <--> C <---> D
+ /|\ /| |\
+ / F G J K H I
+ /
+ E<-->K
+ /|\
+ M L N
- A <--> B <--> C <---> D
- /|\ /| |\
- / F G J K H I
- /
- E<-->K
- /|\
- M L N
+ In the above figure A, B, C and D are all shared and propagate to each
+ other. 'A' has 3 slave mounts 'E', 'F' and 'G'; 'C' has 2 slave
+ mounts 'J' and 'K'; and 'D' has two slave mounts 'H' and 'I'.
+ 'E' is also shared with 'K' and they propagate to each other. And
+ 'K' has 3 slaves 'M', 'L' and 'N'.
- In the above figure A,B,C and D all are shared and propagate to each
- other. 'A' has got 3 slave mounts 'E' 'F' and 'G' 'C' has got 2 slave
- mounts 'J' and 'K' and 'D' has got two slave mounts 'H' and 'I'.
- 'E' is also shared with 'K' and they propagate to each other. And
- 'K' has 3 slaves 'M', 'L' and 'N'
+ A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
- A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
+ A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
- A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
+ E's ->mnt_share links with ->mnt_share of K
- E's ->mnt_share links with ->mnt_share of K
+ 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
- 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
+ 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
- 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
+ K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
- K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
+ C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
- C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
+ J and K's ->mnt_master points to struct vfsmount of C
- J and K's ->mnt_master points to struct vfsmount of C
+ and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
- and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
+ 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
- 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
+ NOTE: The propagation tree is orthogonal to the mount tree.
- NOTE: The propagation tree is orthogonal to the mount tree.
+B) Locking:
-8B Locking:
+ ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
+ by namespace_sem (exclusive for modifications, shared for reading).
- ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
- by namespace_sem (exclusive for modifications, shared for reading).
+ Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
+ There are two exceptions: do_add_mount() and clone_mnt().
+ The former modifies a vfsmount that has not been visible in any shared
+ data structures yet.
+ The latter holds namespace_sem and the only references to vfsmount
+ are in lists that can't be traversed without namespace_sem.
- Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
- There are two exceptions: do_add_mount() and clone_mnt().
- The former modifies a vfsmount that has not been visible in any shared
- data structures yet.
- The latter holds namespace_sem and the only references to vfsmount
- are in lists that can't be traversed without namespace_sem.
+C) Algorithm:
-8C Algorithm:
+ The crux of the implementation resides in rbind/move operation.
- The crux of the implementation resides in rbind/move operation.
+ The overall algorithm breaks the operation into 3 phases: (look at
+ attach_recursive_mnt() and propagate_mnt())
- The overall algorithm breaks the operation into 3 phases: (look at
- attach_recursive_mnt() and propagate_mnt())
+ 1. Prepare phase.
- 1. prepare phase.
- 2. commit phases.
- 3. abort phases.
+ For each mount in the source tree:
- Prepare phase:
+ a) Create the necessary number of mount trees to
+ be attached to each of the mounts that receive
+ propagation from the destination mount.
+ b) Do not attach any of the trees to its destination yet;
+ however, note down its ->mnt_parent and ->mnt_mountpoint.
+ c) Link all the new mounts to form a propagation tree that
+ is identical to the propagation tree of the destination
+ mount.
- for each mount in the source tree:
+ If this phase is successful, there should be 'n' new
+ propagation trees, where 'n' is the number of mounts in the
+ source tree. Go to the commit phase.
- a) Create the necessary number of mount trees to
- be attached to each of the mounts that receive
- propagation from the destination mount.
- b) Do not attach any of the trees to its destination.
- However note down its ->mnt_parent and ->mnt_mountpoint
- c) Link all the new mounts to form a propagation tree that
- is identical to the propagation tree of the destination
- mount.
+ Also there should be 'm' new mount trees, where 'm' is
+ the number of mounts to which the destination mount
+ propagates.
- If this phase is successful, there should be 'n' new
- propagation trees; where 'n' is the number of mounts in the
- source tree. Go to the commit phase
+ If any memory allocations fail, go to the abort phase.
- Also there should be 'm' new mount trees, where 'm' is
- the number of mounts to which the destination mount
- propagates to.
+ 2. Commit phase.
- if any memory allocations fail, go to the abort phase.
+ Attach each of the mount trees to their corresponding
+ destination mounts.
- Commit phase
- attach each of the mount trees to their corresponding
- destination mounts.
+ 3. Abort phase.
- Abort phase
- delete all the newly created trees.
+ Delete all the newly created trees.
- .. Note::
- all the propagation related functionality resides in the file pnode.c
+ .. Note::
+ all the propagation related functionality resides in the file pnode.c
------------------------------------------------------------------------
diff --git a/Documentation/filesystems/smb/index.rst b/Documentation/filesystems/smb/index.rst
index 1c8597a679ab..6df23b0e45c8 100644
--- a/Documentation/filesystems/smb/index.rst
+++ b/Documentation/filesystems/smb/index.rst
@@ -8,3 +8,4 @@ CIFS
ksmbd
cifsroot
+ smbdirect
diff --git a/Documentation/filesystems/smb/ksmbd.rst b/Documentation/filesystems/smb/ksmbd.rst
index 7bed96d794fc..67cb68ea6e68 100644
--- a/Documentation/filesystems/smb/ksmbd.rst
+++ b/Documentation/filesystems/smb/ksmbd.rst
@@ -13,7 +13,7 @@ KSMBD architecture
The subset of performance related operations belong in kernelspace and
the other subset which belong to operations which are not really related with
performance in userspace. So, DCE/RPC management that has historically resulted
-into number of buffer overflow issues and dangerous security bugs and user
+into a number of buffer overflow issues and dangerous security bugs and user
account management are implemented in user space as ksmbd.mountd.
File operations that are related with performance (open/read/write/close etc.)
in kernel space (ksmbd). This also allows for easier integration with VFS
@@ -24,8 +24,8 @@ ksmbd (kernel daemon)
When the server daemon is started, It starts up a forker thread
(ksmbd/interface name) at initialization time and open a dedicated port 445
-for listening to SMB requests. Whenever new clients make request, Forker
-thread will accept the client connection and fork a new thread for dedicated
+for listening to SMB requests. Whenever new clients make a request, the Forker
+thread will accept the client connection and fork a new thread for a dedicated
communication channel between the client and the server. It allows for parallel
processing of SMB requests(commands) from clients as well as allowing for new
clients to make new connections. Each instance is named ksmbd/1~n(port number)
@@ -34,12 +34,12 @@ thread can decide to pass through the commands to the user space (ksmbd.mountd),
currently DCE/RPC commands are identified to be handled through the user space.
To further utilize the linux kernel, it has been chosen to process the commands
as workitems and to be executed in the handlers of the ksmbd-io kworker threads.
-It allows for multiplexing of the handlers as the kernel take care of initiating
+It allows for multiplexing of the handlers as the kernel takes care of initiating
extra worker threads if the load is increased and vice versa, if the load is
-decreased it destroys the extra worker threads. So, after connection is
-established with client. Dedicated ksmbd/1..n(port number) takes complete
+decreased it destroys the extra worker threads. So, after the connection is
+established with the client, the dedicated ksmbd/1..n(port number) takes complete
ownership of receiving/parsing of SMB commands. Each received command is worked
-in parallel i.e., There can be multiple clients commands which are worked in
+in parallel, i.e., there can be multiple client commands which are worked in
parallel. After receiving each command a separated kernel workitem is prepared
for each command which is further queued to be handled by ksmbd-io kworkers.
So, each SMB workitem is queued to the kworkers. This allows the benefit of load
@@ -49,9 +49,9 @@ performance by handling client commands in parallel.
ksmbd.mountd (user space daemon)
--------------------------------
-ksmbd.mountd is userspace process to, transfer user account and password that
+ksmbd.mountd is a userspace process that transfers the user account and password that
are registered using ksmbd.adduser (part of utils for user space). Further it
-allows sharing information parameters that parsed from smb.conf to ksmbd in
+passes the share configuration parameters that are parsed from smb.conf to ksmbd in
kernel. For the execution part it has a daemon which is continuously running
and connected to the kernel interface using netlink socket, it waits for the
requests (dcerpc and share/user info). It handles RPC calls (at a minimum few
@@ -73,15 +73,14 @@ Auto Negotiation Supported.
Compound Request Supported.
Oplock Cache Mechanism Supported.
SMB2 leases(v1 lease) Supported.
-Directory leases(v2 lease) Planned for future.
+Directory leases(v2 lease) Supported.
Multi-credits Supported.
NTLM/NTLMv2 Supported.
HMAC-SHA256 Signing Supported.
Secure negotiate Supported.
Signing Update Supported.
Pre-authentication integrity Supported.
-SMB3 encryption(CCM, GCM) Supported. (CCM and GCM128 supported, GCM256 in
- progress)
+SMB3 encryption(CCM, GCM) Supported. (CCM/GCM128 and CCM/GCM256 supported)
SMB direct(RDMA) Supported.
SMB3 Multi-channel Partially Supported. Planned to implement
replay/retry mechanisms for future.
@@ -112,6 +111,10 @@ DCE/RPC support Partially Supported. a few calls(NetShareEnumAll,
for Witness protocol e.g.)
ksmbd/nfsd interoperability Planned for future. The features that ksmbd
support are Leases, Notify, ACLs and Share modes.
+SMB3.1.1 Compression Planned for future.
+SMB3.1.1 over QUIC Planned for future.
+Signing/Encryption over RDMA Planned for future.
+SMB3.1.1 GMAC signing support Planned for future.
============================== =================================================
@@ -121,7 +124,7 @@ How to run
1. Download ksmbd-tools(https://github.com/cifsd-team/ksmbd-tools/releases) and
compile them.
- - Refer README(https://github.com/cifsd-team/ksmbd-tools/blob/master/README.md)
+ - Refer to README(https://github.com/cifsd-team/ksmbd-tools/blob/master/README.md)
to know how to use ksmbd.mountd/adduser/addshare/control utils
$ ./autogen.sh
@@ -130,7 +133,7 @@ How to run
2. Create /usr/local/etc/ksmbd/ksmbd.conf file, add SMB share in ksmbd.conf file.
- - Refer ksmbd.conf.example in ksmbd-utils, See ksmbd.conf manpage
+ - Refer to ksmbd.conf.example in ksmbd-utils, See ksmbd.conf manpage
for details to configure shares.
$ man ksmbd.conf
@@ -142,7 +145,7 @@ How to run
$ man ksmbd.adduser
$ sudo ksmbd.adduser -a <Enter USERNAME for SMB share access>
-4. Insert ksmbd.ko module after build your kernel. No need to load module
+4. Insert the ksmbd.ko module after you build your kernel. No need to load the module
if ksmbd is built into the kernel.
- Set ksmbd in menuconfig(e.g. $ make menuconfig)
@@ -172,7 +175,7 @@ Each layer
1. Enable all component prints
# sudo ksmbd.control -d "all"
-2. Enable one of components (smb, auth, vfs, oplock, ipc, conn, rdma)
+2. Enable one of the components (smb, auth, vfs, oplock, ipc, conn, rdma)
# sudo ksmbd.control -d "smb"
3. Show what prints are enabled.
diff --git a/Documentation/filesystems/smb/smbdirect.rst b/Documentation/filesystems/smb/smbdirect.rst
new file mode 100644
index 000000000000..ca6927c0b2c0
--- /dev/null
+++ b/Documentation/filesystems/smb/smbdirect.rst
@@ -0,0 +1,103 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+SMB Direct - SMB3 over RDMA
+===========================
+
+This document describes how to set up the Linux SMB client and server to
+use RDMA.
+
+Overview
+========
+The Linux SMB kernel client supports SMB Direct, which is a transport
+scheme for SMB3 that uses RDMA (Remote Direct Memory Access) to provide
+high throughput and low latencies by bypassing the traditional TCP/IP
+stack.
+SMB Direct on the Linux SMB client can be tested against KSMBD - a
+kernel-space SMB server.
+
+Installation
+=============
+- Install an RDMA device. As long as the RDMA device driver is supported
+ by the kernel, it should work. This includes both software emulators (soft
+ RoCE, soft iWARP) and hardware devices (InfiniBand, RoCE, iWARP).
+
+- Install a kernel with SMB Direct support. The first kernel release to
+ support SMB Direct on both the client and server side is 5.15. Therefore,
+ a distribution compatible with kernel 5.15 or later is required.
+
+- Install cifs-utils, which provides the `mount.cifs` command to mount SMB
+ shares.
+
+- Configure the RDMA stack
+
+ Make sure that your kernel configuration has RDMA support enabled. Under
+ Device Drivers -> Infiniband support, update the kernel configuration to
+ enable Infiniband support.
+
+ Enable the appropriate IB HCA support or iWARP adapter support,
+ depending on your hardware.
+
+ If you are using InfiniBand, enable IP-over-InfiniBand support.
+
+ For soft RDMA, enable either the soft iWARP (`RDMA_SIW`) or soft RoCE
+ (`RDMA_RXE`) module. Install the `iproute2` package and use the
+ `rdma link add` command to load the module and create an
+ RDMA interface.
+
+ e.g. if your local ethernet interface is `eth0`, you can use:
+
+ .. code-block:: bash
+
+ sudo rdma link add siw0 type siw netdev eth0
+
+- Enable SMB Direct support for both the server and the client in the kernel
+ configuration.
+
+ Server Setup
+
+ .. code-block:: text
+
+ Network File Systems --->
+ <M> SMB3 server support
+ [*] Support for SMB Direct protocol
+
+ Client Setup
+
+ .. code-block:: text
+
+ Network File Systems --->
+ <M> SMB3 and CIFS support (advanced network filesystem)
+ [*] SMB Direct support
+
+- Build and install the kernel. SMB Direct support will be enabled in the
+ cifs.ko and ksmbd.ko modules.
+
+Setup and Usage
+================
+
+- Set up and start a KSMBD server as described in the `KSMBD documentation
+ <https://www.kernel.org/doc/Documentation/filesystems/smb/ksmbd.rst>`_.
+ Also add the "server multi channel support = yes" parameter to ksmbd.conf.
+
+- On the client, mount the share with `rdma` mount option to use SMB Direct
+ (specify an SMB version 3.0 or higher using `vers`).
+
+ For example:
+
+ .. code-block:: bash
+
+ mount -t cifs //server/share /mnt/point -o vers=3.1.1,rdma
+
+- To verify that the mount is using SMB Direct, you can check dmesg for the
+ following log line after mounting:
+
+ .. code-block:: text
+
+ CIFS: VFS: RDMA transport established
+
+ Or, verify `rdma` mount option for the share in `/proc/mounts`:
+
+ .. code-block:: bash
+
+ cat /proc/mounts | grep cifs
diff --git a/Documentation/filesystems/squashfs.rst b/Documentation/filesystems/squashfs.rst
index df42106bae71..45653b3228f9 100644
--- a/Documentation/filesystems/squashfs.rst
+++ b/Documentation/filesystems/squashfs.rst
@@ -6,7 +6,7 @@ Squashfs 4.0 Filesystem
Squashfs is a compressed read-only filesystem for Linux.
-It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
+It uses zlib, lz4, lzo, xz or zstd compression to compress files, inodes and
directories. Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4K are supported up to a
maximum of 1Mbytes (default block size 128K).
@@ -16,8 +16,8 @@ use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.
-Mailing list: squashfs-devel@lists.sourceforge.net
-Web site: www.squashfs.org
+Mailing list (kernel code): linux-fsdevel@vger.kernel.org
+Web site: github.com/plougher/squashfs-tools
1. Filesystem Features
----------------------
@@ -58,11 +58,69 @@ inodes have different sizes).
As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems. This and other squashfs utilities
-can be obtained from http://www.squashfs.org. Usage instructions can be
-obtained from this site also.
+are very likely packaged by your Linux distribution (as squashfs-tools).
+The source code can be obtained from github.com/plougher/squashfs-tools.
+Usage instructions can also be obtained from this site.
-The squashfs-tools development tree is now located on kernel.org
- git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
+2.1 Mount options
+-----------------
+=================== =========================================================
+errors=%s Specify whether squashfs errors trigger a kernel panic
+ or not
+
+ ========== =============================================
+ continue errors don't trigger a panic (default)
+ panic trigger a panic when errors are encountered,
+ similar to several other filesystems (e.g.
+ btrfs, ext4, f2fs, GFS2, jfs, ntfs, ubifs)
+
+ This allows a kernel dump to be saved,
+ useful for analyzing and debugging the
+ corruption.
+ ========== =============================================
+threads=%s Select the decompression mode or the number of threads
+
+ If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is set:
+
+ ========== =============================================
+ single use single-threaded decompression (default)
+
+ Only one block (data or metadata) can be
+ decompressed at any one time. This limits
+ CPU and memory usage to a minimum, but it
+ also gives poor performance on parallel I/O
+ workloads when using multiple CPU machines
+ due to waiting on decompressor availability.
+ multi use up to two parallel decompressors per core
+
+ If you have a parallel I/O workload and your
+ system has enough memory, using this option
+ may improve overall I/O performance. It
+ dynamically allocates decompressors on a
+ demand basis.
+ percpu use a maximum of one decompressor per core
+
+ It uses percpu variables to ensure
+ decompression is load-balanced across the
+ cores.
+ 1|2|3|... configure the number of threads used for
+ decompression
+
+ The upper limit is num_online_cpus() * 2.
+ ========== =============================================
+
+ If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is **not** set and
+ SQUASHFS_DECOMP_MULTI, SQUASHFS_MOUNT_DECOMP_THREADS are
+ both set:
+
+ ========== =============================================
+ 2|3|... configure the number of threads used for
+ decompression
+
+ The upper limit is num_online_cpus() * 2.
+ ========== =============================================
+
+=================== =========================================================
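+
+For example, to mount an image with panic-on-error and multi-threaded
+decompression (a sketch; assumes a kernel built with
+SQUASHFS_CHOICE_DECOMP_BY_MOUNT and a hypothetical image path)::
+
+    mount -t squashfs -o loop,errors=panic,threads=multi image.squashfs /mnt
+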
3. Squashfs Filesystem Design
-----------------------------
diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst
index c32993bc83c7..2703c04af7d0 100644
--- a/Documentation/filesystems/sysfs.rst
+++ b/Documentation/filesystems/sysfs.rst
@@ -243,8 +243,8 @@ Other notes:
- show() methods should return the number of bytes printed into the
buffer.
-- show() should only use sysfs_emit() or sysfs_emit_at() when formatting
- the value to be returned to user space.
+- New implementations of show() methods should only use sysfs_emit() or
+ sysfs_emit_at() when formatting the value to be returned to user space.
- store() should return the number of bytes used from the buffer. If the
entire buffer has been used, just return the count argument.
@@ -299,7 +299,6 @@ The top level sysfs directory looks like::
hypervisor/
kernel/
module/
- net/
power/
devices/ contains a filesystem representation of the device tree. It maps
@@ -313,7 +312,7 @@ kernel. Each bus's directory contains two subdirectories::
drivers/
devices/ contains symlinks for each device discovered in the system
-that point to the device's directory under root/.
+that point to the device's directory under /sys/devices.
drivers/ contains a directory for each device driver that is loaded
for devices on that particular bus (this assumes that drivers do not
@@ -321,22 +320,36 @@ span multiple bus types).
fs/ contains a directory for some filesystems. Currently each
filesystem wanting to export attributes must create its own hierarchy
-below fs/ (see ./fuse.rst for an example).
+below fs/ (see fuse/fuse.rst for an example).
module/ contains parameter values and state information for all
loaded system modules, for both builtin and loadable modules.
dev/ contains two directories: char/ and block/. Inside these two
directories there are symlinks named <major>:<minor>. These symlinks
-point to the sysfs directory for the given device. /sys/dev provides a
+point to the directories under /sys/devices for each device. /sys/dev provides a
quick way to lookup the sysfs interface for a device from the result of
a stat(2) operation.
More information on driver-model specific features can be found in
Documentation/driver-api/driver-model/.
+block/ contains symlinks to all the block devices discovered on the system.
+These symlinks point to directories under /sys/devices.
-TODO: Finish this section.
+class/ contains a directory for each device class, grouped by functional type.
+Each directory in class/ contains symlinks to devices in the /sys/devices directory.
+
+firmware/ contains system firmware data and configuration such as firmware tables,
+ACPI information, and device tree data.
+
+hypervisor/ contains virtualization platform information and provides an interface to
+the underlying hypervisor. It is only present when running on a virtual machine.
+
+kernel/ contains runtime kernel parameters, configuration settings, and status.
+
+power/ contains power management subsystem information including
+sleep states, suspend/resume capabilities, and policies.
Current Interfaces
diff --git a/Documentation/filesystems/sysv-fs.rst b/Documentation/filesystems/sysv-fs.rst
deleted file mode 100644
index 89e40911ad7c..000000000000
--- a/Documentation/filesystems/sysv-fs.rst
+++ /dev/null
@@ -1,264 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==================
-SystemV Filesystem
-==================
-
-It implements all of
- - Xenix FS,
- - SystemV/386 FS,
- - Coherent FS.
-
-To install:
-
-* Answer the 'System V and Coherent filesystem support' question with 'y'
- when configuring the kernel.
-* To mount a disk or a partition, use::
-
- mount [-r] -t sysv device mountpoint
-
- The file system type names::
-
- -t sysv
- -t xenix
- -t coherent
-
- may be used interchangeably, but the last two will eventually disappear.
-
-Bugs in the present implementation:
-
-- Coherent FS:
-
- - The "free list interleave" n:m is currently ignored.
- - Only file systems with no filesystem name and no pack name are recognized.
- (See Coherent "man mkfs" for a description of these features.)
-
-- SystemV Release 2 FS:
-
- The superblock is only searched in the blocks 9, 15, 18, which
- corresponds to the beginning of track 1 on floppy disks. No support
- for this FS on hard disk yet.
-
-
-These filesystems are rather similar. Here is a comparison with Minix FS:
-
-* Linux fdisk reports on partitions
-
- - Minix FS 0x81 Linux/Minix
- - Xenix FS ??
- - SystemV FS ??
- - Coherent FS 0x08 AIX bootable
-
-* Size of a block or zone (data allocation unit on disk)
-
- - Minix FS 1024
- - Xenix FS 1024 (also 512 ??)
- - SystemV FS 1024 (also 512 and 2048)
- - Coherent FS 512
-
-* General layout: all have one boot block, one super block and
- separate areas for inodes and for directories/data.
- On SystemV Release 2 FS (e.g. Microport) the first track is reserved and
- all the block numbers (including the super block) are offset by one track.
-
-* Byte ordering of "short" (16 bit entities) on disk:
-
- - Minix FS little endian 0 1
- - Xenix FS little endian 0 1
- - SystemV FS little endian 0 1
- - Coherent FS little endian 0 1
-
- Of course, this affects only the file system, not the data of files on it!
-
-* Byte ordering of "long" (32 bit entities) on disk:
-
- - Minix FS little endian 0 1 2 3
- - Xenix FS little endian 0 1 2 3
- - SystemV FS little endian 0 1 2 3
- - Coherent FS PDP-11 2 3 0 1
-
- Of course, this affects only the file system, not the data of files on it!
-
-* Inode on disk: "short", 0 means non-existent, the root dir ino is:
-
- ================================= ==
- Minix FS 1
- Xenix FS, SystemV FS, Coherent FS 2
- ================================= ==
-
-* Maximum number of hard links to a file:
-
- =========== =========
- Minix FS 250
- Xenix FS ??
- SystemV FS ??
- Coherent FS >=10000
- =========== =========
-
-* Free inode management:
-
- - Minix FS
- a bitmap
- - Xenix FS, SystemV FS, Coherent FS
- There is a cache of a certain number of free inodes in the super-block.
- When it is exhausted, new free inodes are found using a linear search.
-
-* Free block management:
-
- - Minix FS
- a bitmap
- - Xenix FS, SystemV FS, Coherent FS
- Free blocks are organized in a "free list". Maybe a misleading term,
- since it is not true that every free block contains a pointer to
- the next free block. Rather, the free blocks are organized in chunks
- of limited size, and every now and then a free block contains pointers
- to the free blocks pertaining to the next chunk; the first of these
- contains pointers and so on. The list terminates with a "block number"
- 0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
-
-* Super-block location:
-
- =========== ==========================
- Minix FS block 1 = bytes 1024..2047
- Xenix FS block 1 = bytes 1024..2047
- SystemV FS bytes 512..1023
- Coherent FS block 1 = bytes 512..1023
- =========== ==========================
-
-* Super-block layout:
-
- - Minix FS::
-
- unsigned short s_ninodes;
- unsigned short s_nzones;
- unsigned short s_imap_blocks;
- unsigned short s_zmap_blocks;
- unsigned short s_firstdatazone;
- unsigned short s_log_zone_size;
- unsigned long s_max_size;
- unsigned short s_magic;
-
- - Xenix FS, SystemV FS, Coherent FS::
-
- unsigned short s_firstdatazone;
- unsigned long s_nzones;
- unsigned short s_fzone_count;
- unsigned long s_fzones[NICFREE];
- unsigned short s_finode_count;
- unsigned short s_finodes[NICINOD];
- char s_flock;
- char s_ilock;
- char s_modified;
- char s_rdonly;
- unsigned long s_time;
- short s_dinfo[4]; -- SystemV FS only
- unsigned long s_free_zones;
- unsigned short s_free_inodes;
- short s_dinfo[4]; -- Xenix FS only
- unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
- char s_fname[6];
- char s_fpack[6];
-
- then they differ considerably:
-
- Xenix FS::
-
- char s_clean;
- char s_fill[371];
- long s_magic;
- long s_type;
-
- SystemV FS::
-
- long s_fill[12 or 14];
- long s_state;
- long s_magic;
- long s_type;
-
- Coherent FS::
-
- unsigned long s_unique;
-
- Note that Coherent FS has no magic.
-
-* Inode layout:
-
- - Minix FS::
-
- unsigned short i_mode;
- unsigned short i_uid;
- unsigned long i_size;
- unsigned long i_time;
- unsigned char i_gid;
- unsigned char i_nlinks;
- unsigned short i_zone[7+1+1];
-
- - Xenix FS, SystemV FS, Coherent FS::
-
- unsigned short i_mode;
- unsigned short i_nlink;
- unsigned short i_uid;
- unsigned short i_gid;
- unsigned long i_size;
- unsigned char i_zone[3*(10+1+1+1)];
- unsigned long i_atime;
- unsigned long i_mtime;
- unsigned long i_ctime;
-
-
-* Regular file data blocks are organized as
-
- - Minix FS:
-
- - 7 direct blocks
- - 1 indirect block (pointers to blocks)
- - 1 double-indirect block (pointer to pointers to blocks)
-
- - Xenix FS, SystemV FS, Coherent FS:
-
- - 10 direct blocks
- - 1 indirect block (pointers to blocks)
- - 1 double-indirect block (pointer to pointers to blocks)
- - 1 triple-indirect block (pointer to pointers to pointers to blocks)
-
-
- =========== ========== ================
- Inode size inodes per block
- =========== ========== ================
- Minix FS 32 32
- Xenix FS 64 16
- SystemV FS 64 16
- Coherent FS 64 8
- =========== ========== ================
-
-* Directory entry on disk
-
- - Minix FS::
-
- unsigned short inode;
- char name[14/30];
-
- - Xenix FS, SystemV FS, Coherent FS::
-
- unsigned short inode;
- char name[14];
-
- =========== ============== =====================
- Dir entry size dir entries per block
- =========== ============== =====================
- Minix FS 16/32 64/32
- Xenix FS 16 64
- SystemV FS 16 64
- Coherent FS 16 32
- =========== ============== =====================
-
-* How to implement symbolic links such that the host fsck doesn't scream:
-
- - Minix FS normal
- - Xenix FS kludge: as regular files with chmod 1000
- - SystemV FS ??
- - Coherent FS kludge: as regular files with chmod 1000
-
-
-Notation: We often speak of a "block" but mean a zone (the allocation unit)
-and not the disk driver's notion of "block".
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 2cd8fa332feb..d677e0428c3f 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -21,8 +21,8 @@ explained further below, some of which can be reconfigured dynamically on the
fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
filesystem can be resized but it cannot be resized to a size below its current
usage. tmpfs also supports POSIX ACLs, and extended attributes for the
-trusted.* and security.* namespaces. ramfs does not use swap and you cannot
-modify any parameter for a ramfs filesystem. The size limit of a ramfs
+trusted.*, security.* and user.* namespaces. ramfs does not use swap and you
+cannot modify any parameter for a ramfs filesystem. The size limit of a ramfs
filesystem is how much memory you have available, and so care must be taken,
if used, so as not to run out of memory.
@@ -97,6 +97,9 @@ mount with such options, since it allows any user with write access to
use up all the memory on the machine; but enhances the scalability of
that instance in a system with many CPUs making intensive use of it.
+If nr_inodes is not 0, the limited space for inodes is also consumed by
+extended attributes: "df -i"'s IUsed and IUse% increase while IFree decreases.
+
tmpfs blocks may be swapped out, when there is a shortage of memory.
tmpfs has a mount option to disable its use of swap:
@@ -123,6 +126,37 @@ sysfs file /sys/kernel/mm/transparent_hugepage/shmem_enabled: which can
be used to deny huge pages on all tmpfs mounts in an emergency, or to
force huge pages on all tmpfs mounts for testing.
+tmpfs also supports quota with the following mount options:
+
+======================== =================================================
+quota                    User and group quota accounting and enforcement
+                         is enabled on the mount. tmpfs uses hidden
+                         system quota files that are initialized on mount.
+usrquota                 User quota accounting and enforcement is enabled
+                         on the mount.
+grpquota                 Group quota accounting and enforcement is enabled
+                         on the mount.
+usrquota_block_hardlimit Set global user quota block hard limit.
+usrquota_inode_hardlimit Set global user quota inode hard limit.
+grpquota_block_hardlimit Set global group quota block hard limit.
+grpquota_inode_hardlimit Set global group quota inode hard limit.
+======================== =================================================
+
+None of the quota related mount options can be set or changed on remount.
+
+Quota limit parameters accept a suffix k, m or g for kilo, mega and giga
+and can't be changed on remount. The default global quota limits take
+effect for every user/group/project except root the first time the quota
+entry for a user/group/project id is accessed - typically the first time
+an inode owned by that id is created after the mount. In other words,
+instead of being initialized to zero, the limits are initialized to the
+values provided with these mount options. As usual, the limits can still
+be changed for any user/group id at any time.
+
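+For example, a hypothetical mount enabling user and group quota
+enforcement with a default block hard limit of 1g per user could use::
+
+    mount -t tmpfs -o usrquota,grpquota,usrquota_block_hardlimit=1g tmpfs /mytmpfs
+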
+Note that tmpfs quotas do not support user namespaces, so no uid/gid
+translation is done if quotas are enabled inside user namespaces.
+
tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be
adjusted on the fly via 'mount -o remount ...'
@@ -207,6 +241,28 @@ So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you a tmpfs instance on /mytmpfs which can allocate 10GB
RAM/SWAP in 10240 inodes and it is only accessible by root.
+tmpfs has the following mounting options for case-insensitive lookup support:
+
+================= ==============================================================
+casefold          Enable casefold support at this mount point using the given
+                  argument as the encoding standard. Currently only UTF-8
+                  encodings are supported. If no argument is used, it will load
+                  the latest UTF-8 encoding available.
+strict_encoding   Enable strict encoding at this mount point (disabled by
+                  default). In this mode, the filesystem refuses to create
+                  files and directories with names containing invalid UTF-8
+                  characters.
+================= ==============================================================
+
+This option doesn't render the entire filesystem case-insensitive. One still
+needs to set the casefold flag per directory, by flipping the +F attribute on
+an empty directory. New directories created within a casefolded directory will
+inherit the attribute. The mountpoint itself cannot be made case-insensitive.
+
+Example::
+
+ $ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name /mytmpfs
+ $ mount -t tmpfs -o casefold fs_name /mytmpfs
+
:Author:
Christoph Rohland <cr@sap.com>, 1.12.01
@@ -216,3 +272,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
KOSAKI Motohiro, 16 Mar 2010
:Updated:
Chris Down, 13 July 2020
+:Updated:
+ André Almeida, 23 Aug 2024
diff --git a/Documentation/filesystems/ubifs-authentication.rst b/Documentation/filesystems/ubifs-authentication.rst
index 5210aed2afbc..106bb9c056f6 100644
--- a/Documentation/filesystems/ubifs-authentication.rst
+++ b/Documentation/filesystems/ubifs-authentication.rst
@@ -130,7 +130,7 @@ marked as dirty are written to the flash to update the persisted index.
Journal
~~~~~~~
-To avoid wearing out the flash, the index is only persisted (*commited*) when
+To avoid wearing out the flash, the index is only persisted (*committed*) when
certain conditions are met (eg. ``fsync(2)``). The journal is used to record
any changes (in form of inode nodes, data nodes etc.) between commits
of the index. During mount, the journal is read from the flash and replayed
@@ -443,6 +443,6 @@ References
[DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.rst
-[FSCRYPT-POLICY2] https://www.spinics.net/lists/linux-ext4/msg58710.html
+[FSCRYPT-POLICY2] https://lore.kernel.org/r/20171023214058.128121-1-ebiggers3@gmail.com/
[UBIFS-WP] http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf
diff --git a/Documentation/filesystems/vfat.rst b/Documentation/filesystems/vfat.rst
index 760a4d83fdf9..b289c4449cd0 100644
--- a/Documentation/filesystems/vfat.rst
+++ b/Documentation/filesystems/vfat.rst
@@ -50,7 +50,7 @@ VFAT MOUNT OPTIONS
Normally utime(2) checks current process is owner of
the file, or it has CAP_FOWNER capability. But FAT
filesystem doesn't have uid/gid on disk, so normal
- check is too unflexible. With this option you can
+ check is too inflexible. With this option you can
relax it.
**codepage=###**
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index cb2a97e49872..670ba66b60e4 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -209,31 +209,8 @@ method fills in is the "s_op" field. This is a pointer to a "struct
super_operations" which describes the next level of the filesystem
implementation.
-Usually, a filesystem uses one of the generic mount() implementations
-and provides a fill_super() callback instead. The generic variants are:
-
-``mount_bdev``
- mount a filesystem residing on a block device
-
-``mount_nodev``
- mount a filesystem that is not backed by a device
-
-``mount_single``
- mount a filesystem which shares the instance between all mounts
-
-A fill_super() callback implementation has the following arguments:
-
-``struct super_block *sb``
- the superblock structure. The callback must initialize this
- properly.
-
-``void *data``
- arbitrary mount options, usually comes as an ASCII string (see
- "Mount Options" section)
-
-``int silent``
- whether or not to be silent on error
-
+For more information on mounting (and the new mount API), see
+Documentation/filesystems/mount_api.rst.
The Superblock Object
=====================
@@ -260,9 +237,11 @@ filesystem. The following members are defined:
void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
- int (*freeze_super) (struct super_block *);
+ int (*freeze_super) (struct super_block *sb,
+ enum freeze_holder who);
int (*freeze_fs) (struct super_block *);
- int (*thaw_super) (struct super_block *);
+ int (*thaw_super) (struct super_block *sb,
+ enum freeze_holder who);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
@@ -325,11 +304,11 @@ or bottom half).
inode->i_lock spinlock held.
This method should be either NULL (normal UNIX filesystem
- semantics) or "generic_delete_inode" (for filesystems that do
+ semantics) or "inode_just_drop" (for filesystems that do
not want to cache inodes - causing "delete_inode" to always be
called regardless of the value of i_nlink)
- The "generic_delete_inode()" behavior is equivalent to the old
+ The "inode_just_drop()" behavior is equivalent to the old
practice of using "force_delete" in the put_inode() case, but
does not have the races that the "force_delete()" approach had.
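
As a sketch, a filesystem that never wants to cache inodes could set the
method as described above ("foofs" is an illustrative name, not a real
filesystem's)::

    static const struct super_operations foofs_super_ops = {
            /* ... other methods ... */
            .drop_inode = inode_just_drop,
    };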
@@ -435,7 +414,7 @@ field. This is a pointer to a "struct inode_operations" which describes
the methods that can be performed on individual inodes.
-struct xattr_handlers
+struct xattr_handler
---------------------
On filesystems that support extended attributes (xattrs), the s_xattr
@@ -493,7 +472,7 @@ As of kernel 2.6.22, the following members are defined:
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
- int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
+ struct dentry *(*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
@@ -513,8 +492,9 @@ As of kernel 2.6.22, the following members are defined:
struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
int (*fileattr_set)(struct mnt_idmap *idmap,
- struct dentry *dentry, struct fileattr *fa);
- int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+ struct dentry *dentry, struct file_kattr *fa);
+ int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa);
+ struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
};
Again, all methods are called without any locks being held, unless
@@ -559,7 +539,26 @@ otherwise noted.
``mkdir``
called by the mkdir(2) system call. Only required if you want
to support creating subdirectories. You will probably need to
- call d_instantiate() just as you would in the create() method
+ call d_instantiate_new() just as you would in the create() method.
+
+ If d_instantiate_new() is not used and if the fh_to_dentry()
+ export operation is provided, or if the storage might be
+ accessible by another path (e.g. with a network filesystem)
+ then more care may be needed. Importantly, d_instantiate()
+ should not be used with an inode that is no longer I_NEW if there is
+ any chance that the inode could already be attached to a dentry.
+ This is because of a hard rule in the VFS that a directory must
+ only ever have one dentry.
+
+ For example, if an NFS filesystem is mounted twice the new directory
+ could be visible on the other mount before it is on the original
+ mount, and a pair of name_to_handle_at(), open_by_handle_at()
+ calls could instantiate the directory inode with an IS_ROOT()
+ dentry before the first mkdir returns.
+
+ If there is any chance this could happen, then the new dentry
+ should be d_drop()ed and the inode attached with d_splice_alias().
+ The dentry returned by d_splice_alias() (if any) should in turn be
+ returned by ->mkdir(), as in the sketch below.
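+
+ A minimal sketch of such a ->mkdir() tail, assuming the filesystem
+ has already allocated and initialized the new directory inode
+ ("foofs" names are illustrative, not a real filesystem's)::
+
+     static struct dentry *foofs_mkdir(struct mnt_idmap *idmap,
+                                       struct inode *dir,
+                                       struct dentry *dentry,
+                                       umode_t mode)
+     {
+             struct inode *inode;
+
+             inode = foofs_new_dir_inode(dir, mode);  /* hypothetical */
+             if (IS_ERR(inode))
+                     return ERR_CAST(inode);
+
+             /*
+              * The directory inode may already have a dentry attached
+              * via another path; let d_splice_alias() find or attach
+              * the single allowed dentry for it.
+              */
+             d_drop(dentry);
+             return d_splice_alias(inode, dentry);
+     }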
``rmdir``
called by the rmdir(2) system call. Only required if you want
@@ -675,7 +674,10 @@ otherwise noted.
called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to
change miscellaneous file flags and attributes. Callers hold
i_rwsem exclusive. If unset, then fall back to f_op->ioctl().
-
+``get_offset_ctx``
+ called to get the offset context for a directory inode. A
+ filesystem must define this operation to use
+ simple_offset_dir_operations.
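+
+ A minimal sketch, assuming a hypothetical per-inode info structure
+ that embeds a ``struct offset_ctx`` (tmpfs does the equivalent)::
+
+     static struct offset_ctx *foofs_get_offset_ctx(struct inode *inode)
+     {
+             return &FOOFS_I(inode)->dir_offsets;  /* hypothetical */
+     }
+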
The Address Space Object
========================
@@ -691,9 +693,8 @@ page lookup by address, and keeping track of pages tagged as Dirty or
Writeback.
The first can be used independently of the others. The VM can try to
-either write dirty pages in order to clean them, or release clean pages
-in order to reuse them. To do this it can call the ->writepage method
-on dirty pages, and ->release_folio on clean folios with the private
+release clean pages in order to reuse them. To do this it can call
+->release_folio on clean folios with the private
flag set. Clean pages without PagePrivate and with no external references
will be released without notice being given to the address_space.
@@ -706,8 +707,8 @@ maintains information about the PG_Dirty and PG_Writeback status of each
page, so that pages with either of these flags can be found quickly.
The Dirty tag is primarily used by mpage_writepages - the default
-->writepages method. It uses the tag to find dirty pages to call
-->writepage on. If mpage_writepages is not used (i.e. the address
+->writepages method. It uses the tag to find dirty pages to
+write back. If mpage_writepages is not used (i.e. the address
provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is almost
unused. write_inode_now and sync_inode do use it (through
__sync_single_inode) to check if ->writepages has been successful in
@@ -731,23 +732,24 @@ pages, however the address_space has finer control of write sizes.
The read process essentially only requires 'read_folio'. The write
process is more complicated and uses write_begin/write_end or
-dirty_folio to write data into the address_space, and writepage and
+dirty_folio to write data into the address_space, and
writepages to writeback data to storage.
-Adding and removing pages to/from an address_space is protected by the
-inode's i_mutex.
+Removing pages from an address_space requires holding the inode's i_rwsem
+exclusively, while adding pages to the address_space requires holding the
+inode's i_mapping->invalidate_lock exclusively.
When data is written to a page, the PG_Dirty flag should be set. It
-typically remains set until writepage asks for it to be written. This
+typically remains set until writepages asks for it to be written. This
should clear PG_Dirty and set PG_Writeback. It can be actually written
at any point after PG_Dirty is clear. Once it is known to be safe,
PG_Writeback is cleared.
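
As an illustrative sketch only (not a complete implementation), a
->writepages method built on the writeback_iter() helper follows this
pattern::

    static int foofs_writepages(struct address_space *mapping,
                                struct writeback_control *wbc)
    {
            struct folio *folio = NULL;
            int error = 0;

            /*
             * writeback_iter() hands back locked folios with the
             * dirty flag already cleared.
             */
            while ((folio = writeback_iter(mapping, wbc, folio, &error))) {
                    folio_start_writeback(folio);
                    /* ... submit I/O for the folio here ... */
                    folio_unlock(folio);
                    /* folio_end_writeback() runs on I/O completion */
            }
            return error;
    }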
Writeback makes use of a writeback_control structure to direct the
-operations. This gives the writepage and writepages operations some
+operations. This gives the writepages operation some
information about the nature of and reason for the writeback request,
and the constraints under which it is being done. It is also used to
-return information back to the caller about the result of a writepage or
+return information back to the caller about the result of a
writepages request.
@@ -761,7 +763,7 @@ is an error during writeback, they expect that error to be reported when
a file sync request is made. After an error has been reported on one
request, subsequent requests on the same file descriptor should return
0, unless further writeback errors have occurred since the previous file
-syncronization.
+synchronization.
Ideally, the kernel would report errors only on file descriptions on
which writes were done that subsequently failed to be written back. The
@@ -794,17 +796,16 @@ cache in your filesystem. The following members are defined:
.. code-block:: c
struct address_space_operations {
- int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*read_folio)(struct file *, struct folio *);
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *);
void (*readahead)(struct readahead_control *);
- int (*write_begin)(struct file *, struct address_space *mapping,
+ int (*write_begin)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len,
- struct page **pagep, void **fsdata);
- int (*write_end)(struct file *, struct address_space *mapping,
+ struct folio **foliop, void **fsdata);
+ int (*write_end)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata);
+ struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio) (struct folio *, size_t start, size_t len);
bool (*release_folio)(struct folio *, gfp_t);
@@ -817,31 +818,12 @@ cache in your filesystem. The following members are defined:
bool (*is_partially_uptodate) (struct folio *, size_t from,
size_t count);
void (*is_dirty_writeback)(struct folio *, bool *, bool *);
- int (*error_remove_page) (struct mapping *mapping, struct page *page);
+ int (*error_remove_folio)(struct address_space *mapping, struct folio *);
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};
-``writepage``
- called by the VM to write a dirty page to backing store. This
- may happen for data integrity reasons (i.e. 'sync'), or to free
- up memory (flush). The difference can be seen in
- wbc->sync_mode. The PG_Dirty flag has been cleared and
- PageLocked is true. writepage should start writeout, should set
- PG_Writeback, and should make sure the page is unlocked, either
- synchronously or asynchronously when the write operation
- completes.
-
- If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
- try too hard if there are problems, and may choose to write out
- other pages from the mapping if that is easier (e.g. due to
- internal dependencies). If it chooses not to start writeout, it
- should return AOP_WRITEPAGE_ACTIVATE so that the VM will not
- keep calling ->writepage on that page.
-
- See the file "Locking" for more details.
-
``read_folio``
Called by the page cache to read a folio from the backing store.
The 'file' argument supplies authentication information to network
@@ -884,7 +866,7 @@ cache in your filesystem. The following members are defined:
given and that many pages should be written if possible. If no
->writepages is given, then mpage_writepages is used instead.
This will choose pages from the address space that are tagged as
- DIRTY and will pass them to ->writepage.
+ DIRTY and will write them back.
``dirty_folio``
called by the VM to mark a folio as dirty. This is particularly
@@ -907,8 +889,7 @@ cache in your filesystem. The following members are defined:
stop attempting I/O, it can simply return. The caller will
remove the remaining pages from the address space, unlock them
and decrement the page refcount. Set PageUptodate if the I/O
- completes successfully. Setting PageError on any page will be
- ignored; simply unlock the page if an I/O error occurs.
+ completes successfully.
``write_begin``
Called by the generic buffered write code to ask the filesystem
@@ -920,12 +901,12 @@ cache in your filesystem. The following members are defined:
(if they haven't been read already) so that the updated blocks
can be written out properly.
- The filesystem must return the locked pagecache page for the
- specified offset, in ``*pagep``, for the caller to write into.
+ The filesystem must return the locked pagecache folio for the
+ specified offset, in ``*foliop``, for the caller to write into.
It must be able to cope with short writes (where the length
passed to write_begin is greater than the number of bytes copied
- into the page).
+ into the folio).
A void * may be returned in fsdata, which then gets passed into
write_end.
@@ -938,8 +919,8 @@ cache in your filesystem. The following members are defined:
called. len is the original len passed to write_begin, and
copied is the amount that was able to be copied.
- The filesystem must take care of unlocking the page and
- releasing it refcount, and updating i_size.
+ The filesystem must take care of unlocking the folio,
+ decrementing its refcount, and updating i_size.
Returns < 0 on failure, otherwise the number of bytes (<=
'copied') that were able to be copied into pagecache.
@@ -1028,8 +1009,8 @@ cache in your filesystem. The following members are defined:
VM if a folio should be treated as dirty or writeback for the
purposes of stalling.
-``error_remove_page``
- normally set to generic_error_remove_page if truncation is ok
+``error_remove_folio``
+ normally set to generic_error_remove_folio if truncation is ok
for this address space. Used for memory failure handling.
Setting this implies you deal with pages going away under you,
unless you have them locked or reference counts increased.
@@ -1068,13 +1049,14 @@ This describes how the VFS can manipulate an open file. As of kernel
struct file_operations {
struct module *owner;
+ fop_flags_t fop_flags;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
- int (*iopoll)(struct kiocb *kiocb, bool spin);
- int (*iterate) (struct file *, struct dir_context *);
+ int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *,
+ unsigned int flags);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -1091,18 +1073,24 @@ This describes how the VFS can manipulate an open file. As of kernel
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
- int (*setlease)(struct file *, long, struct file_lock **, void **);
+ void (*splice_eof)(struct file *file);
+ int (*setlease)(struct file *, int, struct file_lease **, void **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len);
void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
#endif
- ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
+ ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
+ loff_t, size_t, unsigned int);
loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
+ int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
+ int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
+ unsigned int poll_flags);
+ int (*mmap_prepare)(struct vm_area_desc *);
};
Again, all methods are called without any locks being held, unless
@@ -1126,12 +1114,8 @@ otherwise noted.
``iopoll``
called when aio wants to poll for completions on HIPRI iocbs
-``iterate``
- called when the VFS needs to read the directory contents
-
``iterate_shared``
- called when the VFS needs to read the directory contents when
- filesystem supports concurrent dir iterators
+ called when the VFS needs to read the directory contents
``poll``
called by the VFS when a process wants to check if there is
@@ -1146,7 +1130,8 @@ otherwise noted.
used on 64 bit kernels.
``mmap``
- called by the mmap(2) system call
+ called by the mmap(2) system call. Deprecated in favour of
+ ``mmap_prepare``.
``open``
called by the VFS when an inode should be opened. When the VFS
@@ -1223,6 +1208,15 @@ otherwise noted.
``fadvise``
possibly called by the fadvise64() system call.
+``mmap_prepare``
+ Called by the mmap(2) system call. Allows a VFS to set up a
+ file-backed memory mapping, most notably establishing relevant
+ private state and VMA callbacks.
+
+ If further action such as pre-population of page tables is required,
+ this can be specified by the vm_area_desc->action field and related
+ parameters.
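+
+ A minimal sketch, assuming a filesystem-defined vm_operations_struct
+ ("foofs_vm_ops" is illustrative)::
+
+     static int foofs_mmap_prepare(struct vm_area_desc *desc)
+     {
+             /*
+              * Establish VMA callbacks and private state up front;
+              * the VMA itself is created by the caller afterwards.
+              */
+             desc->vm_ops = &foofs_vm_ops;
+             return 0;
+     }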
+
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
@@ -1251,7 +1245,8 @@ defined:
.. code-block:: c
struct dentry_operations {
- int (*d_revalidate)(struct dentry *, unsigned int);
+ int (*d_revalidate)(struct inode *, const struct qstr *,
+ struct dentry *, unsigned int);
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *);
int (*d_compare)(const struct dentry *,
@@ -1263,7 +1258,9 @@ defined:
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *);
+ struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
+ bool (*d_unalias_trylock)(const struct dentry *);
+ void (*d_unalias_unlock)(const struct dentry *);
};
``d_revalidate``
@@ -1389,9 +1386,7 @@ defined:
If a vfsmount is returned, the caller will attempt to mount it
on the mountpoint and will remove the vfsmount from its
- expiration list in the case of failure. The vfsmount should be
- returned with 2 refs on it to prevent automatic expiration - the
- caller will clean up the additional ref.
+ expiration list in the case of failure.
This function is only used if DCACHE_NEED_AUTOMOUNT is set on
the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is
@@ -1418,16 +1413,33 @@ defined:
the dentry being transited from.
``d_real``
- overlay/union type filesystems implement this method to return
- one of the underlying dentries hidden by the overlay. It is
- used in two different modes:
+ overlay/union type filesystems implement this method to return one
+ of the underlying dentries of a regular file hidden by the overlay.
+
+ The 'type' argument takes the values D_REAL_DATA or D_REAL_METADATA
+ for returning the real underlying dentry that refers to the inode
+ hosting the file's data or metadata respectively.
+
+ For non-regular files, the 'dentry' argument is returned.
+
+``d_unalias_trylock``
+ if present, will be called by d_splice_alias() before moving a
+ preexisting attached alias. Returning false prevents __d_move(),
+ making d_splice_alias() fail with -ESTALE.
+
+ Rationale: setting FS_RENAME_DOES_D_MOVE will prevent d_move()
+ and d_exchange() calls from the outside of filesystem methods;
+ however, it does not guarantee that attached dentries won't
+ be renamed or moved by d_splice_alias() finding a preexisting
+ alias for a directory inode. Normally we would not care;
+ however, something that wants to stabilize the entire path to
+ root over a blocking operation might need that. See 9p for one
+ (and hopefully the only) example.
- Called from file_dentry() it returns the real dentry matching
- the inode argument. The real dentry may be from a lower layer
- already copied up, but still referenced from the file. This
- mode is selected with a non-NULL inode argument.
+``d_unalias_unlock``
+ should be paired with ``d_unalias_trylock``; it is called after
+ the __d_move() call in __d_unalias().
- With NULL inode the topmost real underlying dentry is returned.
Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst
new file mode 100644
index 000000000000..ab66c57a5d18
--- /dev/null
+++ b/Documentation/filesystems/xfs/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+XFS Filesystem Documentation
+============================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ xfs-delayed-logging-design
+ xfs-maintainer-entry-profile
+ xfs-self-describing-metadata
+ xfs-online-fsck-design
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
index 6402ab8e370c..2a2705e975e8 100644
--- a/Documentation/filesystems/xfs-delayed-logging-design.rst
+++ b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
@@ -219,7 +219,7 @@ The log is circular, so the positions in the log are defined by the combination
of a cycle number - the number of times the log has been overwritten - and the
offset into the log. A LSN carries the cycle in the upper 32 bits and the
offset in the lower 32 bits. The offset is in units of "basic blocks" (512
-bytes). Hence we can do realtively simple LSN based math to keep track of
+bytes). Hence we can do relatively simple LSN based math to keep track of
available space in the log.
Log space accounting is done via a pair of constructs called "grant heads". The
diff --git a/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
new file mode 100644
index 000000000000..ce4584fb3103
--- /dev/null
+++ b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
@@ -0,0 +1,194 @@
+XFS Maintainer Entry Profile
+============================
+
+Overview
+--------
+XFS is a well-known, high-performance filesystem in the Linux kernel.
+The aim of this project is to provide and maintain a robust and
+performant filesystem.
+
+Patches are generally merged to the for-next branch of the appropriate
+git repository.
+After a testing period, the for-next branch is merged to the master
+branch.
+
+Kernel code is merged to the xfs-linux tree[0].
+Userspace code is merged to the xfsprogs tree[1].
+Test cases are merged to the xfstests tree[2].
+Ondisk format documentation is merged to the xfs-documentation tree[3].
+
+All patchsets involving XFS *must* be cc'd in their entirety to the mailing
+list linux-xfs@vger.kernel.org.
+
+Roles
+-----
+There are eight key roles in the XFS project.
+A person can take on multiple roles, and a role can be filled by
+multiple people.
+Anyone taking on a role is advised to check in with themselves and
+others on a regular basis about burnout.
+
+- **Outside Contributor**: Anyone who sends a patch but is not involved
+ in the XFS project on a regular basis.
+ These folks are usually people who work on other filesystems or
+ elsewhere in the kernel community.
+
+- **Developer**: Someone who is familiar with the XFS codebase enough to
+ write new code, documentation, and tests.
+
+ Developers can often be found in the IRC channel mentioned by the ``C:``
+ entry in the kernel MAINTAINERS file.
+
+- **Senior Developer**: A developer who is very familiar with at least
+ some part of the XFS codebase and/or other subsystems in the kernel.
+ These people collectively decide the long term goals of the project
+ and nudge the community in that direction.
+ They should help prioritize development and review work for each release
+ cycle.
+
+ Senior developers tend to be more active participants in the IRC channel.
+
+- **Reviewer**: Someone (most likely also a developer) who reads code
+ submissions to decide:
+
+ 0. Is the idea behind the contribution sound?
+ 1. Does the idea fit the goals of the project?
+ 2. Is the contribution designed correctly?
+ 3. Is the contribution polished?
+ 4. Can the contribution be tested effectively?
+
+ Reviewers should identify themselves with an ``R:`` entry in the kernel
+ and fstests MAINTAINERS files.
+
+- **Testing Lead**: This person is responsible for setting the test
+ coverage goals of the project, negotiating with developers to decide
+ on new tests for new features, and making sure that developers and
+ release managers execute on the testing.
+
+ The testing lead should identify themselves with an ``M:`` entry in
+ the XFS section of the fstests MAINTAINERS file.
+
+- **Bug Triager**: Someone who examines incoming bug reports in just
+ enough detail to identify the person to whom the report should be
+ forwarded.
+
+ The bug triagers should identify themselves with a ``B:`` entry in
+ the kernel MAINTAINERS file.
+
+- **Release Manager**: This person merges reviewed patchsets into an
+ integration branch, tests the result locally, pushes the branch to a
+ public git repository, and sends pull requests further upstream.
+ The release manager is not expected to work on new feature patchsets.
+ If a developer and a reviewer fail to reach a resolution on some point,
+ the release manager must have the ability to intervene to try to drive a
+ resolution.
+
+ The release manager should identify themselves with an ``M:`` entry in
+ the kernel MAINTAINERS file.
+
+- **Community Manager**: This person calls and moderates meetings of as many
+ XFS participants as they can get when mailing list discussions prove
+ insufficient for collective decision-making.
+ They may also serve as liaison between managers of the organizations
+ sponsoring work on any part of XFS.
+
+- **LTS Maintainer**: Someone who backports and tests bug fixes from
+ upstream to the LTS kernels.
+ There tend to be six separate LTS trees at any given time.
+
+ The maintainer for a given LTS release should identify themselves with an
+ ``M:`` entry in the MAINTAINERS file for that LTS tree.
+ Unmaintained LTS kernels should be marked with status ``S: Orphan`` in that
+ same file.
+
+Submission Checklist Addendum
+-----------------------------
+Please follow these additional rules when submitting to XFS:
+
+- Patches affecting only the filesystem itself should be based against
+ the latest -rc or the for-next branch.
+ These patches will be merged back to the for-next branch.
+
+- Authors of patches touching other subsystems need to coordinate with
+ the maintainers of XFS and the relevant subsystems to decide how to
+ proceed with a merge.
+
+- Any patchset changing XFS should be cc'd in its entirety to linux-xfs.
+ Do not send partial patchsets; that makes analysis of the broader
+ context of the changes unnecessarily difficult.
+
+- Anyone making kernel changes that have corresponding changes to the
+ userspace utilities should send the userspace changes as separate
+ patchsets immediately after the kernel patchsets.
+
+- Authors of bug fix patches are expected to use fstests[2] to perform
+ an A/B test of the patch to determine that there are no regressions.
+ When possible, a new regression test case should be written for
+ fstests.
+
+- Authors of new feature patchsets must ensure that fstests will have
+ appropriate functional and input corner-case test cases for the new
+ feature.
+
+- When implementing a new feature, it is strongly suggested that the
+ developers write a design document to answer the following questions:
+
+ * **What** problem is this trying to solve?
+
+ * **Who** will benefit from this solution, and **where** will they
+ access it?
+
+ * **How** will this new feature work? This should touch on major data
+ structures and algorithms supporting the solution at a higher level
+ than code comments.
+
+ * **What** userspace interfaces are necessary to build off of the new
+ features?
+
+ * **How** will this work be tested to ensure that it solves the
+ problems laid out in the design document without causing new
+ problems?
+
+ The design document should be committed in the kernel documentation
+ directory.
+ It may be omitted if the feature is already well known to the
+ community.
+
+- Patchsets for the new tests should be submitted as separate patchsets
+ immediately after the kernel and userspace code patchsets.
+
+- Changes to the on-disk format of XFS must be described in the ondisk
+ format document[3] and submitted as a patchset after the fstests
+ patchsets.
+
+- Patchsets implementing bug fixes and further code cleanups should put
+ the bug fixes at the beginning of the series to ease backporting.
+
+Key Release Cycle Dates
+-----------------------
+Bug fixes may be sent at any time, though the release manager may decide to
+defer a patch when the next merge window is close.
+
+Code submissions targeting the next merge window should be sent between
+-rc1 and -rc6.
+This gives the community time to review the changes, to suggest other changes,
+and for the author to retest those changes.
+
+Code submissions also requiring changes to fs/iomap and targeting the
+next merge window should be sent between -rc1 and -rc4.
+This allows the broader kernel community adequate time to test the
+infrastructure changes.
+
+Review Cadence
+--------------
+In general, please wait at least one week before pinging for feedback.
+To find reviewers, either consult the MAINTAINERS file, or ask
+developers that have Reviewed-by tags for XFS changes to take a look and
+offer their opinion.
+
+References
+----------
+| [0] https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/
+| [1] https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/
+| [2] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
+| [3] https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/
diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 791ab264b77e..3d9233f403db 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -105,10 +105,8 @@ occur; this capability aids both strategies.
TLDR; Show Me the Code!
-----------------------
-Code is posted to the kernel.org git trees as follows:
-`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
-`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
-`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
+Kernel and userspace code has been fully merged as of October 2025.
+
Each kernel patchset adding an online repair function will use the same branch
name across the kernel, xfsprogs, and fstests git repos.
@@ -249,7 +247,7 @@ sharing and lock acquisition rules as the regular filesystem.
This means that scrub cannot take *any* shortcuts to save time, because doing
so could lead to concurrency problems.
In other words, online fsck is not a complete replacement for offline fsck, and
-a complete run of online fsck may take longer than online fsck.
+a complete run of online fsck may take longer than offline fsck.
However, both of these limitations are acceptable tradeoffs to satisfy the
different motivations of online fsck, which are to **minimize system downtime**
and to **increase predictability of operation**.
@@ -293,7 +291,7 @@ The seven phases are as follows:
Before starting repairs, the summary counters are checked and any necessary
repairs are performed so that subsequent repairs will not fail the resource
reservation step due to wildly incorrect summary counters.
- Unsuccesful repairs are requeued as long as forward progress on repairs is
+ Unsuccessful repairs are requeued as long as forward progress on repairs is
made somewhere in the filesystem.
Free space in the filesystem is trimmed at the end of phase 4 if the
filesystem is clean.
@@ -454,7 +452,7 @@ filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used to
write a new ondisk structure, and the repairs are committed atomically.
-The hooks are disabled and the staging staging area is freed.
+The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure are carefully reaped.
Introducing concurrency helps online repair avoid various locking problems, but
@@ -475,7 +473,7 @@ operation, which may cause application failure or an unplanned filesystem
shutdown.
Inspiration for the secondary metadata repair strategy was drawn from section
-2.4 of Srinivasan above, and sections 2 ("NSF: Inded Build Without Side-File")
+2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
Creating Indexes for Very Large Tables Without Quiescing Updates"
<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
@@ -542,7 +540,7 @@ ondisk structure.
Inspiration for quota and file link count repair strategies were drawn from
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
-Maintenace") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
+Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
and Their Indexes"
<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
@@ -605,7 +603,7 @@ functionality.
The cron job does not have this protection.
- **Fuzz Kiddiez**: There are many people now who seem to think that running
- automated fuzz testing of ondisk artifacts to find mischevious behavior and
+ automated fuzz testing of ondisk artifacts to find mischievous behavior and
spraying exploit code onto the public mailing list for instant zero-day
disclosure is somehow of some social benefit.
In the view of this author, the benefit is realized only when the fuzz
@@ -764,12 +762,8 @@ allow the online fsck developers to compare online fsck against offline fsck,
and they enable XFS developers to find deficiencies in the code base.
Proposed patchsets include
-`general fuzzer improvements
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
`fuzzing baselines
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
-and `improvements in fuzz testing comprehensiveness
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_.
Stress Testing
--------------
@@ -801,11 +795,6 @@ Success is defined by the ability to run all of these tests without observing
any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
check warnings, or any other sort of mischief.
-Proposed patchsets include `general stress testing
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
-and the `evolution of existing per-function stress testing
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
-
4. User Interface
=================
@@ -886,10 +875,6 @@ apply as nice of a priority to IO and CPU scheduling as possible.
This measure was taken to minimize delays in the rest of the filesystem.
No such hardening has been performed for the cron job.
-Proposed patchset:
-`Enabling the xfs_scrub background service
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
-
Health Reporting
----------------
@@ -912,13 +897,6 @@ notifications and initiate a repair?
*Answer*: These questions remain unanswered, but should be a part of the
conversation with early adopters and potential downstream users of XFS.
-Proposed patchsets include
-`wiring up health reports to correction returns
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
-and
-`preservation of sickness info during memory reclaim
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
-
5. Kernel Algorithms and Data Structures
========================================
@@ -962,7 +940,7 @@ disk, but these buffer verifiers cannot provide any consistency checking
between metadata structures.
For more information, please see the documentation for
-Documentation/filesystems/xfs-self-describing-metadata.rst
+Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
Reverse Mapping
---------------
@@ -1310,21 +1288,6 @@ Space allocation records are cross-referenced as follows:
are there the same number of reverse mapping records for each block as the
reference count record claims?
-Proposed patchsets are the series to find gaps in
-`refcount btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
-`inode btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
-`rmap btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
-to find
-`mergeable records
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
-and to
-`improve cross referencing with rmap
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
-before starting a repair.
-
Checking Extended Attributes
````````````````````````````
@@ -1351,7 +1314,7 @@ If the leaf information exceeds a single filesystem block, a dabtree (also
rooted at block 0) is created to map hashes of the attribute names to leaf
blocks in the attr fork.
-Checking an extended attribute structure is not so straightfoward due to the
+Checking an extended attribute structure is not so straightforward due to the
lack of separation between attr blocks and index blocks.
Scrub must read each block mapped by the attr fork and ignore the non-leaf
blocks:
@@ -1401,7 +1364,7 @@ If the free space has been separated and the second partition grows again
beyond one block, then a dabtree is used to map hashes of dirent names to
directory data blocks.
-Checking a directory is pretty straightfoward:
+Checking a directory is pretty straightforward:
1. Walk the dabtree in the second partition (if present) to ensure that there
are no irregularities in the blocks or dabtree mappings that do not point to
@@ -1524,7 +1487,7 @@ Only online fsck has this requirement of total consistency of AG metadata, and
should be relatively rare as compared to filesystem change operations.
Online fsck coordinates with transaction chains as follows:
-* For each AG, maintain a count of intent items targetting that AG.
+* For each AG, maintain a count of intent items targeting that AG.
The count should be bumped whenever a new item is added to the chain.
The count should be dropped when the filesystem has locked the AG header
buffers and finished the work.
@@ -1585,7 +1548,7 @@ The transaction sequence looks like this:
2. The second transaction contains a physical update to the free space btrees
of AG 3 to release the former BMBT block and a second physical update to the
free space btrees of AG 7 to release the unmapped file space.
- Observe that the the physical updates are resequenced in the correct order
+ Observe that the physical updates are resequenced in the correct order
when possible.
Attached to the transaction is an extent free done (EFD) log item.
The EFD contains a pointer to the EFI logged in transaction #1 so that log
@@ -1756,10 +1719,6 @@ For scrub, the drain works as follows:
To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
be woken up whenever the intent count drops to zero.
-The proposed patchset is the
-`scrub intent drain series
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
-
.. _jump_labels:
Static Keys (aka Jump Label Patching)
@@ -1915,19 +1874,13 @@ four of those five higher level data structures.
The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
study.
-The most general storage interface supported by the xfile enables the reading
-and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
-This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
-which behave similarly to their userspace counterparts.
XFS is very record-based, which suggests that the ability to load and store
complete records is important.
-To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
-functions are provided to read and persist objects into an xfile.
-They are internally the same as pread and pwrite, except that they treat any
-error as an out of memory error.
-For online repair, squashing error conditions in this manner is an acceptable
-behavior because the only reaction is to abort the operation back to userspace.
-All five xfile usecases can be serviced by these four functions.
+To support these cases, a pair of ``xfile_load`` and ``xfile_store``
+functions are provided to read and persist objects into an xfile; they treat
+any error as an out of memory error. For online repair, squashing error
+conditions in this manner is an acceptable behavior because the only reaction
+is to abort the operation back to userspace.
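+
+A sketch of the load/store idiom, assuming an already-created xfile ``xf``
+and a fixed-size record (names are illustrative, and the helpers are
+assumed to take a buffer, a length and an offset)::
+
+	/* persist one record at its record-sized offset */
+	error = xfile_store(xf, &rec, sizeof(rec), pos);
+	if (error)
+		return error;
+
+	/* ... later, read the record back for the repair ... */
+	error = xfile_load(xf, &rec, sizeof(rec), pos);
+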
However, no discussion of file access idioms is complete without answering the
question, "But what about mmap?"
@@ -1939,15 +1892,14 @@ tmpfs can only push a pagecache folio to the swap cache if the folio is neither
pinned nor locked, which means the xfile must not pin too many folios.
Short term direct access to xfile contents is done by locking the pagecache
-folio and mapping it into kernel address space.
-Programmatic access (e.g. pread and pwrite) uses this mechanism.
-Folio locks are not supposed to be held for long periods of time, so long
-term direct access to xfile contents is done by bumping the folio refcount,
+folio and mapping it into kernel address space. Object load and store use this
+mechanism. Folio locks are not supposed to be held for long periods of time, so
+long term direct access to xfile contents is done by bumping the folio refcount,
mapping it into kernel address space, and dropping the folio lock.
These long term users *must* be responsive to memory reclaim by hooking into
the shrinker infrastructure to know when to release folios.
-The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
+The ``xfile_get_folio`` and ``xfile_put_folio`` functions are provided to
retrieve the (locked) folio that backs part of an xfile and to release it.
The only code to use these folio lease functions are the xfarray
:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
@@ -2043,10 +1995,6 @@ The ``xfarray_store_anywhere`` function is used to insert a record in any
null record slot in the bag; and the ``xfarray_unset`` function removes a
record from the bag.
-The proposed patchset is the
-`big in-memory array
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
-
Iterating Array Elements
^^^^^^^^^^^^^^^^^^^^^^^^
@@ -2102,7 +2050,7 @@ quicksort and a heapsort subalgorithm in the spirit of
kernel.
To sort records in a reasonably short amount of time, ``xfarray`` takes
advantage of the binary subpartitioning offered by quicksort, but it also uses
-heapsort to hedge aginst performance collapse if the chosen quicksort pivots
+heapsort to hedge against performance collapse if the chosen quicksort pivots
are poor.
Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
gulf between the two implementations.
@@ -2174,15 +2122,11 @@ The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
function frees them all because compaction is not needed.
The details of repairing directories and extended attributes will be discussed
-in a subsequent section about atomic extent swapping.
+in a subsequent section about atomic file content exchanges.
However, it should be noted that these repair functions only use blob storage
to cache a small number of entries before adding them to a temporary ondisk
file, which is why compaction is not required.
-The proposed patchset is at the start of the
-`extended attribute repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
-
.. _xfbtree:
In-Memory B+Trees
@@ -2192,7 +2136,7 @@ The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
checking and repairing of secondary metadata commonly requires coordination
between a live metadata scan of the filesystem and writer threads that are
updating that metadata.
-Keeping the scan data up to date requires requires the ability to propagate
+Keeping the scan data up to date requires the ability to propagate
metadata updates from the filesystem into the data being collected by the scan.
This *can* be done by appending concurrent updates into a separate log file and
applying them before writing the new metadata to disk, but this leads to
@@ -2221,11 +2165,6 @@ xfiles enables reuse of the entire btree library.
Btrees built atop an xfile are collectively known as ``xfbtrees``.
The next few sections describe how they actually work.
-The proposed patchset is the
-`in-memory btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
-series.
-
Using xfiles as a Buffer Cache Target
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -2277,13 +2216,12 @@ follows:
pointing to the xfile.
3. Pass the buffer cache target, buffer ops, and other information to
- ``xfbtree_create`` to write an initial tree header and root block to the
- xfile.
+ ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an
+ initial root block to the xfile.
Each btree type should define a wrapper that passes necessary arguments to
the creation function.
For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
all the necessary details for callers.
- A ``struct xfbtree`` object will be returned.
4. Pass the xfbtree object to the btree cursor creation function for the
btree type.
@@ -2467,14 +2405,6 @@ This enables the log to release the old EFI to keep the log moving forwards.
EFIs have a role to play during the commit and reaping phases; please see the
next section and the section about :ref:`reaping<reaping>` for more details.
-Proposed patchsets are the
-`bitmap rework
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
-and the
-`preparation for bulk loading btrees
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
-
-
Writing the New Tree
````````````````````
@@ -2566,8 +2496,8 @@ old metadata blocks:
The transaction rolling in steps 2c and 3 represents a weakness in the repair
algorithm, because a log flush and a crash before the end of the reap step can
result in space leaking.
-Online repair functions minimize the chances of this occuring by using very
-large transactions, which each can accomodate many thousands of block freeing
+Online repair functions minimize the chances of this occurring by using very
+large transactions, which each can accommodate many thousands of block freeing
instructions.
Repair moves on to reaping the old blocks, which will be presented in a
subsequent :ref:`section<reaping>` after a few case studies of bulk loading.
@@ -2631,11 +2561,6 @@ The number of records for the inode btree is the number of xfarray records,
but the record count for the free inode btree has to be computed as inode chunk
records are stored in the xfarray.
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
Case Study: Rebuilding the Space Reference Counts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -2724,11 +2649,6 @@ Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
removed via ``xfarray_unset``.
Bag members are examined through ``xfarray_iter`` loops.
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
Case Study: Rebuilding File Fork Mapping Indices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -2765,11 +2685,6 @@ EXTENTS format instead of BMBT, which may require a conversion.
Third, the incore extent map must be reloaded carefully to avoid disturbing
any delayed allocation extents.
-The proposed patchset is the
-`file mapping repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
-series.
-
.. _reaping:
Reaping Old Metadata Blocks
@@ -2810,7 +2725,8 @@ follows this format:
Repairs for file-based metadata such as extended attributes, directories,
symbolic links, quota files and realtime bitmaps are performed by building a
-new structure attached to a temporary file and swapping the forks.
+new structure attached to a temporary file and exchanging all mappings in the
+file forks.
Afterward, the mappings in the old file fork are the candidate blocks for
disposal.
@@ -2850,11 +2766,6 @@ blocks.
As stated earlier, online repair functions use very large transactions to
minimize the chances of this occurring.
-The proposed patchset is the
-`preparation for bulk loading btrees
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
-series.
-
Case Study: Reaping After a Regular Btree Repair
````````````````````````````````````````````````
@@ -2950,11 +2861,6 @@ When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
btrees.
These blocks can then be reaped using the methods outlined above.
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
.. _rmap_reap:
Case Study: Reaping After Repairing Reverse Mapping Btrees
@@ -2979,11 +2885,6 @@ methods outlined above.
The rest of the process of rebuilding the reverse mapping btree is discussed
in a separate :ref:`case study<rmap_repair>`.
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
Case Study: Rebuilding the AGFL
```````````````````````````````
@@ -3031,11 +2932,6 @@ more complicated, because computing the correct value requires traversing the
forks, or if that fails, leaving the fields invalid and waiting for the fork
fsck functions to run.
-The proposed patchset is the
-`inode
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
-repair series.
-
Quota Record Repairs
--------------------
@@ -3052,11 +2948,6 @@ checking are obviously bad limits and timer values.
Quota usage counters are checked, repaired, and discussed separately in the
section about :ref:`live quotacheck <quotacheck>`.
-The proposed patchset is the
-`quota
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
-repair series.
-
.. _fscounters:
Freezing to Fix Summary Counters
@@ -3152,11 +3043,6 @@ long enough to check and correct the summary counters.
| This bug was fixed in Linux 5.17. |
+--------------------------------------------------------------------------+
-The proposed patchset is the
-`summary counter cleanup
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
-series.
-
Full Filesystem Scans
---------------------
@@ -3284,15 +3170,6 @@ Second, if the incore inode is stuck in some intermediate state, the scan
coordinator must release the AGI and push the main filesystem to get the inode
back into a loadable state.
-The proposed patches are the
-`inode scanner
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
-series.
-The first user of the new functionality is the
-`online quotacheck
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
-series.
-
Inode Management
````````````````
@@ -3388,12 +3265,6 @@ To capture these nuances, the online fsck code has a separate ``xchk_irele``
function to set or clear the ``DONTCACHE`` flag to get the required release
behavior.
-Proposed patchsets include fixing
-`scrub iget usage
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
-`dir iget usage
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
-
.. _ilocking:
Locking Inodes
@@ -3450,11 +3321,6 @@ If the dotdot entry changes while the directory is unlocked, then a move or
rename operation must have changed the child's parentage, and the scan can
exit early.
-The proposed patchset is the
-`directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
-series.
-
.. _fshooks:
Filesystem Hooks
@@ -3601,11 +3467,6 @@ The inode scan APIs are pretty simple:
- ``xchk_iscan_teardown`` to finish the scan
-This functionality is also a part of the
-`inode scanner
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
-series.
-
.. _quotacheck:
Case Study: Quota Counter Checking
@@ -3693,11 +3554,6 @@ needing to hold any locks for a long duration.
If repairs are desired, the real and shadow dquots are locked and their
resource counts are set to the values in the shadow dquot.
-The proposed patchset is the
-`online quotacheck
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
-series.
-
.. _nlinks:
Case Study: File Link Count Checking
@@ -3751,11 +3607,6 @@ shadow information.
If no parents are found, the file must be :ref:`reparented <orphanage>` to the
orphanage to prevent the file from being lost forever.
-The proposed patchset is the
-`file link count repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
-series.
-
.. _rmap_repair:
Case Study: Rebuilding Reverse Mapping Records
@@ -3835,11 +3686,6 @@ scan for reverse mapping records.
12. Free the xfbtree now that it is not needed.
-The proposed patchset is the
-`rmap repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
-series.
-
Staging Repairs with Temporary Files on Disk
--------------------------------------------
@@ -3859,8 +3705,8 @@ Because file forks can consume as much space as the entire filesystem, repairs
cannot be staged in memory, even when a paging scheme is available.
Therefore, online repair of file-based metadata creates a temporary file in
the XFS filesystem, writes a new structure at the correct offsets into the
-temporary file, and atomically swaps the fork mappings (and hence the fork
-contents) to commit the repair.
+temporary file, and atomically exchanges all file fork mappings (and hence the
+fork contents) to commit the repair.
Once the repair is complete, the old fork can be reaped as necessary; if the
system goes down during the reap, the iunlink code will delete the blocks
during log recovery.
@@ -3870,10 +3716,11 @@ consistent to use a temporary file safely!
This dependency is the reason why online repair can only use pageable kernel
memory to stage ondisk space usage information.
-Swapping metadata extents with a temporary file requires the owner field of the
-block headers to match the file being repaired and not the temporary file. The
-directory, extended attribute, and symbolic link functions were all modified to
-allow callers to specify owner numbers explicitly.
+Exchanging metadata file mappings with a temporary file requires the owner
+field of the block headers to match the file being repaired and not the
+temporary file.
+The directory, extended attribute, and symbolic link functions were all
+modified to allow callers to specify owner numbers explicitly.
There is a downside to the reaping process -- if the system crashes during the
reap phase and the fork extents are crosslinked, the iunlink processing will
@@ -3977,13 +3824,8 @@ Once a good copy of a data file has been constructed in a temporary file, it
must be conveyed to the file being repaired, which is the topic of the next
section.
-The proposed patches are in the
-`repair temporary files
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
-series.
-
-Atomic Extent Swapping
-----------------------
+Logged File Content Exchanges
+-----------------------------
Once repair builds a temporary file with a new data structure written into
it, it must commit the new changes into the existing file.
@@ -4018,20 +3860,19 @@ e. Old blocks in the file may be cross-linked with another structure and must
These problems are overcome by creating a new deferred operation and a new type
of log intent item to track the progress of an operation to exchange two file
ranges.
-The new deferred operation type chains together the same transactions used by
-the reverse-mapping extent swap code.
+The new exchange operation type chains together the same transactions used by
+the reverse-mapping extent swap code, but records intermediate progress in the
+log so that operations can be restarted after a crash.
+This new functionality is called the file contents exchange (xfs_exchrange)
+code.
+The underlying implementation exchanges file fork mappings (xfs_exchmaps).
The new log item records the progress of the exchange to ensure that once an
exchange begins, it will always run to completion, even if there are
interruptions.
-The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag
in the superblock protects these new log item records from being replayed on
old kernels.
-The proposed patchset is the
-`atomic extent swap
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
-series.
-
+--------------------------------------------------------------------------+
| **Sidebar: Using Log-Incompatible Feature Flags** |
+--------------------------------------------------------------------------+
@@ -4055,9 +3896,6 @@ series.
| one ``struct rw_semaphore`` for each feature. |
| The log cleaning code tries to take this rwsem in exclusive mode to |
| clear the bit; if the lock attempt fails, the feature bit remains set. |
-| Filesystem code signals its intention to use a log incompat feature in a |
-| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
-| in shared mode. |
| The code supporting a log incompat feature should create wrapper |
| functions to obtain the log feature and call |
| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary |
@@ -4072,72 +3910,73 @@ series.
| The feature bit will not be cleared from the superblock until the log |
| becomes clean. |
| |
-| Log-assisted extended attribute updates and atomic extent swaps both use |
-| log incompat features and provide convenience wrappers around the |
+| Log-assisted extended attribute updates and file content exchanges both  |
+| use log incompat features and provide convenience wrappers around the |
| functionality. |
+--------------------------------------------------------------------------+
-Mechanics of an Atomic Extent Swap
-``````````````````````````````````
+Mechanics of a Logged File Content Exchange
+```````````````````````````````````````````
-Swapping entire file forks is a complex task.
+Exchanging contents between file forks is a complex task.
The goal is to exchange all file fork mappings between two file fork offset
ranges.
There are likely to be many extent mappings in each fork, and the edges of
the mappings aren't necessarily aligned.
-Furthermore, there may be other updates that need to happen after the swap,
+Furthermore, there may be other updates that need to happen after the exchange,
such as exchanging file sizes, inode flags, or conversion of fork data to local
format.
-This is roughly the format of the new deferred extent swap work item:
+This is roughly the format of the new deferred exchange-mapping work item:
.. code-block:: c
- struct xfs_swapext_intent {
+ struct xfs_exchmaps_intent {
/* Inodes participating in the operation. */
- struct xfs_inode *sxi_ip1;
- struct xfs_inode *sxi_ip2;
+ struct xfs_inode *xmi_ip1;
+ struct xfs_inode *xmi_ip2;
/* File offset range information. */
- xfs_fileoff_t sxi_startoff1;
- xfs_fileoff_t sxi_startoff2;
- xfs_filblks_t sxi_blockcount;
+ xfs_fileoff_t xmi_startoff1;
+ xfs_fileoff_t xmi_startoff2;
+ xfs_filblks_t xmi_blockcount;
/* Set these file sizes after the operation, unless negative. */
- xfs_fsize_t sxi_isize1;
- xfs_fsize_t sxi_isize2;
+ xfs_fsize_t xmi_isize1;
+ xfs_fsize_t xmi_isize2;
- /* XFS_SWAP_EXT_* log operation flags */
- uint64_t sxi_flags;
+ /* XFS_EXCHMAPS_* log operation flags */
+ uint64_t xmi_flags;
};
The new log intent item contains enough information to track two logical fork
offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
blockcount)``.
-Each step of a swap operation exchanges the largest file range mapping possible
-from one file to the other.
-After each step in the swap operation, the two startoff fields are incremented
-and the blockcount field is decremented to reflect the progress made.
-The flags field captures behavioral parameters such as swapping the attr fork
-instead of the data fork and other work to be done after the extent swap.
-The two isize fields are used to swap the file size at the end of the operation
-if the file data fork is the target of the swap operation.
-
-When the extent swap is initiated, the sequence of operations is as follows:
-
-1. Create a deferred work item for the extent swap.
- At the start, it should contain the entirety of the file ranges to be
- swapped.
+Each step of an exchange operation exchanges the largest file range mapping
+possible from one file to the other.
+After each step in the exchange operation, the two startoff fields are
+incremented and the blockcount field is decremented to reflect the progress
+made.
+The flags field captures behavioral parameters such as exchanging attr fork
+mappings instead of the data fork and other work to be done after the exchange.
+The two isize fields are used to exchange the file sizes at the end of the
+operation if the file data fork is the target of the operation.
+
+When the exchange is initiated, the sequence of operations is as follows:
+
+1. Create a deferred work item for the file mapping exchange.
+ At the start, it should contain the entirety of the file block ranges to be
+ exchanged.
2. Call ``xfs_defer_finish`` to process the exchange.
- This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+ This is encapsulated in ``xrep_tempexch_contents`` for scrub operations.
This will log a mapping exchange intent item to the transaction for the deferred
- extent swap work item.
+ mapping exchange work item.
-3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,
- a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
- ``sxi_startoff2``, respectively, and compute the longest extent that can
- be swapped in a single step.
+ a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
+ ``xmi_startoff2``, respectively, and compute the longest extent that can
+ be exchanged in a single step.
This is the minimum of the two ``br_blockcount`` values in the mappings.
Keep advancing through the file forks until at least one of the mappings
contains written blocks.
@@ -4159,20 +3998,20 @@ When the extent swap is initiated, the sequence of operations is as follows:
g. Extend the ondisk size of either file if necessary.
- h. Log an extent swap done log item for the extent swap intent log item
- that was read at the start of step 3.
+ h. Log a mapping exchange done log item for the mapping exchange intent log
+ item that was read at the start of step 3.
i. Compute the amount of file range that has just been covered.
This quantity is ``(map1.br_startoff + map1.br_blockcount -
- sxi_startoff1)``, because step 3a could have skipped holes.
+ xmi_startoff1)``, because step 3a could have skipped holes.
- j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+ j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
by the number of blocks computed in the previous step, and decrease
- ``sxi_blockcount`` by the same quantity.
+ ``xmi_blockcount`` by the same quantity.
This advances the cursor.
- k. Log a new extent swap intent log item reflecting the advanced state of
- the work item.
+ k. Log a new mapping exchange intent log item reflecting the advanced state
+ of the work item.
l. Return the proper error code (EAGAIN) to the deferred operation manager
to inform it that there is more work to be done.
@@ -4183,22 +4022,23 @@ When the extent swap is initiated, the sequence of operations is as follows:
This will be discussed in more detail in subsequent sections.
If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished extent swap log intent item and restart from
-there.
-This is how extent swapping guarantees that an outside observer will either see
-the old broken structure or the new one, and never a mismash of both.
+find the most recent unfinished mapping exchange log intent item and restart
+from there.
+This is how atomic file mapping exchanges guarantee that an outside observer
+will either see the old broken structure or the new one, and never a mishmash
+of both.
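+
+The cursor advance in steps 3i and 3j above reduces to a few lines of
+arithmetic. A sketch, assuming the ``struct xfs_exchmaps_intent`` fields shown
+earlier and the ``map1`` mapping read in step 3a:
+
+.. code-block:: c
+
+	/* Sketch of steps 3i-3j: advance the exchange cursor. */
+	xfs_filblks_t	advanced;
+
+	/* Step 3a may have skipped holes, so measure from xmi_startoff1. */
+	advanced = map1.br_startoff + map1.br_blockcount - xmi->xmi_startoff1;
+
+	xmi->xmi_startoff1 += advanced;
+	xmi->xmi_startoff2 += advanced;
+	xmi->xmi_blockcount -= advanced;
+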
-Preparation for Extent Swapping
-```````````````````````````````
+Preparation for File Content Exchanges
+``````````````````````````````````````
There are a few things that need to be taken care of before initiating an
-atomic extent swap operation.
+atomic file mapping exchange operation.
First, regular files require the page cache to be flushed to disk before the
operation begins, and directio writes to be quiesced.
-Like any filesystem operation, extent swapping must determine the maximum
-amount of disk space and quota that can be consumed on behalf of both files in
-the operation, and reserve that quantity of resources to avoid an unrecoverable
-out of space failure once it starts dirtying metadata.
+Like any filesystem operation, file mapping exchanges must determine the
+maximum amount of disk space and quota that can be consumed on behalf of both
+files in the operation, and reserve that quantity of resources to avoid an
+unrecoverable out of space failure once it starts dirtying metadata.
The preparation step scans the ranges of both files to estimate:
- Data device blocks needed to handle the repeated updates to the fork
@@ -4212,56 +4052,59 @@ The preparation step scans the ranges of both files to estimate:
to different extents on the realtime volume, which could happen if the
operation fails to run to completion.
-The need for precise estimation increases the run time of the swap operation,
-but it is very important to maintain correct accounting.
-The filesystem must not run completely out of free space, nor can the extent
-swap ever add more extent mappings to a fork than it can support.
+The need for precise estimation increases the run time of the exchange
+operation, but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the mapping
+exchange ever add more extent mappings to a fork than it can support.
Regular users are required to abide by the quota limits, though metadata repairs
may exceed quota to resolve inconsistent metadata elsewhere.
-Special Features for Swapping Metadata File Extents
-```````````````````````````````````````````````````
+Special Features for Exchanging Metadata File Contents
+``````````````````````````````````````````````````````
Extended attributes, symbolic links, and directories can set the fork format to
"local" and treat the fork as a literal area for data storage.
Metadata repairs must take extra steps to support these cases:
- If both forks are in local format and the fork areas are large enough, the
- swap is performed by copying the incore fork contents, logging both forks,
- and committing.
- The atomic extent swap mechanism is not necessary, since this can be done
- with a single transaction.
+ exchange is performed by copying the incore fork contents, logging both
+ forks, and committing.
+ The atomic file mapping exchange mechanism is not necessary, since this can
+ be done with a single transaction.
-- If both forks map blocks, then the regular atomic extent swap is used.
+- If both forks map blocks, then the regular atomic file mapping exchange is
+ used.
- Otherwise, only one fork is in local format.
The contents of the local format fork are converted to a block to perform the
- swap.
+ exchange.
The conversion to block format must be done in the same transaction that
- logs the initial extent swap intent log item.
- The regular atomic extent swap is used to exchange the mappings.
- Special flags are set on the swap operation so that the transaction can be
- rolled one more time to convert the second file's fork back to local format
- so that the second file will be ready to go as soon as the ILOCK is dropped.
+ logs the initial mapping exchange intent log item.
+ The regular atomic mapping exchange is used to exchange the metadata file
+ mappings.
+ Special flags are set on the exchange operation so that the transaction can
+ be rolled one more time to convert the second file's fork back to local
+ format so that the second file will be ready to go as soon as the ILOCK is
+ dropped.
Extended attributes and directories stamp the owning inode into every block,
but the buffer verifiers do not actually check the inode number!
Although there is no verification, it is still important to maintain
-referential integrity, so prior to performing the extent swap, online repair
-builds every block in the new data structure with the owner field of the file
-being repaired.
+referential integrity, so prior to performing the mapping exchange, online
+repair builds every block in the new data structure with the owner field of the
+file being repaired.
-After a successful swap operation, the repair operation must reap the old fork
-blocks by processing each fork mapping through the standard :ref:`file extent
-reaping <reaping>` mechanism that is done post-repair.
+After a successful exchange operation, the repair operation must reap the old
+fork blocks by processing each fork mapping through the standard :ref:`file
+extent reaping <reaping>` mechanism that is done post-repair.
If the filesystem should go down during the reap part of the repair, the
iunlink processing at the end of recovery will free both the temporary file and
whatever blocks were not reaped.
However, this iunlink processing omits the cross-link detection of online
repair, and is not completely foolproof.
-Swapping Temporary File Extents
-```````````````````````````````
+Exchanging Temporary File Contents
+``````````````````````````````````
To repair a metadata file, online repair proceeds as follows:
@@ -4271,14 +4114,14 @@ To repair a metadata file, online repair proceeds as follows:
file.
The same fork must be written to as is being repaired.
-3. Commit the scrub transaction, since the swap estimation step must be
- completed before transaction reservations are made.
+3. Commit the scrub transaction, since the exchange resource estimation step
+ must be completed before transaction reservations are made.
-4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with
the appropriate resource reservations and locks, and fill out a ``struct
- xfs_swapext_req`` with the details of the swap operation.
+ xfs_exchmaps_req`` with the details of the exchange operation.
-5. Call ``xrep_tempswap_contents`` to swap the contents.
+5. Call ``xrep_tempexch_contents`` to exchange the contents.
6. Commit the transaction to complete the repair.
@@ -4320,14 +4163,9 @@ To check the summary file against the bitmap:
3. Compare the contents of the xfile against the ondisk file.
To repair the summary file, write the xfile contents into the temporary file
-and use atomic extent swap to commit the new contents.
+and use atomic mapping exchange to commit the new contents.
The temporary file is then reaped.
-The proposed patchset is the
-`realtime summary repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
-series.
-
Case Study: Salvaging Extended Attributes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4363,17 +4201,12 @@ Salvaging extended attributes is done as follows:
memory or there are no more attr fork blocks to examine, unlock the file and
add the staged extended attributes to the temporary file.
-3. Use atomic extent swapping to exchange the new and old extended attribute
- structures.
+3. Use atomic file mapping exchange to exchange the new and old extended
+ attribute structures.
The old attribute blocks are now attached to the temporary file.
4. Reap the temporary file.
-The proposed patchset is the
-`extended attribute repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
-series.
-
Fixing Directories
------------------
@@ -4421,7 +4254,8 @@ salvaging directories is straightforward:
directory and add the staged dirents into the temporary directory.
Truncate the staging files.
-4. Use atomic extent swapping to exchange the new and old directory structures.
+4. Use atomic file mapping exchange to exchange the new and old directory
+ structures.
The old directory blocks are now attached to the temporary file.
5. Reap the temporary file.
@@ -4447,11 +4281,6 @@ Unfortunately, the current dentry cache design doesn't provide a means to walk
every child dentry of a specific directory, which makes this a hard problem.
There is no known solution.
-The proposed patchset is the
-`directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
-series.
-
Parent Pointers
```````````````
@@ -4464,10 +4293,10 @@ reconstruction of filesystem space metadata.
The parent pointer feature, however, makes total directory reconstruction
possible.
-XFS parent pointers include the dirent name and location of the entry within
-the parent directory.
+XFS parent pointers contain the information needed to identify the
+corresponding directory entry in the parent directory.
In other words, child files use extended attributes to store pointers to
-parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
+parents in the form ``(dirent_name) → (parent_inum, parent_gen)``.
The directory checking process can be strengthened to ensure that the target of
each dirent also contains a parent pointer pointing back to the dirent.
Likewise, each parent pointer can be checked by ensuring that the target of
@@ -4475,8 +4304,6 @@ each parent pointer is a directory and that it contains a dirent matching
the parent pointer.
Both online and offline repair can use this strategy.
-**Note**: The ondisk format of parent pointers is not yet finalized.
-
+--------------------------------------------------------------------------+
| **Historical Sidebar**: |
+--------------------------------------------------------------------------+
@@ -4518,8 +4345,58 @@ Both online and offline repair can use this strategy.
| Chandan increased the maximum extent counts of both data and attribute |
| forks, thereby ensuring that the extended attribute structure can grow |
| to handle the maximum hardlink count of any file. |
+| |
+| For this second effort, the ondisk parent pointer format as originally |
+| proposed was ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. |
+| The format was changed during development to eliminate the requirement |
+| of repair tools needing to ensure that the ``dirent_pos`` field always |
+| matched when reconstructing a directory. |
+| |
+| There were a few other ways to have solved that problem: |
+| |
+| 1. The field could be designated advisory, since the other three values |
+| are sufficient to find the entry in the parent. |
+| However, this makes indexed key lookup impossible while repairs are |
+| ongoing. |
+| |
+| 2. We could allow creating directory entries at specified offsets, which |
+| solves the referential integrity problem but runs the risk that |
+| dirent creation will fail due to conflicts with the free space in the |
+| directory. |
+| |
+| These conflicts could be resolved by appending the directory entry |
+| and amending the xattr code to support updating an xattr key and |
+| reindexing the dabtree, though this would have to be performed with |
+| the parent directory still locked. |
+| |
+| 3. Same as above, but remove the old parent pointer entry and add a new |
+| one atomically. |
+| |
+| 4. Change the ondisk xattr format to |
+| ``(parent_inum, name) → (parent_gen)``, which would provide the attr |
+| name uniqueness that we require, without forcing repair code to |
+| update the dirent position. |
+| Unfortunately, this requires changes to the xattr code to support |
+| attr names as long as 263 bytes. |
+| |
+| 5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → |
+| (name, parent_gen)``. |
+| If the hash is sufficiently resistant to collisions (e.g. sha256) |
+| then this should provide the attr name uniqueness that we require. |
+| Names shorter than 247 bytes could be stored directly. |
+| |
+| 6. Change the ondisk xattr format to ``(dirent_name) → (parent_ino, |
+| parent_gen)``. This format doesn't require any of the complicated |
+| nested name hashing of the previous suggestions. However, it was |
+| discovered that multiple hardlinks to the same inode with the same |
+| filename caused performance problems with hashed xattr lookups, so |
+| the parent inumber is now xor'd into the hash index. |
+| |
+| In the end, it was decided that solution #6 was the most compact and the |
+| most performant. A new hash function was designed for parent pointers. |
+--------------------------------------------------------------------------+
+
Case Study: Repairing Directories with Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4527,8 +4404,9 @@ Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
a :ref:`directory entry live update hook <liveupdate>` as follows:
1. Set up a temporary directory for generating the new directory structure,
- an xfblob for storing entry names, and an xfarray for stashing directory
- updates.
+ an xfblob for storing entry names, and an xfarray for stashing the fixed
+ size fields involved in a directory update: ``(child inumber, add vs.
+ remove, name cookie, ftype)`` (see the sketch after this list).
2. Set up an inode scanner and hook into the directory entry code to receive
updates on directory operations.
@@ -4537,72 +4415,30 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
pointer references the directory of interest.
If so:
- a. Stash an addname entry for this dirent in the xfarray for later.
+ a. Stash the parent pointer name and an addname entry for this dirent in the
+ xfblob and xfarray, respectively.
- b. When finished scanning that file, flush the stashed updates to the
- temporary directory.
+ b. When finished scanning that file or the kernel memory consumption exceeds
+ a threshold, flush the stashed updates to the temporary directory.
4. For each live directory update received via the hook, decide if the child
has already been scanned.
If so:
- a. Stash an addname or removename entry for this dirent update in the
- xfarray for later.
+ a. Stash the parent pointer name and an addname or removename entry for this
+ dirent update in the xfblob and xfarray for later.
We cannot write directly to the temporary directory because hook
functions are not allowed to modify filesystem metadata.
Instead, we stash updates in the xfarray and rely on the scanner thread
to apply the stashed updates to the temporary directory.
-5. When the scan is complete, atomically swap the contents of the temporary
+5. When the scan is complete, replay any stashed entries in the xfarray.
+
+6. When the scan is complete, atomically exchange the contents of the temporary
directory and the directory being repaired.
The temporary directory now contains the damaged directory structure.
-6. Reap the temporary directory.
-
-7. Update the dirent position field of parent pointers as necessary.
- This may require the queuing of a substantial number of xattr log intent
- items.
-
-The proposed patchset is the
-`parent pointers directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
-series.
-
-**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
-match in the reconstructed directory?
-
-*Answer*: There are a few ways to solve this problem:
-
-1. The field could be designated advisory, since the other three values are
- sufficient to find the entry in the parent.
- However, this makes indexed key lookup impossible while repairs are ongoing.
-
-2. We could allow creating directory entries at specified offsets, which solves
- the referential integrity problem but runs the risk that dirent creation
- will fail due to conflicts with the free space in the directory.
-
- These conflicts could be resolved by appending the directory entry and
- amending the xattr code to support updating an xattr key and reindexing the
- dabtree, though this would have to be performed with the parent directory
- still locked.
-
-3. Same as above, but remove the old parent pointer entry and add a new one
- atomically.
-
-4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
- which would provide the attr name uniqueness that we require, without
- forcing repair code to update the dirent position.
- Unfortunately, this requires changes to the xattr code to support attr
- names as long as 263 bytes.
-
-5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
- (name, parent_gen)``.
- If the hash is sufficiently resistant to collisions (e.g. sha256) then
- this should provide the attr name uniqueness that we require.
- Names shorter than 247 bytes could be stored directly.
-
-Discussion is ongoing under the `parent pointers patch deluge
-<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.
+7. Reap the temporary directory.
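+
+A sketch of one stashed update record from step 1; the struct and field names
+here are hypothetical, and ``xfblob_cookie`` is assumed to be the blob store's
+cookie type:
+
+.. code-block:: c
+
+	/* Hypothetical layout of one stashed directory update. */
+	struct stashed_dirent_update {
+		xfs_ino_t	child_ino;	/* child inumber */
+		xfblob_cookie	name_cookie;	/* name stored in the xfblob */
+		bool		add;		/* add vs. remove */
+		uint8_t		ftype;		/* dirent file type */
+	};
+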
Case Study: Repairing Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4611,8 +4447,9 @@ Online reconstruction of a file's parent pointer information works similarly to
directory reconstruction:
1. Set up a temporary file for generating a new extended attribute structure,
- an `xfblob<xfblob>` for storing parent pointer names, and an xfarray for
- stashing parent pointer updates.
+ an xfblob for storing parent pointer names, and an xfarray for stashing the
+ fixed size fields involved in a parent pointer update: ``(parent inumber,
+ parent generation, add vs. remove, name cookie)``.
2. Set up an inode scanner and hook into the directory entry code to receive
updates on directory operations.
@@ -4621,35 +4458,32 @@ directory reconstruction:
dirent references the file of interest.
If so:
- a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
- for later.
+ a. Stash the dirent name and an addpptr entry for this parent pointer in the
+ xfblob and xfarray, respectively.
- b. When finished scanning the directory, flush the stashed updates to the
- temporary directory.
+ b. When finished scanning the directory or the kernel memory consumption
+ exceeds a threshold, flush the stashed updates to the temporary file.
4. For each live directory update received via the hook, decide if the parent
has already been scanned.
If so:
- a. Stash an addpptr or removepptr entry for this dirent update in the
- xfarray for later.
+ a. Stash the dirent name and an addpptr or removepptr entry for this dirent
+ update in the xfblob and xfarray for later.
We cannot write parent pointers directly to the temporary file because
hook functions are not allowed to modify filesystem metadata.
Instead, we stash updates in the xfarray and rely on the scanner thread
to apply the stashed parent pointer updates to the temporary file.
-5. Copy all non-parent pointer extended attributes to the temporary file.
+5. When the scan is complete, replay any stashed entries in the xfarray.
-6. When the scan is complete, atomically swap the attribute fork of the
- temporary file and the file being repaired.
- The temporary file now contains the damaged extended attribute structure.
+6. Copy all non-parent pointer extended attributes to the temporary file.
-7. Reap the temporary file.
+7. When the scan is complete, atomically exchange the mappings of the attribute
+ forks of the temporary file and the file being repaired.
+ The temporary file now contains the damaged extended attribute structure.
-The proposed patchset is the
-`parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
-series.
+8. Reap the temporary file.
Digression: Offline Checking of Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4659,26 +4493,56 @@ files are erased long before directory tree connectivity checks are performed.
Parent pointer checks are therefore a second pass to be added to the existing
connectivity checks:
-1. After the set of surviving files has been established (i.e. phase 6),
+1. After the set of surviving files has been established (phase 6),
walk the surviving directories of each AG in the filesystem.
This is already performed as part of the connectivity checks.
-2. For each directory entry found, record the name in an xfblob, and store
- ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
- per-AG in-memory slab.
+2. For each directory entry found,
+
+ a. If the name has already been stored in the xfblob, then use that cookie
+ and skip the next step.
+
+ b. Otherwise, record the name in an xfblob, and remember the xfblob cookie.
+ Unique mappings are critical for
+
+ 1. Deduplicating names to reduce memory usage, and
+
+ 2. Creating a stable sort key for the parent pointer indexes so that the
+ parent pointer validation described below will work.
+
+ c. Store ``(child_ag_inum, parent_inum, parent_gen, name_hash, name_len,
+ name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash``
+ referenced in this section is the regular directory entry name hash, not
+ the specialized one used for parent pointer xattrs.
3. For each AG in the filesystem,
- a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
- dirent_pos.
+ a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``,
+ ``name_hash``, and ``name_cookie``.
+ Having a single ``name_cookie`` for each ``name`` is critical for
+ handling the uncommon case of a directory containing multiple hardlinks
+ to the same file where all the names hash to the same value.
b. For each inode in the AG,
1. Scan the inode for parent pointers.
- Record the names in a per-file xfblob, and store ``(parent_inum,
- parent_gen, dirent_pos)`` tuples in a per-file slab.
+ For each parent pointer found,
+
+ a. Validate the ondisk parent pointer.
+ If validation fails, move on to the next parent pointer in the
+ file.
- 2. Sort the per-file tuples in order of parent_inum, and dirent_pos.
+ b. If the name has already been stored in the xfblob, then use that
+ cookie and skip the next step.
+
+ c. Record the name in a per-file xfblob, and remember the xfblob
+ cookie.
+
+ d. Store ``(parent_inum, parent_gen, name_hash, name_len,
+ name_cookie)`` tuples in a per-file slab.
+
+ 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``,
+ and ``name_cookie``.
3. Position one slab cursor at the start of the inode's records in the
per-AG tuple slab.
@@ -4687,28 +4551,32 @@ connectivity checks:
4. Position a second slab cursor at the start of the per-file tuple slab.
- 5. Iterate the two cursors in lockstep, comparing the parent_ino and
- dirent_pos fields of the records under each cursor.
+ 5. Iterate the two cursors in lockstep, comparing the ``parent_ino``,
+ ``name_hash``, and ``name_cookie`` fields of the records under each
+ cursor:
- a. Tuples in the per-AG list but not the per-file list are missing and
- need to be written to the inode.
+ a. If the per-AG cursor is at a lower point in the keyspace than the
+ per-file cursor, then the per-AG cursor points to a missing parent
+ pointer.
+ Add the parent pointer to the inode and advance the per-AG
+ cursor.
- b. Tuples in the per-file list but not the per-AG list are dangling
- and need to be removed from the inode.
+ b. If the per-file cursor is at a lower point in the keyspace than
+ the per-AG cursor, then the per-file cursor points to a dangling
+ parent pointer.
+ Remove the parent pointer from the inode and advance the per-file
+ cursor.
- c. For tuples in both lists, update the parent_gen and name components
- of the parent pointer if necessary.
+ c. Otherwise, both cursors point at the same parent pointer.
+ Update the parent_gen component if necessary.
+ Advance both cursors.
4. Move on to examining link counts, as we do today.
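+
+The lockstep walk in step 5 above is a standard sorted-merge comparison. A
+sketch of the loop shape only; the cursor and helper names are hypothetical:
+
+.. code-block:: c
+
+	/*
+	 * Compare the (parent_inum, name_hash, name_cookie) keys of the
+	 * records under the per-AG cursor @pa and the per-file cursor @pf;
+	 * an exhausted cursor sorts after everything else.
+	 */
+	while (pa || pf) {
+		int cmp = cmp_pptr_key(pa, pf);
+
+		if (cmp < 0) {
+			add_missing_parent_pointer(ip, pa);	/* 5a */
+			pa = next_ag_tuple(pa);
+		} else if (cmp > 0) {
+			remove_dangling_parent_pointer(ip, pf);	/* 5b */
+			pf = next_file_tuple(pf);
+		} else {
+			update_parent_gen_if_needed(ip, pa, pf);	/* 5c */
+			pa = next_ag_tuple(pa);
+			pf = next_file_tuple(pf);
+		}
+	}
+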
-The proposed patchset is the
-`offline parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
-series.
-
-Rebuilding directories from parent pointers in offline repair is very
-challenging because it currently uses a single-pass scan of the filesystem
-during phase 3 to decide which files are corrupt enough to be zapped.
+Rebuilding directories from parent pointers in offline repair would be very
+challenging because xfs_repair currently uses two single-pass scans of the
+filesystem during phases 3 and 4 to decide which files are corrupt enough to be
+zapped.
This scan would have to be converted into a multi-pass scan:
1. The first pass of the scan zaps corrupt inodes, forks, and attributes
@@ -4730,6 +4598,124 @@ This scan would have to be converted into a multi-pass scan:
This code has not yet been constructed.
+.. _dirtree:
+
+Case Study: Directory Tree Structure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As mentioned earlier, the filesystem directory tree is supposed to be a
+directed acyclic graph structure.
+However, each node in this graph is a separate ``xfs_inode`` object with its
+own locks, which makes validating the tree's structural properties difficult.
+Fortunately, non-directories are allowed to have multiple parents and cannot
+have children, so only directories need to be scanned.
+Directories typically constitute 5-10% of the files in a filesystem, which
+reduces the amount of work dramatically.
+
+If the directory tree could be frozen, it would be easy to discover cycles and
+disconnected regions by running a depth (or breadth) first search downwards
+from the root directory and marking a bitmap for each directory found.
+At any point in the walk, trying to set an already set bit means there is a
+cycle.
+After the scan completes, XORing the marked inode bitmap with the inode
+allocation bitmap reveals disconnected inodes.
+However, one of online repair's design goals is to avoid locking the entire
+filesystem unless it's absolutely necessary.
+Directory tree updates can move subtrees across the scanner wavefront on a live
+filesystem, so the bitmap algorithm cannot be applied.
+
+Directory parent pointers enable an incremental approach to validation of the
+tree structure.
+Instead of using one thread to scan the entire filesystem, multiple threads can
+walk from individual subdirectories upwards towards the root.
+For this to work, all directory entries and parent pointers must be internally
+consistent, each directory entry must have a parent pointer, and the link
+counts of all directories must be correct.
+Each scanner thread must be able to take the IOLOCK of an alleged parent
+directory while holding the IOLOCK of the child directory to prevent either
+directory from being moved within the tree.
+This is not possible since the VFS does not take the IOLOCK of a child
+subdirectory when moving that subdirectory, so instead the scanner stabilizes
+the parent -> child relationship by taking the ILOCKs and installing a dirent
+update hook to detect changes.
+
+The scanning process uses a dirent hook to detect changes to the directories
+mentioned in the scan data.
+The scan works as follows:
+
+1. For each subdirectory in the filesystem,
+
+ a. For each parent pointer of that subdirectory,
+
+ 1. Create a path object for that parent pointer, and mark the
+ subdirectory inode number in the path object's bitmap.
+
+ 2. Record the parent pointer name and inode number in a path structure.
+
+ 3. If the alleged parent is the subdirectory being scrubbed, the path is
+ a cycle.
+ Mark the path for deletion and repeat step 1a with the next
+ subdirectory parent pointer.
+
+ 4. Try to mark the alleged parent inode number in a bitmap in the path
+ object.
+ If the bit is already set, then there is a cycle in the directory
+ tree.
+ Mark the path as a cycle and repeat step 1a with the next subdirectory
+ parent pointer.
+
+ 5. Load the alleged parent.
+ If the alleged parent is not a linked directory, abort the scan
+ because the parent pointer information is inconsistent.
+
+ 6. For each parent pointer of this alleged ancestor directory,
+
+ a. Record the parent pointer name and inode number in the path object
+ if no parent has been set for that level.
+
+ b. If an ancestor has more than one parent, mark the path as corrupt.
+ Repeat step 1a with the next subdirectory parent pointer.
+
+ c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a.
+ This repeats until the directory tree root is reached or no parents
+ are found.
+
+ 7. If the walk terminates at the root directory, mark the path as ok.
+
+ 8. If the walk terminates without reaching the root, mark the path as
+ disconnected.
+
+2. If the directory entry update hook triggers, check all paths already found
+ by the scan.
+ If the entry matches part of a path, mark that path and the scan stale.
+ When the scanner thread sees that the scan has been marked stale, it deletes
+ all scan data and starts over.
+
+Repairing the directory tree works as follows:
+
+1. Walk each path of the target subdirectory.
+
+ a. Corrupt paths and cycle paths are counted as suspect.
+
+ b. Paths already marked for deletion are counted as bad.
+
+ c. Paths that reached the root are counted as good.
+
+2. If the subdirectory is either the root directory or has zero link count,
+ delete all incoming directory entries in the immediate parents.
+ Repairs are complete.
+
+3. If the subdirectory has exactly one path, set the dotdot entry to the
+ parent and exit.
+
+4. If the subdirectory has at least one good path, delete all the other
+ incoming directory entries in the immediate parents.
+
+5. If the subdirectory has no good paths and more than one suspect path, delete
+ all the other incoming directory entries in the immediate parents.
+
+6. If the subdirectory has zero paths, attach it to the lost and found.
+
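+A sketch of the repair policy above as straight-line code; all of the names
+are hypothetical:
+
+.. code-block:: c
+
+	/* Sketch of directory tree repair steps 2-6. */
+	if (is_rootdir || link_count == 0)
+		delete_all_incoming_dirents();		/* step 2 */
+	else if (nr_paths == 1)
+		set_dotdot_entry_to_parent();		/* step 3 */
+	else if (nr_good > 0)
+		delete_other_incoming_dirents();	/* step 4 */
+	else if (nr_suspect > 1)
+		delete_other_incoming_dirents();	/* step 5 */
+	else if (nr_paths == 0)
+		attach_to_lost_and_found();		/* step 6 */
+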
.. _orphanage:
The Orphanage
@@ -4777,19 +4763,22 @@ Orphaned files are adopted by the orphanage as follows:
The ``xrep_orphanage_iolock_two`` function follows the inode locking
strategy discussed earlier.
-3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
- to compute the new name in the orphanage and the block reservation required.
-
-4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
+3. Use ``xrep_adoption_trans_alloc`` to reserve resources to the repair
transaction.
-5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
- and found, and update the kernel dentry cache.
+4. Call ``xrep_orphanage_compute_name`` to compute the new name in the
+ orphanage.
-The proposed patches are in the
-`orphanage adoption
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
-series.
+5. If the adoption is going to happen, call ``xrep_adoption_reparent`` to
+ reparent the orphaned file into the lost and found and invalidate the dentry
+ cache.
+
+6. Call ``xrep_adoption_finish`` to commit any filesystem updates, release the
+ orphanage ILOCK, and clean the scrub transaction. Call
+ ``xrep_adoption_commit`` to commit the updates and the scrub transaction.
+
+7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all
+ resources.
6. Userspace Algorithms and Data Structures
===========================================
@@ -4904,14 +4893,6 @@ first workqueue's workers until the backlog eases.
This doesn't completely solve the balancing problem, but reduces it enough to
move on to more pressing issues.
-The proposed patchsets are the scrub
-`performance tweaks
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
-and the
-`inode scan rebalance
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
-series.
-
.. _scrubrepair:
Scheduling Repairs
@@ -4992,20 +4973,6 @@ immediately.
Corrupt file data blocks reported by phase 6 cannot be recovered by the
filesystem.
-The proposed patchsets are the
-`repair warning improvements
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
-refactoring of the
-`repair data dependency
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
-and
-`object tracking
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
-and the
-`repair scheduling
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
-improvement series.
-
Checking Names for Confusable Unicode Sequences
-----------------------------------------------
@@ -5090,7 +5057,7 @@ This scan after validation of all filesystem metadata (except for the summary
counters) as phase 6.
The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
to find areas that are allocated to file data fork extents.
-Gaps betweeen data fork extents that are smaller than 64k are treated as if
+Gaps between data fork extents that are smaller than 64k are treated as if
they were data fork extents to reduce the command setup overhead.
When the space map scan accumulates a region larger than 32MB, a media
verification request is sent to the disk as a directio read of the raw block
@@ -5116,18 +5083,18 @@ make it easier for code readers to understand what has been built, for whom it
has been built, and why.
Please feel free to contact the XFS mailing list with questions.
-FIEXCHANGE_RANGE
-----------------
+XFS_IOC_EXCHANGE_RANGE
+----------------------
-As discussed earlier, a second frontend to the atomic extent swap mechanism is
-a new ioctl call that userspace programs can use to commit updates to files
-atomically.
+As discussed earlier, a second frontend to the atomic file mapping exchange
+mechanism is a new ioctl call that userspace programs can use to commit updates
+to files atomically.
This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard.
-Extent Swapping with Regular User Files
-```````````````````````````````````````
+File Content Exchanges with Regular User Files
+``````````````````````````````````````````````
As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
@@ -5142,12 +5109,12 @@ the consistency of the fork mappings with the reverse mapping index was to
develop an iterative mechanism that used deferred bmap and rmap operations to
swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for
-the new tracking items, because the atomic extent swap mechanism is an
-iteration of an existing mechanism and not something totally novel.
+the new tracking items, because the atomic file mapping exchange mechanism is
+an iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain.
-Atomic extent swapping is much more flexible than the existing swapext
+Atomic file content exchanges are much more flexible than the existing swapext
implementations because they can guarantee that the caller never sees a mix of
old and new contents even after a crash, and they can operate on two arbitrary
file fork ranges.
@@ -5158,11 +5125,11 @@ The extra flexibility enables several new use cases:
Next, it opens a temporary file and calls the file clone operation to reflink
the first file's contents into the temporary file.
Writes to the original file should instead be written to the temporary file.
- Finally, the process calls the atomic extent swap system call
- (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
- of the updates to the original file, or none of them.
+ Finally, the process calls the atomic file mapping exchange system call
+ (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
+ committing all of the updates to the original file, or none of them.
-.. _swapext_if_unchanged:
+.. _exchrange_if_unchanged:
- **Transactional file updates**: The same mechanism as above, but the caller
only wants the commit to occur if the original file's contents have not
@@ -5171,19 +5138,22 @@ The extra flexibility enables several new use cases:
change timestamps of the original file before reflinking its data to the
temporary file.
When the program is ready to commit the changes, it passes the timestamps
- into the kernel as arguments to the atomic extent swap system call.
+ into the kernel as arguments to the atomic file mapping exchange system call.
The kernel only commits the changes if the provided timestamps match the
original file.
+ A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this; a
+ sketch follows this list.
- **Emulation of atomic block device writes**: Export a block device with a
logical sector size matching the filesystem block size to force all writes
to be aligned to the filesystem block size.
Stage all writes to a temporary file, and when that is complete, call the
- atomic extent swap system call with a flag to indicate that holes in the
- temporary file should be ignored.
+ atomic file mapping exchange system call with a flag to indicate that holes
+ in the temporary file should be ignored.
This emulates an atomic device write in software, and can support arbitrary
scattered writes.
+(This functionality was merged into mainline as of 2025.)
+
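Under the same caveats, the transactional variant referenced in the list above
can be sketched with the ``XFS_IOC_START_COMMIT`` / ``XFS_IOC_COMMIT_RANGE``
pair: the first ioctl samples the original file's change attributes into an
opaque cookie, and the second refuses the exchange if the file has changed
since. The field usage below is an assumption based on the interface
description; consult the current ``xfs_fs.h`` before relying on it.

.. code-block:: c

   #include <errno.h>
   #include <stdint.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <xfs/xfs.h>

   /* Returns 0 on commit, 1 if orig changed underneath us, -1 on error. */
   int commit_if_unchanged(int staged_fd, int orig_fd, uint64_t length)
   {
       struct xfs_commit_range xcr;

       memset(&xcr, 0, sizeof(xcr));

       /* Sample the original file's "freshness" data. */
       if (ioctl(orig_fd, XFS_IOC_START_COMMIT, &xcr) < 0)
           return -1;

       /* ... reflink orig into staged_fd and write the updates ... */

       xcr.file1_fd = staged_fd;
       xcr.length = length;
       /* Fails with EBUSY if orig_fd changed after START_COMMIT. */
       if (ioctl(orig_fd, XFS_IOC_COMMIT_RANGE, &xcr) < 0)
           return errno == EBUSY ? 1 : -1;
       return 0;
   }
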
Vectorized Scrub
----------------
@@ -5205,13 +5175,7 @@ It is hoped that ``io_uring`` will pick up enough of this functionality that
online fsck can use that instead of adding a separate vectored scrub system
call to XFS.
-The relevant patchsets are the
-`kernel vectorized scrub
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
-and
-`userspace vectorized scrub
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
-series.
+(This functionality was merged into mainline as of 2025.)
Quality of Service Targets for Scrub
------------------------------------
@@ -5262,8 +5226,8 @@ of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not being
-cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap
-<swapext_if_unchanged>` feature) to change the target file's data extent
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic file content exchanges
+<exchrange_if_unchanged>` feature) to change the target file's data extent
mapping away from the area being cleared.
When all other mappings have been moved, clearspace reflinks the space into the
space collector file so that it becomes unavailable.
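
For reference, here is a minimal sketch of the ``FIDEDUPERANGE`` call that a
tool like clearspace would issue; the ioctl and structures come from
``linux/fs.h``, while the file descriptors and offsets are illustrative.
Deduplication only remaps the destination if the byte contents already match,
which is what lets the relocated copy displace the mapping in the area being
cleared.

.. code-block:: c

   #include <stdint.h>
   #include <stdlib.h>
   #include <sys/ioctl.h>
   #include <linux/fs.h>   /* FIDEDUPERANGE */

   /* Point dest's mapping at src's (identical) physical blocks. */
   int dedupe_one(int src_fd, uint64_t src_off,
                  int dest_fd, uint64_t dest_off, uint64_t len)
   {
       struct file_dedupe_range *r;
       int ret = -1;

       r = calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
       if (!r)
           return -1;
       r->src_offset = src_off;
       r->src_length = len;
       r->dest_count = 1;
       r->info[0].dest_fd = dest_fd;
       r->info[0].dest_offset = dest_off;

       if (ioctl(src_fd, FIDEDUPERANGE, r) == 0)
           /* status: 0 if deduped, FILE_DEDUPE_RANGE_DIFFERS, or -errno */
           ret = r->info[0].status == FILE_DEDUPE_RANGE_SAME ? 0 : 1;
       free(r);
       return ret;
   }
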
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
index a10c4ae6955e..a10c4ae6955e 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.rst
+++ b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
diff --git a/Documentation/filesystems/zonefs.rst b/Documentation/filesystems/zonefs.rst
index 394b9f15dce0..c22124c2213d 100644
--- a/Documentation/filesystems/zonefs.rst
+++ b/Documentation/filesystems/zonefs.rst
@@ -378,7 +378,7 @@ The attributes defined are as follows.
sequential zone files. Failure to do so can result in write errors.
* **max_active_seq_files**: This attribute reports the maximum number of
sequential zone files that are in an active state, that is, sequential zone
- files that are partially writen (not empty nor full) or that have a zone that
+ files that are partially written (neither empty nor full) or that have a zone that
is explicitly open (which happens only if the *explicit-open* mount option is
used). This number is always equal to the maximum number of active zones that
the device supports. A value of 0 means that the mounted device has no limit