Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--  Documentation/filesystems/fscrypt.rst           |  45
-rw-r--r--  Documentation/filesystems/fsverity.rst          |   3
-rw-r--r--  Documentation/filesystems/iomap/design.rst      |   3
-rw-r--r--  Documentation/filesystems/iomap/operations.rst  |  57
-rw-r--r--  Documentation/filesystems/locking.rst           |   8
-rw-r--r--  Documentation/filesystems/porting.rst           |  42
-rw-r--r--  Documentation/filesystems/propagate_umount.txt  | 484
-rw-r--r--  Documentation/filesystems/vfs.rst               |  37
8 files changed, 597 insertions, 82 deletions
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index 29e84d125e02..696a5844bfa3 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -147,9 +147,8 @@ However, these ioctls have some limitations:
were wiped. To partially solve this, you can add init_on_free=1 to
your kernel command line. However, this has a performance cost.
-- Secret keys might still exist in CPU registers, in crypto
- accelerator hardware (if used by the crypto API to implement any of
- the algorithms), or in other places not explicitly considered here.
+- Secret keys might still exist in CPU registers or in other places
+ not explicitly considered here.
Full system compromise
~~~~~~~~~~~~~~~~~~~~~~
@@ -406,9 +405,12 @@ the work is done by XChaCha12, which is much faster than AES when AES
acceleration is unavailable. For more information about Adiantum, see
`the Adiantum paper <https://eprint.iacr.org/2018/720.pdf>`_.
-The (AES-128-CBC-ESSIV, AES-128-CBC-CTS) pair exists only to support
-systems whose only form of AES acceleration is an off-CPU crypto
-accelerator such as CAAM or CESA that does not support XTS.
+The (AES-128-CBC-ESSIV, AES-128-CBC-CTS) pair was added to try to
+provide a more efficient option for systems that lack AES instructions
+in the CPU but do have a non-inline crypto engine such as CAAM or CESA
+that supports AES-CBC (and not AES-XTS). This is deprecated. It has
+been shown that just doing AES on the CPU is actually faster.
+Moreover, Adiantum is faster still and is recommended on such systems.
The remaining mode pairs are the "national pride ciphers":
@@ -468,14 +470,6 @@ API, but the filenames mode still does.
- Recommended:
- AES-CBC acceleration
-fscrypt also uses HMAC-SHA512 for key derivation, so enabling SHA-512
-acceleration is recommended:
-
-- SHA-512
- - Recommended:
- - arm64: CONFIG_CRYPTO_SHA512_ARM64_CE
- - x86: CONFIG_CRYPTO_SHA512_SSSE3
-
Contents encryption
-------------------
@@ -1326,22 +1320,13 @@ this by validating all top-level encryption policies prior to access.
Inline encryption support
=========================
-By default, fscrypt uses the kernel crypto API for all cryptographic
-operations (other than HKDF, which fscrypt partially implements
-itself). The kernel crypto API supports hardware crypto accelerators,
-but only ones that work in the traditional way where all inputs and
-outputs (e.g. plaintexts and ciphertexts) are in memory. fscrypt can
-take advantage of such hardware, but the traditional acceleration
-model isn't particularly efficient and fscrypt hasn't been optimized
-for it.
-
-Instead, many newer systems (especially mobile SoCs) have *inline
-encryption hardware* that can encrypt/decrypt data while it is on its
-way to/from the storage device. Linux supports inline encryption
-through a set of extensions to the block layer called *blk-crypto*.
-blk-crypto allows filesystems to attach encryption contexts to bios
-(I/O requests) to specify how the data will be encrypted or decrypted
-in-line. For more information about blk-crypto, see
+Many newer systems (especially mobile SoCs) have *inline encryption
+hardware* that can encrypt/decrypt data while it is on its way to/from
+the storage device. Linux supports inline encryption through a set of
+extensions to the block layer called *blk-crypto*. blk-crypto allows
+filesystems to attach encryption contexts to bios (I/O requests) to
+specify how the data will be encrypted or decrypted in-line. For more
+information about blk-crypto, see
:ref:`Documentation/block/inline-encryption.rst <inline_encryption>`.
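+
+As a rough sketch (not fscrypt's actual code paths; see the blk-crypto
+documentation for the authoritative API), a filesystem tags a bio for
+inline en/decryption before submission::
+
+    /* key: a struct blk_crypto_key prepared earlier; the DUN (data
+     * unit number, e.g. derived from the logical block number)
+     * selects the IV for each data unit in the bio */
+    u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = { lblk_num };
+
+    bio_crypt_set_ctx(bio, key, dun, GFP_NOFS);
+    submit_bio(bio);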
On supported filesystems (currently ext4 and f2fs), fscrypt can use
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index dacdbc1149e6..412cf11e3298 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -185,8 +185,7 @@ FS_IOC_ENABLE_VERITY can fail with the following errors:
- ``ENOKEY``: the ".fs-verity" keyring doesn't contain the certificate
needed to verify the builtin signature
- ``ENOPKG``: fs-verity recognizes the hash algorithm, but it's not
- available in the kernel's crypto API as currently configured (e.g.
- for SHA-512, missing CONFIG_CRYPTO_SHA512).
+ available in the kernel as currently configured
- ``ENOTTY``: this type of filesystem does not implement fs-verity
- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
support; or the filesystem superblock has not had the 'verity'
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst
index f2df9b6df988..0f7672676c0b 100644
--- a/Documentation/filesystems/iomap/design.rst
+++ b/Documentation/filesystems/iomap/design.rst
@@ -167,7 +167,6 @@ structure below:
struct dax_device *dax_dev;
void *inline_data;
void *private;
- const struct iomap_folio_ops *folio_ops;
u64 validity_cookie;
};
@@ -292,8 +291,6 @@ The fields are as follows:
<https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
This value will be passed unchanged to ``->iomap_end``.
- * ``folio_ops`` will be covered in the section on pagecache operations.
-
* ``validity_cookie`` is a magic freshness value set by the filesystem
that should be used to detect stale mappings.
For pagecache operations this is critical for correct operation
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 3b628e370d88..067ed8e14ef3 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -57,21 +57,19 @@ The following address space operations can be wrapped easily:
* ``bmap``
* ``swap_activate``
-``struct iomap_folio_ops``
+``struct iomap_write_ops``
--------------------------
-The ``->iomap_begin`` function for pagecache operations may set the
-``struct iomap::folio_ops`` field to an ops structure to override
-default behaviors of iomap:
-
.. code-block:: c
- struct iomap_folio_ops {
+ struct iomap_write_ops {
struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
unsigned len);
void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
struct folio *folio);
bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
+ int (*read_folio_range)(const struct iomap_iter *iter,
+ struct folio *folio, loff_t pos, size_t len);
};
iomap calls these functions:
@@ -127,6 +125,10 @@ iomap calls these functions:
``->iomap_valid``, then the iomap should be considered stale and the
validation failed.
+ - ``read_folio_range``: Called to synchronously read in the range that will
+ be written to. If this function is not provided, iomap will default to
+ submitting a bio read request.
+
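+A hedged sketch of supplying only the revalidation hook (the
+``myfs_*`` names and the sequence counter are illustrative, not a
+real filesystem):
+
+.. code-block:: c
+
+    static bool myfs_iomap_valid(struct inode *inode,
+                                 const struct iomap *iomap)
+    {
+        /* compare the cookie sampled at ->iomap_begin time with the
+         * inode's current mapping sequence number (hypothetical) */
+        return iomap->validity_cookie == READ_ONCE(MYFS_I(inode)->map_seq);
+    }
+
+    static const struct iomap_write_ops myfs_write_ops = {
+        .iomap_valid = myfs_iomap_valid,
+    };
+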
These ``struct kiocb`` flags are significant for buffered I/O with iomap:
* ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
@@ -271,7 +273,7 @@ writeback.
It does not lock ``i_rwsem`` or ``invalidate_lock``.
The dirty bit will be cleared for all folios run through the
-``->map_blocks`` machinery described below even if the writeback fails.
+``->writeback_range`` machinery described below even if the writeback fails.
This is to prevent dirty folio clots when storage devices fail; an
``-EIO`` is recorded for userspace to collect via ``fsync``.
@@ -283,15 +285,14 @@ The ``ops`` structure must be specified and is as follows:
.. code-block:: c
struct iomap_writeback_ops {
- int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
- loff_t offset, unsigned len);
- int (*submit_ioend)(struct iomap_writepage_ctx *wpc, int status);
- void (*discard_folio)(struct folio *folio, loff_t pos);
+ int (*writeback_range)(struct iomap_writepage_ctx *wpc,
+ struct folio *folio, u64 pos, unsigned int len, u64 end_pos);
+ int (*writeback_submit)(struct iomap_writepage_ctx *wpc, int error);
};
The fields are as follows:
- - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file
+ - ``writeback_range``: Sets ``wpc->iomap`` to the space mapping of the file
range (in bytes) given by ``pos`` and ``len``.
iomap calls this function for each dirty fs block in each dirty folio,
though it will `reuse mappings
@@ -306,27 +307,26 @@ The fields are as follows:
This revalidation must be open-coded by the filesystem; it is
unclear if ``iomap::validity_cookie`` can be reused for this
purpose.
- This function must be supplied by the filesystem.
-
- - ``submit_ioend``: Allows the file systems to hook into writeback bio
- submission.
- This might include pre-write space accounting updates, or installing
- a custom ``->bi_end_io`` function for internal purposes, such as
- deferring the ioend completion to a workqueue to run metadata update
- transactions from process context before submitting the bio.
- This function is optional.
- - ``discard_folio``: iomap calls this function after ``->map_blocks``
- fails to schedule I/O for any part of a dirty folio.
- The function should throw away any reservations that may have been
- made for the write.
+ If this method fails to schedule I/O for any part of a dirty folio, it
+ should throw away any reservations that may have been made for the write.
The folio will be marked clean and an ``-EIO`` recorded in the
pagecache.
Filesystems can use this callback to `remove
<https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
delalloc reservations to avoid having delalloc reservations for
clean pagecache.
- This function is optional.
+ This function must be supplied by the filesystem.
+
+ - ``writeback_submit``: Submit the previously built writeback context.
+ Block-based filesystems should use the ``iomap_ioend_writeback_submit``
+ helper; other filesystems can implement their own.
+ Filesystems can optionally hook into writeback bio submission.
+ This might include pre-write space accounting updates, or installing
+ a custom ``->bi_end_io`` function for internal purposes, such as
+ deferring the ioend completion to a workqueue to run metadata update
+ transactions from process context before submitting the bio.
+ This function must be supplied by the filesystem.
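+
+A hedged sketch of the overall shape (``myfs_map_blocks`` is
+hypothetical; check include/linux/iomap.h for the exact helper
+signatures):
+
+.. code-block:: c
+
+    static int myfs_writeback_range(struct iomap_writepage_ctx *wpc,
+            struct folio *folio, u64 pos, unsigned int len, u64 end_pos)
+    {
+        int error;
+
+        /* fill wpc->iomap with a (re)validated mapping for the range */
+        error = myfs_map_blocks(wpc, pos, len);
+        if (error)
+            return error;
+        /* hand the mapped range over to the generic ioend machinery */
+        return iomap_add_to_ioend(wpc, folio, pos, end_pos, len);
+    }
+
+    static const struct iomap_writeback_ops myfs_writeback_ops = {
+        .writeback_range  = myfs_writeback_range,
+        .writeback_submit = iomap_ioend_writeback_submit,
+    };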
Pagecache Writeback Completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -340,10 +340,9 @@ If the write failed, it will also set the error bits on the folios and
the address space.
This can happen in interrupt or process context, depending on the
storage device.
-
Filesystems that need to update internal bookkeeping (e.g. unwritten
-extent conversions) should provide a ``->submit_ioend`` function to
-set ``struct iomap_end::bio::bi_end_io`` to its own function.
+extent conversions) should set their own ``bi_end_io`` on the bios
+submitted by ``->writeback_submit``.
This function should call ``iomap_finish_ioends`` after finishing its
own work (e.g. unwritten extent conversion).
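+
+A hedged sketch of that pattern (the ``myfs_*`` names, lock, and work
+item are illustrative):
+
+.. code-block:: c
+
+    static void myfs_writeback_end_bio(struct bio *bio)
+    {
+        struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+        struct myfs_inode *mi = MYFS_I(ioend->io_inode);
+        unsigned long flags;
+
+        /* may run in interrupt context: park the ioend and let the
+         * worker run the metadata update, then iomap_finish_ioends() */
+        spin_lock_irqsave(&mi->ioend_lock, flags);
+        list_add(&ioend->io_list, &mi->ioend_list);
+        spin_unlock_irqrestore(&mi->ioend_lock, flags);
+        queue_work(myfs_ioend_wq, &mi->ioend_work);
+    }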
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 2e567e341c3b..aa287ccdac2f 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -87,8 +87,8 @@ prototypes::
int (*tmpfile) (struct mnt_idmap *, struct inode *,
struct file *, umode_t);
int (*fileattr_set)(struct mnt_idmap *idmap,
- struct dentry *dentry, struct fileattr *fa);
- int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+ struct dentry *dentry, struct file_kattr *fa);
+ int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa);
struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
@@ -253,10 +253,10 @@ prototypes::
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *folio);
void (*readahead)(struct readahead_control *);
- int (*write_begin)(struct file *, struct address_space *mapping,
+ int (*write_begin)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len,
struct folio **foliop, void **fsdata);
- int (*write_end)(struct file *, struct address_space *mapping,
+ int (*write_end)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 3616d7161dab..85f590254f07 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1224,9 +1224,6 @@ lookup_noperm_unlocked(), lookup_noperm_positive_unlocked(). They now
take a qstr instead of separate name and length. QSTR() can be used
when strlen() is needed for the length.
-For try_lookup_noperm() a reference to the qstr is passed in case the
-hash might subsequently be needed.
-
These functions no longer do any permission checking - they previously
checked that the caller has 'X' permission on the parent. They must
ONLY be used internally by a filesystem on itself when it knows that
@@ -1249,3 +1246,42 @@ Using try_lookup_noperm() will require linux/namei.h to be included.
Calling conventions for ->d_automount() have changed; we should *not* grab
an extra reference to new mount - it should be returned with refcount 1.
+
+---
+
+collect_mounts()/drop_collected_mounts()/iterate_mounts() are gone now.
+The replacement is collect_paths()/drop_collected_paths(), with no special
+iterator needed. Instead of a cloned mount tree, the new interface returns
+an array of struct path, one for each mount collect_mounts() would've
+created. These struct path entries point to the locations in the caller's
+namespace that would be the roots of the cloned mounts.
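+
+A hedged sketch of the new calling convention (the empty-path
+termination of the returned array is an assumption here - check the
+declaration for the authoritative contract)::
+
+	struct path prealloc[16], *paths;
+
+	paths = collect_paths(&mnt_path, prealloc, ARRAY_SIZE(prealloc));
+	if (IS_ERR(paths))
+		return PTR_ERR(paths);
+	for (struct path *p = paths; p->mnt; p++)
+		visit_location(p);	/* whatever iterate_mounts() did per mount */
+	drop_collected_paths(paths, prealloc);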
+
+---
+
+**mandatory**
+
+If your filesystem sets the default dentry_operations, use set_default_d_op()
+rather than manually setting sb->s_d_op.
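+
+A minimal sketch, typically from your ``fill_super()``::
+
+	set_default_d_op(sb, &myfs_dentry_ops);	/* rather than sb->s_d_op = ... */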
+
+---
+
+**mandatory**
+
+d_set_d_op() is no longer exported (or public, for that matter); _if_
+your filesystem really needed that, make use of d_splice_alias_ops()
+to have them set. Better yet, think hard whether you need different
+->d_op for different dentries - if not, just use set_default_d_op()
+at mount time and be done with that. Currently procfs is the only
+thing that really needs ->d_op varying between dentries.
+
+---
+
+**highly recommended**
+
+The file operations mmap() callback is deprecated in favour of
+mmap_prepare(). This passes a pointer to a vm_area_desc to the callback
+rather than a VMA, as the VMA at this stage is not yet valid.
+
+The vm_area_desc provides the minimum required information for a
+filesystem to initialise state upon memory mapping of a file-backed
+region, together with output parameters through which the filesystem
+sets this state.
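+
+A hedged sketch of the new callback (``myfs_vm_ops`` and the size check
+are illustrative; the fields used are from struct vm_area_desc)::
+
+	static int myfs_mmap_prepare(struct vm_area_desc *desc)
+	{
+		/* validate the request before any VMA exists */
+		if (desc->end - desc->start > MYFS_MAX_MAP_SIZE)
+			return -EINVAL;
+		desc->vm_ops = &myfs_vm_ops;	/* output parameter */
+		return 0;
+	}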
diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt
new file mode 100644
index 000000000000..c90349e5b889
--- /dev/null
+++ b/Documentation/filesystems/propagate_umount.txt
@@ -0,0 +1,484 @@
+ Notes on propagate_umount()
+
+Umount propagation starts with a set of mounts we are already going to
+take out. Ideally, we would like to add all downstream cognates to
+that set - anything with the same mountpoint as one of the removed
+mounts and with a parent that would receive events from the parent of that
+mount. However, there are some constraints the resulting set must
+satisfy.
+
+It is convenient to define several properties of sets of mounts:
+
+1) A set S of mounts is non-shifting if for any mount X belonging
+to S all subtrees mounted strictly inside of X (i.e. not overmounting
+the root of X) contain only elements of S.
+
+2) A set S is non-revealing if all locked mounts that belong to S have
+parents that also belong to S.
+
+3) A set S is closed if it contains all children of its elements.
+
+The set of mounts taken out by umount(2) must be non-shifting and
+non-revealing; the first constraint is what allows us to reparent
+any remaining mounts and the second is what prevents the exposure
+of any concealed mountpoints.
+
+propagate_umount() takes the original set as an argument and tries to
+extend that set. The original set is a full subtree and its root is
+unlocked; what matters is that it's closed and non-revealing.
+The resulting set may not be closed; there might still be mounts outside
+of that set, but only on top of stacks of root-overmounting elements
+of the set. They can be reparented to the place where the bottom of the
+stack is attached to a mount that will survive. NOTE: doing that
+will violate a constraint on having no more than one mount with
+the same parent/mountpoint pair; however, the caller (umount_tree())
+will immediately remedy that - it may keep an unmounted element attached
+to its parent, but only if the parent itself is unmounted. Since all
+conflicts created by reparenting have common parent *not* in the
+set and one side of the conflict (bottom of the stack of overmounts)
+is in the set, it will be resolved. However, we rely upon umount_tree()
+doing that pretty much immediately after the call of propagate_umount().
+
+The algorithm is based on two statements:
+ 1) for any set S, there is a maximal non-shifting subset of S
+and it can be calculated in O(#S) time.
+ 2) for any non-shifting set S, there is a maximal non-revealing
+subset of S. That subset is also non-shifting and it can be calculated
+in O(#S) time.
+
+ Finding candidates.
+
+We are given a closed set U and we want to find all mounts that have
+the same mountpoint as some mount m in U *and* whose parent receives
+propagation from the parent of the same mount m. A naive implementation
+would be
+ S = {}
+ for each m in U
+ add m to S
+ p = parent(m)
+ for each q in Propagation(p) - {p}
+ child = look_up(q, mountpoint(m))
+ if child
+ add child to S
+but that can lead to excessive work - there might be propagation among the
+subtrees of U, in which case we'd end up examining the same candidates
+many times. Since propagation is transitive, the same will happen to
+everything downstream of that candidate and it's not hard to construct
+cases where the approach above leads to time quadratic in the actual
+number of candidates.
+
+Note that if we run into a candidate we'd already seen, it must've been
+added on an earlier iteration of the outer loop - all additions made
+during one iteration of the outer loop have different parents. So
+if we find a child already added to the set, we know that everything
+in Propagation(parent(child)) with the same mountpoint has been already
+added.
+ S = {}
+ for each m in U
+ if m in S
+ continue
+ add m to S
+ p = parent(m)
+ q = propagation_next(p, p)
+ while q
+ child = look_up(q, mountpoint(m))
+ if child
+ if child in S
+ q = skip_them(q, p)
+ continue;
+ add child to S
+ q = propagation_next(q, p)
+where
+skip_them(q, p)
+ keep walking Propagation(p) from q until we find something
+ not in Propagation(q)
+
+would get rid of that problem, but we need a sane implementation of
+skip_them(). That's not hard to do - split propagation_next() into
+"down into mnt_slave_list" and "forward-and-up" parts, with the
+skip_them() being "repeat the forward-and-up part until we get NULL
+or something that isn't a peer of the one we are skipping".
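+
+In the same pseudocode, with forward_and_up() standing for that second
+part of propagation_next():
+
+skip_them(q, p)
+	original = q
+	do
+		q = forward_and_up(q, p)
+	while q is not NULL and q is a peer of original
+	return q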
+
+Note that there can be no absolute roots among the extra candidates -
+they all come from mount lookups. An absolute root among the original
+set is _currently_ impossible, but it might be worth protecting
+against.
+
+ Maximal non-shifting subsets.
+
+Let's call a mount m in a set S forbidden in that set if there is a
+subtree mounted strictly inside m and containing mounts that do not
+belong to S.
+
+The set is non-shifting when none of its elements are forbidden in it.
+
+If mount m is forbidden in a set S, it is forbidden in any subset S' it
+belongs to. In other words, it can't belong to any of the non-shifting
+subsets of S. If we had a way to find a forbidden mount or show that
+there's none, we could use it to find the maximal non-shifting subset
+simply by finding and removing them until none remain.
+
+Suppose mount m is forbidden in S; then any mounts forbidden in S - {m}
+must have been forbidden in S itself. Indeed, since m has descendants
+that do not belong to S, any subtree that fits into S will fit into
+S - {m} as well.
+
+So in principle we could go through elements of S, checking if they
+are forbidden in S and removing the ones that are. Removals will
+not invalidate the checks done for earlier mounts - if they were not
+forbidden at the time we checked, they won't become forbidden later.
+It's too costly to be practical, but there is a similar approach that
+is linear in the size of S.
+
+Let's say that mount x in a set S is forbidden by mount y, if
+ * both x and y belong to S.
+ * there is a chain of mounts starting at x and leaving S
+ immediately after passing through y, with the first
+ mountpoint strictly inside x.
+Note 1: x may be equal to y - that's the case when something not
+belonging to S is mounted strictly inside x.
+Note 2: if y does not belong to S, it can't forbid anything in S.
+Note 3: if y has no children outside of S, it can't forbid anything in S.
+
+It's easy to show that mount x is forbidden in S if and only if x is
+forbidden in S by some mount y. And it's easy to find all mounts in S
+forbidden by a given mount.
+
+Consider the following operation:
+ Trim(S, m) = S - {x : x is forbidden by m in S}
+
+Note that if m does not belong to S or has no children outside of S we
+are guaranteed that Trim(S, m) is equal to S.
+
+The following is true: if x is forbidden by y in Trim(S, m), it was
+already forbidden by y in S.
+
+Proof: Suppose x is forbidden by y in Trim(S, m). Then there is a
+chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1}
+is the first element that doesn't belong to Trim(S, m) and the
+mountpoint of x_1 is strictly inside x. If mount r belongs to S, it must
+have been removed by Trim(S, m), i.e. it was forbidden in S by m.
+Then there was a mount chain from r to some child of m that stayed in
+S all the way until m, but that's impossible since x belongs to Trim(S, m)
+and prepending (x_0, ..., x_k) to that chain demonstrates that x is also
+forbidden in S by m, and thus can't belong to Trim(S, m).
+Therefore r can not belong to S and our chain demonstrates that
+x is forbidden by y in S. QED.
+
+Corollary: no mount is forbidden by m in Trim(S, m). Indeed, any
+such mount would have been forbidden by m in S and thus would have been
+in the part of S removed in Trim(S, m).
+
+Corollary: no mount is forbidden by m in Trim(Trim(S, m), n). Indeed,
+any such mount would have to have been forbidden by m in Trim(S, m), which
+is impossible.
+
+Corollary: after
+ S = Trim(S, x_1)
+ S = Trim(S, x_2)
+ ...
+ S = Trim(S, x_k)
+no mount remaining in S will be forbidden by either of x_1,...,x_k.
+
+The following will reduce S to its maximal non-shifting subset:
+ visited = {}
+ while S contains elements not belonging to visited
+ let m be an arbitrary such element of S
+ S = Trim(S, m)
+ add m to visited
+
+S never grows, so the number of elements of S not belonging to visited
+decreases at least by one on each iteration. When the loop terminates,
+all mounts remaining in S belong to visited. It's easy to see that at
+the beginning of each iteration no mount remaining in S will be forbidden
+by any element of visited. In other words, no mount remaining in S will
+be forbidden, i.e. final value of S will be non-shifting. It will be
+the maximal non-shifting subset, since we were removing only forbidden
+elements.
+
+ There are two difficulties in implementing the above in linear
+time, both due to the fact that Trim() might need to remove more than one
+element. Naive implementation of Trim() is vulnerable to running into a
+long chain of mounts, each mounted on top of parent's root. Nothing in
+that chain is forbidden, so nothing gets removed from it. We need to
+recognize such chains and avoid walking them again on subsequent calls of
+Trim(), otherwise we will end up with worst-case time quadratic in
+the number of elements in S. Another difficulty is in implementing the
+outer loop - we need to iterate through all elements of a shrinking set.
+That would be trivial if we never removed more than one element at a time
+(linked list, with list_for_each_entry_safe for iterator), but we may
+need to remove more than one entry, possibly including the ones we have
+already visited.
+
+ Let's start with a naive algorithm for Trim():
+
+Trim_one(m)
+ found = false
+ for each n in children(m)
+ if n not in S
+ found = true
+ if (mountpoint(n) != root(m))
+ remove m from S
+ break
+ if found
+ Trim_ancestors(m)
+
+Trim_ancestors(m)
+ for (; parent(m) in S; m = parent(m)) {
+ if (mountpoint(m) != root(parent(m)))
+ remove parent(m) from S
+ }
+
+If m belongs to S, Trim_one(m) will replace S with Trim(S, m).
+Proof:
+ Consider the chains excluding elements from Trim(S, m). The last
+two elements in such a chain are m and some child of m that does not belong
+to S. If m has no such children, Trim(S, m) is equal to S.
+ m itself is removed if and only if the chain has exactly two
+elements, i.e. when the last element does not overmount the root of m.
+In other words, that happens when m has a child not in S that does not
+overmount the root of m.
+ All other elements to remove will be ancestors of m, such that
+the entire descent chain from them to m is contained in S. Let
+(x_0, x_1, ..., x_k = m) be the longest such chain. x_i needs to be
+removed if and only if x_{i+1} does not overmount its root. It's easy
+to see that Trim_ancestors(m) will iterate through that chain from
+x_k to x_1 and that it will remove exactly the elements that need to be
+removed.
+
+ Note that if the loop in Trim_ancestors() walks into an already
+visited element, we are guaranteed that remaining iterations will see
+only elements that had already been visited and remove none of them.
+That's the weakness that makes it vulnerable to long chains of full
+overmounts.
+
+ It's easy to deal with, if we can afford setting marks on
+elements of S; we would mark all elements already visited by
+Trim_ancestors() and have it bail out as soon as it sees an already
+marked element.
+
+ The problems with iterating through the set can be dealt with in
+several ways, depending upon the representation we choose for our set.
+One useful observation is that we are given a closed subset in S - the
+original set passed to propagate_umount(). Its elements can neither
+forbid anything nor be forbidden by anything - all their descendants
+belong to S, so they can not occur anywhere in any excluding chain.
+In other words, the elements of that subset will remain in S until
+the end and Trim_one(m) is a no-op for all m from that subset.
+
+ That suggests keeping S as a disjoint union of a closed set U
+('will be unmounted, no matter what') and the set of all elements of
+S that do not belong to U. That set ('candidates') is all we need
+to iterate through. Let's represent it as a subset in a cyclic list,
+consisting of all list elements that are marked as candidates (initially -
+all of them). Then we could have Trim_ancestors() only remove the mark,
+leaving the elements on the list. Then Trim_one() would never remove
+anything other than its argument from the containing list, allowing us
+to use list_for_each_entry_safe() as the iterator.
+
+ Assuming that representation we get the following:
+
+ list_for_each_entry_safe(m, ..., Candidates, ...)
+ Trim_one(m)
+where
+Trim_one(m)
+ if (m is not marked as a candidate)
+ strip the "seen by Trim_ancestors" mark from m
+ remove m from the Candidates list
+ return
+
+ remove_this = false
+ found = false
+ for each n in children(m)
+ if n not in S
+ found = true
+ if (mountpoint(n) != root(m))
+ remove_this = true
+ break
+ if found
+ Trim_ancestors(m)
+ if remove_this
+ strip the "seen by Trim_ancestors" mark from m
+ strip the "candidate" mark from m
+ remove m from the Candidates list
+
+Trim_ancestors(m)
+ for (p = parent(m); p is marked as candidate ; m = p, p = parent(p)) {
+ if m is marked as seen by Trim_ancestors
+ return
+ mark m as seen by Trim_ancestors
+ if (mountpoint(m) != root(p))
+ strip the "candidate" mark from p
+ }
+
+ The terminating condition in the loop in Trim_ancestors() is correct,
+since that loop will never run into p belonging to U - p is always
+an ancestor of the argument of Trim_one() and, since U is closed, the
+argument of Trim_one() would also have to belong to U. But Trim_one()
+is never called for elements of U. In other words, p belongs to S if
+and only if it belongs to candidates.
+
+ Time complexity:
+* we get no more than O(#S) calls of Trim_one()
+* the loop over children in Trim_one() never looks at the same child
+twice through all the calls.
+* iterations of that loop for children in S are no more than O(#S)
+in the worst case
+* at most two children that are not elements of S are considered per
+call of Trim_one().
+* the loop in Trim_ancestors() sets its mark once per iteration and
+no element of S has its mark set more than once.
+
+ In the end we may have some elements excluded from S by
+Trim_ancestors() still stuck on the list. We could do a separate
+loop removing them from the list (also no worse than O(#S) time),
+but it's easier to leave that until the next phase - there we will
+iterate through the candidates anyway.
+
+ The caller has already removed all elements of U from their parents'
+lists of children, which means that checking if child belongs to S is
+equivalent to checking if it's marked as a candidate; we'll never see
+the elements of U in the loop over children in Trim_one().
+
+ What's more, if we see that children(m) is empty and m is not
+locked, we can immediately move m into the committed subset (remove
+from the parent's list of children, etc.). That's one fewer mount we'll
+have to look into when we check the list of children of its parent *and*
+when we get to building the non-revealing subset.
+
+ Maximal non-revealing subsets
+
+If S is not a non-revealing subset, there is a locked element x in S
+such that parent of x is not in S.
+
+Obviously, no non-revealing subset of S may contain x. Removing such
+elements one by one will obviously end with the maximal non-revealing
+subset (possibly an empty one). Note that removal of an element will
+require removal of all its locked children, etc.
+
+If the set had been non-shifting, it will remain non-shifting after
+such removals.
+Proof: suppose S was non-shifting, x is a locked element of S, parent of x
+is not in S and S - {x} is not non-shifting. Then there is an element m
+in S - {x} and a subtree mounted strictly inside m, such that the
+subtree contains an element not in S - {x}. Since S is non-shifting,
+everything in that subtree must belong to S. But that means that this
+subtree must contain x somewhere *and* that the parent of x either
+belongs to that subtree or is equal to m. Either way it must belong
+to S. Contradiction.
+
+// same representation as for finding maximal non-shifting subsets:
+// S is a disjoint union of a non-revealing set U (the ones we are committed
+// to unmount) and a set of candidates, represented as a subset of list
+// elements that have "is a candidate" mark on them.
+// Elements of U are removed from their parents' lists of children.
+// In the end candidates becomes empty and maximal non-revealing non-shifting
+// subset of S is now in U
+ while (Candidates list is non-empty)
+ handle_locked(first(Candidates))
+
+handle_locked(m)
+ if m is not marked as a candidate
+ strip the "seen by Trim_ancestors" mark from m
+ remove m from the list
+ return
+ cutoff = m
+ for (p = m; p in candidates; p = parent(p)) {
+ strip the "seen by Trim_ancestors" mark from p
+ strip the "candidate" mark from p
+ remove p from the Candidates list
+ if (!locked(p))
+ cutoff = parent(p)
+ }
+ if p in U
+ cutoff = p
+ while m != cutoff
+ remove m from children(parent(m))
+ add m to U
+ m = parent(m)
+
+Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S.
+* If it contains some elements of U, let x_k be the last one of those.
+Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing.
+* otherwise if all its elements are locked, then none of {x_0, ..., x_n}
+may be elements of a non-revealing subset of S.
+* otherwise let x_k be the first unlocked element of the chain. Then none
+of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of
+S and union of U and {x_k, ..., x_n} is non-revealing.
+
+handle_locked(m) finds which of these cases applies and adjusts Candidates
+and U accordingly. U remains non-revealing, union of Candidates and
+U still contains any non-revealing subset of S and after the call of
+handle_locked(m) m is guaranteed to be not in Candidates list. So having
+it called for each element of S would suffice to empty Candidates,
+leaving U the maximal non-revealing subset of S.
+
+However, handle_locked(m) is a no-op when m belongs to U, so it's enough
+to have it called for elements of Candidates list until none remain.
+
+Time complexity: number of calls of handle_locked() is limited by
+#Candidates, each iteration of the first loop in handle_locked() removes
+an element from the list, so their total number of executions is also
+limited by #Candidates; number of iterations in the second loop is no
+greater than the number of iterations of the first loop.
+
+
+ Reparenting
+
+After we've calculated the final set, we still need to deal with
+reparenting - if an element of the final set has a child not in it,
+we need to reparent such child.
+
+Such children can only be root-overmounting (otherwise the set wouldn't
+be non-shifting) and their parents can not belong to the original set,
+since the original is guaranteed to be closed.
+
+
+ Putting all of that together
+
+The plan is to
+ * find all candidates
+ * trim down to maximal non-shifting subset
+ * trim down to maximal non-revealing subset
+ * reparent anything that needs to be reparented
+ * return the resulting set to the caller
+
+For the 2nd and 3rd steps we want to separate the set into a growing
+non-revealing subset, initially containing the original set ("U" in
+terms of the pseudocode above) and everything we are still not sure about
+("candidates"). It means that for the output of the 1st step we'd like
+the extra candidates separated from the stuff already in the original set.
+For the 4th step we would like the additions to U separate from the
+original set.
+
+So let's go for
+ * original set ("set"). Linkage via mnt_list
+ * undecided candidates ("candidates"). Subset of a list,
+consisting of all its elements marked with a new flag (T_UMOUNT_CANDIDATE).
+Initially all elements of the list will be marked that way; in the
+end the list will become empty and no mounts will remain marked with
+that flag.
+ * Reuse T_MARKED for "has been already seen by trim_ancestors()".
+ * anything in U that hadn't been in the original set - elements of
+candidates will gradually be either discarded or moved there. In other
+words, it's the candidates we have already decided to unmount. Its role
+is reasonably close to the old "to_umount", so let's use that name.
+Linkage via mnt_list.
+
+For gather_candidates() we'll need to maintain both candidates (S -
+set) and intersection of S with set. Use T_UMOUNT_CANDIDATE for
+all elements we encounter, putting the ones not already in the original
+set into the list of candidates. When we are done, strip that flag from
+all elements of the original set. That gives a cheap way to check
+if an element belongs to S (in gather_candidates) and to candidates
+itself (at later stages). Call that predicate is_candidate(); it would
+be m->mnt_t_flags & T_UMOUNT_CANDIDATE.
+
+All elements of the original set are marked with MNT_UMOUNT and we'll
+need the same for elements added when joining the contents of to_umount
+to set in the end. Let's set MNT_UMOUNT at the time we add an element
+to to_umount; that's close to what the old 'umount_one' is doing, so
+let's keep that name. It also gives us another predicate we need -
+"belongs to union of set and to_umount"; will_be_unmounted() for now.
+
+Removals from the candidates list should strip both T_MARKED and
+T_UMOUNT_CANDIDATE; call it remove_from_candidates_list().
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index fd32a9a17bfb..486a91633474 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -515,8 +515,8 @@ As of kernel 2.6.22, the following members are defined:
struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
int (*fileattr_set)(struct mnt_idmap *idmap,
- struct dentry *dentry, struct fileattr *fa);
- int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+ struct dentry *dentry, struct file_kattr *fa);
+ int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa);
struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
};
@@ -758,8 +758,9 @@ process is more complicated and uses write_begin/write_end or
dirty_folio to write data into the address_space, and
writepages to writeback data to storage.
-Adding and removing pages to/from an address_space is protected by the
-inode's i_mutex.
+Removing pages from an address_space requires holding the inode's i_rwsem
+exclusively, while adding pages to the address_space requires holding the
+inode's i_mapping->invalidate_lock exclusively.
When data is written to a page, the PG_Dirty flag should be set. It
typically remains set until writepages asks for it to be written. This
@@ -822,10 +823,10 @@ cache in your filesystem. The following members are defined:
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *);
void (*readahead)(struct readahead_control *);
- int (*write_begin)(struct file *, struct address_space *mapping,
+ int (*write_begin)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len,
- struct page **pagep, void **fsdata);
- int (*write_end)(struct file *, struct address_space *mapping,
+ struct page **pagep, void **fsdata);
+ int (*write_end)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
@@ -1071,12 +1072,14 @@ This describes how the VFS can manipulate an open file. As of kernel
struct file_operations {
struct module *owner;
+ fop_flags_t fop_flags;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
- int (*iopoll)(struct kiocb *kiocb, bool spin);
+ int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *,
+ unsigned int flags);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -1093,18 +1096,24 @@ This describes how the VFS can manipulate an open file. As of kernel
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
- int (*setlease)(struct file *, long, struct file_lock **, void **);
+ void (*splice_eof)(struct file *file);
+ int (*setlease)(struct file *, int, struct file_lease **, void **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len);
void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
#endif
- ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
+ ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
+ loff_t, size_t, unsigned int);
loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
+ int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
+ int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
+ unsigned int poll_flags);
+ int (*mmap_prepare)(struct vm_area_desc *);
};
Again, all methods are called without any locks being held, unless
@@ -1144,7 +1153,8 @@ otherwise noted.
used on 64 bit kernels.
``mmap``
- called by the mmap(2) system call
+ called by the mmap(2) system call. Deprecated in favour of
+ ``mmap_prepare``.
``open``
called by the VFS when an inode should be opened. When the VFS
@@ -1221,6 +1231,11 @@ otherwise noted.
``fadvise``
possibly called by the fadvise64() system call.
+``mmap_prepare``
+ Called by the mmap(2) system call. Allows a VFS to set up a
+ file-backed memory mapping, most notably establishing relevant
+ private state and VMA callbacks.
+
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special