diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/iomap/design.rst | 3 | ||||
-rw-r--r-- | Documentation/filesystems/iomap/operations.rst | 57 | ||||
-rw-r--r-- | Documentation/filesystems/locking.rst | 8 | ||||
-rw-r--r-- | Documentation/filesystems/porting.rst | 42 | ||||
-rw-r--r-- | Documentation/filesystems/propagate_umount.txt | 484 | ||||
-rw-r--r-- | Documentation/filesystems/vfs.rst | 37 |
6 files changed, 581 insertions, 50 deletions
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst index f2df9b6df988..0f7672676c0b 100644 --- a/Documentation/filesystems/iomap/design.rst +++ b/Documentation/filesystems/iomap/design.rst @@ -167,7 +167,6 @@ structure below: struct dax_device *dax_dev; void *inline_data; void *private; - const struct iomap_folio_ops *folio_ops; u64 validity_cookie; }; @@ -292,8 +291,6 @@ The fields are as follows: <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_. This value will be passed unchanged to ``->iomap_end``. - * ``folio_ops`` will be covered in the section on pagecache operations. - * ``validity_cookie`` is a magic freshness value set by the filesystem that should be used to detect stale mappings. For pagecache operations this is critical for correct operation diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst index 3b628e370d88..067ed8e14ef3 100644 --- a/Documentation/filesystems/iomap/operations.rst +++ b/Documentation/filesystems/iomap/operations.rst @@ -57,21 +57,19 @@ The following address space operations can be wrapped easily: * ``bmap`` * ``swap_activate`` -``struct iomap_folio_ops`` +``struct iomap_write_ops`` -------------------------- -The ``->iomap_begin`` function for pagecache operations may set the -``struct iomap::folio_ops`` field to an ops structure to override -default behaviors of iomap: - .. code-block:: c - struct iomap_folio_ops { + struct iomap_write_ops { struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos, unsigned len); void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied, struct folio *folio); bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap); + int (*read_folio_range)(const struct iomap_iter *iter, + struct folio *folio, loff_t pos, size_t len); }; iomap calls these functions: @@ -127,6 +125,10 @@ iomap calls these functions: ``->iomap_valid``, then the iomap should considered stale and the validation failed. + - ``read_folio_range``: Called to synchronously read in the range that will + be written to. If this function is not provided, iomap will default to + submitting a bio read request. + These ``struct kiocb`` flags are significant for buffered I/O with iomap: * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``. @@ -271,7 +273,7 @@ writeback. It does not lock ``i_rwsem`` or ``invalidate_lock``. The dirty bit will be cleared for all folios run through the -``->map_blocks`` machinery described below even if the writeback fails. +``->writeback_range`` machinery described below even if the writeback fails. This is to prevent dirty folio clots when storage devices fail; an ``-EIO`` is recorded for userspace to collect via ``fsync``. @@ -283,15 +285,14 @@ The ``ops`` structure must be specified and is as follows: .. code-block:: c struct iomap_writeback_ops { - int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode, - loff_t offset, unsigned len); - int (*submit_ioend)(struct iomap_writepage_ctx *wpc, int status); - void (*discard_folio)(struct folio *folio, loff_t pos); + int (*writeback_range)(struct iomap_writepage_ctx *wpc, + struct folio *folio, u64 pos, unsigned int len, u64 end_pos); + int (*writeback_submit)(struct iomap_writepage_ctx *wpc, int error); }; The fields are as follows: - - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file + - ``writeback_range``: Sets ``wpc->iomap`` to the space mapping of the file range (in bytes) given by ``offset`` and ``len``. iomap calls this function for each dirty fs block in each dirty folio, though it will `reuse mappings @@ -306,27 +307,26 @@ The fields are as follows: This revalidation must be open-coded by the filesystem; it is unclear if ``iomap::validity_cookie`` can be reused for this purpose. - This function must be supplied by the filesystem. - - - ``submit_ioend``: Allows the file systems to hook into writeback bio - submission. - This might include pre-write space accounting updates, or installing - a custom ``->bi_end_io`` function for internal purposes, such as - deferring the ioend completion to a workqueue to run metadata update - transactions from process context before submitting the bio. - This function is optional. - - ``discard_folio``: iomap calls this function after ``->map_blocks`` - fails to schedule I/O for any part of a dirty folio. - The function should throw away any reservations that may have been - made for the write. + If this methods fails to schedule I/O for any part of a dirty folio, it + should throw away any reservations that may have been made for the write. The folio will be marked clean and an ``-EIO`` recorded in the pagecache. Filesystems can use this callback to `remove <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_ delalloc reservations to avoid having delalloc reservations for clean pagecache. - This function is optional. + This function must be supplied by the filesystem. + + - ``writeback_submit``: Submit the previous built writeback context. + Block based file systems should use the iomap_ioend_writeback_submit + helper, other file system can implement their own. + File systems can optionall to hook into writeback bio submission. + This might include pre-write space accounting updates, or installing + a custom ``->bi_end_io`` function for internal purposes, such as + deferring the ioend completion to a workqueue to run metadata update + transactions from process context before submitting the bio. + This function must be supplied by the filesystem. Pagecache Writeback Completion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -340,10 +340,9 @@ If the write failed, it will also set the error bits on the folios and the address space. This can happen in interrupt or process context, depending on the storage device. - Filesystems that need to update internal bookkeeping (e.g. unwritten -extent conversions) should provide a ``->submit_ioend`` function to -set ``struct iomap_end::bio::bi_end_io`` to its own function. +extent conversions) should set their own bi_end_io on the bios +submitted by ``->submit_writeback`` This function should call ``iomap_finish_ioends`` after finishing its own work (e.g. unwritten extent conversion). diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 2e567e341c3b..aa287ccdac2f 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -87,8 +87,8 @@ prototypes:: int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t); int (*fileattr_set)(struct mnt_idmap *idmap, - struct dentry *dentry, struct fileattr *fa); - int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa); + struct dentry *dentry, struct file_kattr *fa); + int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa); struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int); struct offset_ctx *(*get_offset_ctx)(struct inode *inode); @@ -253,10 +253,10 @@ prototypes:: int (*writepages)(struct address_space *, struct writeback_control *); bool (*dirty_folio)(struct address_space *, struct folio *folio); void (*readahead)(struct readahead_control *); - int (*write_begin)(struct file *, struct address_space *mapping, + int (*write_begin)(const struct kiocb *, struct address_space *mapping, loff_t pos, unsigned len, struct folio **foliop, void **fsdata); - int (*write_end)(struct file *, struct address_space *mapping, + int (*write_end)(const struct kiocb *, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, struct folio *folio, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 3616d7161dab..85f590254f07 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -1224,9 +1224,6 @@ lookup_noperm_unlocked(), lookup_noperm_positive_unlocked(). They now take a qstr instead of separate name and length. QSTR() can be used when strlen() is needed for the length. -For try_lookup_noperm() a reference to the qstr is passed in case the -hash might subsequently be needed. - These function no longer do any permission checking - they previously checked that the caller has 'X' permission on the parent. They must ONLY be used internally by a filesystem on itself when it knows that @@ -1249,3 +1246,42 @@ Using try_lookup_noperm() will require linux/namei.h to be included. Calling conventions for ->d_automount() have changed; we should *not* grab an extra reference to new mount - it should be returned with refcount 1. + +--- + +collect_mounts()/drop_collected_mounts()/iterate_mounts() are gone now. +Replacement is collect_paths()/drop_collected_path(), with no special +iterator needed. Instead of a cloned mount tree, the new interface returns +an array of struct path, one for each mount collect_mounts() would've +created. These struct path point to locations in the caller's namespace +that would be roots of the cloned mounts. + +--- + +**mandatory** + +If your filesystem sets the default dentry_operations, use set_default_d_op() +rather than manually setting sb->s_d_op. + +--- + +**mandatory** + +d_set_d_op() is no longer exported (or public, for that matter); _if_ +your filesystem really needed that, make use of d_splice_alias_ops() +to have them set. Better yet, think hard whether you need different +->d_op for different dentries - if not, just use set_default_d_op() +at mount time and be done with that. Currently procfs is the only +thing that really needs ->d_op varying between dentries. + +--- + +**highly recommended** + +The file operations mmap() callback is deprecated in favour of +mmap_prepare(). This passes a pointer to a vm_area_desc to the callback +rather than a VMA, as the VMA at this stage is not yet valid. + +The vm_area_desc provides the minimum required information for a filesystem +to initialise state upon memory mapping of a file-backed region, and output +parameters for the file system to set this state. diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt new file mode 100644 index 000000000000..c90349e5b889 --- /dev/null +++ b/Documentation/filesystems/propagate_umount.txt @@ -0,0 +1,484 @@ + Notes on propagate_umount() + +Umount propagation starts with a set of mounts we are already going to +take out. Ideally, we would like to add all downstream cognates to +that set - anything with the same mountpoint as one of the removed +mounts and with parent that would receive events from the parent of that +mount. However, there are some constraints the resulting set must +satisfy. + +It is convenient to define several properties of sets of mounts: + +1) A set S of mounts is non-shifting if for any mount X belonging +to S all subtrees mounted strictly inside of X (i.e. not overmounting +the root of X) contain only elements of S. + +2) A set S is non-revealing if all locked mounts that belong to S have +parents that also belong to S. + +3) A set S is closed if it contains all children of its elements. + +The set of mounts taken out by umount(2) must be non-shifting and +non-revealing; the first constraint is what allows to reparent +any remaining mounts and the second is what prevents the exposure +of any concealed mountpoints. + +propagate_umount() takes the original set as an argument and tries to +extend that set. The original set is a full subtree and its root is +unlocked; what matters is that it's closed and non-revealing. +Resulting set may not be closed; there might still be mounts outside +of that set, but only on top of stacks of root-overmounting elements +of set. They can be reparented to the place where the bottom of +stack is attached to a mount that will survive. NOTE: doing that +will violate a constraint on having no more than one mount with +the same parent/mountpoint pair; however, the caller (umount_tree()) +will immediately remedy that - it may keep unmounted element attached +to parent, but only if the parent itself is unmounted. Since all +conflicts created by reparenting have common parent *not* in the +set and one side of the conflict (bottom of the stack of overmounts) +is in the set, it will be resolved. However, we rely upon umount_tree() +doing that pretty much immediately after the call of propagate_umount(). + +Algorithm is based on two statements: + 1) for any set S, there is a maximal non-shifting subset of S +and it can be calculated in O(#S) time. + 2) for any non-shifting set S, there is a maximal non-revealing +subset of S. That subset is also non-shifting and it can be calculated +in O(#S) time. + + Finding candidates. + +We are given a closed set U and we want to find all mounts that have +the same mountpoint as some mount m in U *and* whose parent receives +propagation from the parent of the same mount m. Naive implementation +would be + S = {} + for each m in U + add m to S + p = parent(m) + for each q in Propagation(p) - {p} + child = look_up(q, mountpoint(m)) + if child + add child to S +but that can lead to excessive work - there might be propagation among the +subtrees of U, in which case we'd end up examining the same candidates +many times. Since propagation is transitive, the same will happen to +everything downstream of that candidate and it's not hard to construct +cases where the approach above leads to the time quadratic by the actual +number of candidates. + +Note that if we run into a candidate we'd already seen, it must've been +added on an earlier iteration of the outer loop - all additions made +during one iteration of the outer loop have different parents. So +if we find a child already added to the set, we know that everything +in Propagation(parent(child)) with the same mountpoint has been already +added. + S = {} + for each m in U + if m in S + continue + add m to S + p = parent(m) + q = propagation_next(p, p) + while q + child = look_up(q, mountpoint(m)) + if child + if child in S + q = skip_them(q, p) + continue; + add child to S + q = propagation_next(q, p) +where +skip_them(q, p) + keep walking Propagation(p) from q until we find something + not in Propagation(q) + +would get rid of that problem, but we need a sane implementation of +skip_them(). That's not hard to do - split propagation_next() into +"down into mnt_slave_list" and "forward-and-up" parts, with the +skip_them() being "repeat the forward-and-up part until we get NULL +or something that isn't a peer of the one we are skipping". + +Note that there can be no absolute roots among the extra candidates - +they all come from mount lookups. Absolute root among the original +set is _currently_ impossible, but it might be worth protecting +against. + + Maximal non-shifting subsets. + +Let's call a mount m in a set S forbidden in that set if there is a +subtree mounted strictly inside m and containing mounts that do not +belong to S. + +The set is non-shifting when none of its elements are forbidden in it. + +If mount m is forbidden in a set S, it is forbidden in any subset S' it +belongs to. In other words, it can't belong to any of the non-shifting +subsets of S. If we had a way to find a forbidden mount or show that +there's none, we could use it to find the maximal non-shifting subset +simply by finding and removing them until none remain. + +Suppose mount m is forbidden in S; then any mounts forbidden in S - {m} +must have been forbidden in S itself. Indeed, since m has descendents +that do not belong to S, any subtree that fits into S will fit into +S - {m} as well. + +So in principle we could go through elements of S, checking if they +are forbidden in S and removing the ones that are. Removals will +not invalidate the checks done for earlier mounts - if they were not +forbidden at the time we checked, they won't become forbidden later. +It's too costly to be practical, but there is a similar approach that +is linear by size of S. + +Let's say that mount x in a set S is forbidden by mount y, if + * both x and y belong to S. + * there is a chain of mounts starting at x and leaving S + immediately after passing through y, with the first + mountpoint strictly inside x. +Note 1: x may be equal to y - that's the case when something not +belonging to S is mounted strictly inside x. +Note 2: if y does not belong to S, it can't forbid anything in S. +Note 3: if y has no children outside of S, it can't forbid anything in S. + +It's easy to show that mount x is forbidden in S if and only if x is +forbidden in S by some mount y. And it's easy to find all mounts in S +forbidden by a given mount. + +Consider the following operation: + Trim(S, m) = S - {x : x is forbidden by m in S} + +Note that if m does not belong to S or has no children outside of S we +are guaranteed that Trim(S, m) is equal to S. + +The following is true: if x is forbidden by y in Trim(S, m), it was +already forbidden by y in S. + +Proof: Suppose x is forbidden by y in Trim(S, m). Then there is a +chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1} +is the first element that doesn't belong to Trim(S, m) and the +mountpoint of x_1 is strictly inside x. If mount r belongs to S, it must +have been removed by Trim(S, m), i.e. it was forbidden in S by m. +Then there was a mount chain from r to some child of m that stayed in +S all the way until m, but that's impossible since x belongs to Trim(S, m) +and prepending (x_0, ..., x_k) to that chain demonstrates that x is also +forbidden in S by m, and thus can't belong to Trim(S, m). +Therefore r can not belong to S and our chain demonstrates that +x is forbidden by y in S. QED. + +Corollary: no mount is forbidden by m in Trim(S, m). Indeed, any +such mount would have been forbidden by m in S and thus would have been +in the part of S removed in Trim(S, m). + +Corollary: no mount is forbidden by m in Trim(Trim(S, m), n). Indeed, +any such would have to have been forbidden by m in Trim(S, m), which +is impossible. + +Corollary: after + S = Trim(S, x_1) + S = Trim(S, x_2) + ... + S = Trim(S, x_k) +no mount remaining in S will be forbidden by either of x_1,...,x_k. + +The following will reduce S to its maximal non-shifting subset: + visited = {} + while S contains elements not belonging to visited + let m be an arbitrary such element of S + S = Trim(S, m) + add m to visited + +S never grows, so the number of elements of S not belonging to visited +decreases at least by one on each iteration. When the loop terminates, +all mounts remaining in S belong to visited. It's easy to see that at +the beginning of each iteration no mount remaining in S will be forbidden +by any element of visited. In other words, no mount remaining in S will +be forbidden, i.e. final value of S will be non-shifting. It will be +the maximal non-shifting subset, since we were removing only forbidden +elements. + + There are two difficulties in implementing the above in linear +time, both due to the fact that Trim() might need to remove more than one +element. Naive implementation of Trim() is vulnerable to running into a +long chain of mounts, each mounted on top of parent's root. Nothing in +that chain is forbidden, so nothing gets removed from it. We need to +recognize such chains and avoid walking them again on subsequent calls of +Trim(), otherwise we will end up with worst-case time being quadratic by +the number of elements in S. Another difficulty is in implementing the +outer loop - we need to iterate through all elements of a shrinking set. +That would be trivial if we never removed more than one element at a time +(linked list, with list_for_each_entry_safe for iterator), but we may +need to remove more than one entry, possibly including the ones we have +already visited. + + Let's start with naive algorithm for Trim(): + +Trim_one(m) + found = false + for each n in children(m) + if n not in S + found = true + if (mountpoint(n) != root(m)) + remove m from S + break + if found + Trim_ancestors(m) + +Trim_ancestors(m) + for (; parent(m) in S; m = parent(m)) { + if (mountpoint(m) != root(parent(m))) + remove parent(m) from S + } + +If m belongs to S, Trim_one(m) will replace S with Trim(S, m). +Proof: + Consider the chains excluding elements from Trim(S, m). The last +two elements in such chain are m and some child of m that does not belong +to S. If m has no such children, Trim(S, m) is equal to S. + m itself is removed if and only if the chain has exactly two +elements, i.e. when the last element does not overmount the root of m. +In other words, that happens when m has a child not in S that does not +overmount the root of m. + All other elements to remove will be ancestors of m, such that +the entire descent chain from them to m is contained in S. Let +(x_0, x_1, ..., x_k = m) be the longest such chain. x_i needs to be +removed if and only if x_{i+1} does not overmount its root. It's easy +to see that Trim_ancestors(m) will iterate through that chain from +x_k to x_1 and that it will remove exactly the elements that need to be +removed. + + Note that if the loop in Trim_ancestors() walks into an already +visited element, we are guaranteed that remaining iterations will see +only elements that had already been visited and remove none of them. +That's the weakness that makes it vulnerable to long chains of full +overmounts. + + It's easy to deal with, if we can afford setting marks on +elements of S; we would mark all elements already visited by +Trim_ancestors() and have it bail out as soon as it sees an already +marked element. + + The problems with iterating through the set can be dealt with in +several ways, depending upon the representation we choose for our set. +One useful observation is that we are given a closed subset in S - the +original set passed to propagate_umount(). Its elements can neither +forbid anything nor be forbidden by anything - all their descendents +belong to S, so they can not occur anywhere in any excluding chain. +In other words, the elements of that subset will remain in S until +the end and Trim_one(S, m) is a no-op for all m from that subset. + + That suggests keeping S as a disjoint union of a closed set U +('will be unmounted, no matter what') and the set of all elements of +S that do not belong to U. That set ('candidates') is all we need +to iterate through. Let's represent it as a subset in a cyclic list, +consisting of all list elements that are marked as candidates (initially - +all of them). Then we could have Trim_ancestors() only remove the mark, +leaving the elements on the list. Then Trim_one() would never remove +anything other than its argument from the containing list, allowing to +use list_for_each_entry_safe() as iterator. + + Assuming that representation we get the following: + + list_for_each_entry_safe(m, ..., Candidates, ...) + Trim_one(m) +where +Trim_one(m) + if (m is not marked as a candidate) + strip the "seen by Trim_ancestors" mark from m + remove m from the Candidates list + return + + remove_this = false + found = false + for each n in children(m) + if n not in S + found = true + if (mountpoint(n) != root(m)) + remove_this = true + break + if found + Trim_ancestors(m) + if remove_this + strip the "seen by Trim_ancestors" mark from m + strip the "candidate" mark from m + remove m from the Candidate list + +Trim_ancestors(m) + for (p = parent(m); p is marked as candidate ; m = p, p = parent(p)) { + if m is marked as seen by Trim_ancestors + return + mark m as seen by Trim_ancestors + if (mountpoint(m) != root(p)) + strip the "candidate" mark from p + } + + Terminating condition in the loop in Trim_ancestors() is correct, +since that that loop will never run into p belonging to U - p is always +an ancestor of argument of Trim_one() and since U is closed, the argument +of Trim_one() would also have to belong to U. But Trim_one() is never +called for elements of U. In other words, p belongs to S if and only +if it belongs to candidates. + + Time complexity: +* we get no more than O(#S) calls of Trim_one() +* the loop over children in Trim_one() never looks at the same child +twice through all the calls. +* iterations of that loop for children in S are no more than O(#S) +in the worst case +* at most two children that are not elements of S are considered per +call of Trim_one(). +* the loop in Trim_ancestors() sets its mark once per iteration and +no element of S has is set more than once. + + In the end we may have some elements excluded from S by +Trim_ancestors() still stuck on the list. We could do a separate +loop removing them from the list (also no worse than O(#S) time), +but it's easier to leave that until the next phase - there we will +iterate through the candidates anyway. + + The caller has already removed all elements of U from their parents' +lists of children, which means that checking if child belongs to S is +equivalent to checking if it's marked as a candidate; we'll never see +the elements of U in the loop over children in Trim_one(). + + What's more, if we see that children(m) is empty and m is not +locked, we can immediately move m into the committed subset (remove +from the parent's list of children, etc.). That's one fewer mount we'll +have to look into when we check the list of children of its parent *and* +when we get to building the non-revealing subset. + + Maximal non-revealing subsets + +If S is not a non-revealing subset, there is a locked element x in S +such that parent of x is not in S. + +Obviously, no non-revealing subset of S may contain x. Removing such +elements one by one will obviously end with the maximal non-revealing +subset (possibly empty one). Note that removal of an element will +require removal of all its locked children, etc. + +If the set had been non-shifting, it will remain non-shifting after +such removals. +Proof: suppose S was non-shifting, x is a locked element of S, parent of x +is not in S and S - {x} is not non-shifting. Then there is an element m +in S - {x} and a subtree mounted strictly inside m, such that m contains +an element not in in S - {x}. Since S is non-shifting, everything in +that subtree must belong to S. But that means that this subtree must +contain x somewhere *and* that parent of x either belongs that subtree +or is equal to m. Either way it must belong to S. Contradiction. + +// same representation as for finding maximal non-shifting subsets: +// S is a disjoint union of a non-revealing set U (the ones we are committed +// to unmount) and a set of candidates, represented as a subset of list +// elements that have "is a candidate" mark on them. +// Elements of U are removed from their parents' lists of children. +// In the end candidates becomes empty and maximal non-revealing non-shifting +// subset of S is now in U + while (Candidates list is non-empty) + handle_locked(first(Candidates)) + +handle_locked(m) + if m is not marked as a candidate + strip the "seen by Trim_ancestors" mark from m + remove m from the list + return + cutoff = m + for (p = m; p in candidates; p = parent(p)) { + strip the "seen by Trim_ancestors" mark from p + strip the "candidate" mark from p + remove p from the Candidates list + if (!locked(p)) + cutoff = parent(p) + } + if p in U + cutoff = p + while m != cutoff + remove m from children(parent(m)) + add m to U + m = parent(m) + +Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S. +* If it contains some elements of U, let x_k be the last one of those. +Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing. +* otherwise if all its elements are locked, then none of {x_0, ..., x_n} +may be elements of a non-revealing subset of S. +* otherwise let x_k be the first unlocked element of the chain. Then none +of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of +S and union of U and {x_k, ..., x_n} is non-revealing. + +handle_locked(m) finds which of these cases applies and adjusts Candidates +and U accordingly. U remains non-revealing, union of Candidates and +U still contains any non-revealing subset of S and after the call of +handle_locked(m) m is guaranteed to be not in Candidates list. So having +it called for each element of S would suffice to empty Candidates, +leaving U the maximal non-revealing subset of S. + +However, handle_locked(m) is a no-op when m belongs to U, so it's enough +to have it called for elements of Candidates list until none remain. + +Time complexity: number of calls of handle_locked() is limited by +#Candidates, each iteration of the first loop in handle_locked() removes +an element from the list, so their total number of executions is also +limited by #Candidates; number of iterations in the second loop is no +greater than the number of iterations of the first loop. + + + Reparenting + +After we'd calculated the final set, we still need to deal with +reparenting - if an element of the final set has a child not in it, +we need to reparent such child. + +Such children can only be root-overmounting (otherwise the set wouldn't +be non-shifting) and their parents can not belong to the original set, +since the original is guaranteed to be closed. + + + Putting all of that together + +The plan is to + * find all candidates + * trim down to maximal non-shifting subset + * trim down to maximal non-revealing subset + * reparent anything that needs to be reparented + * return the resulting set to the caller + +For the 2nd and 3rd steps we want to separate the set into growing +non-revealing subset, initially containing the original set ("U" in +terms of the pseudocode above) and everything we are still not sure about +("candidates"). It means that for the output of the 1st step we'd like +the extra candidates separated from the stuff already in the original set. +For the 4th step we would like the additions to U separate from the +original set. + +So let's go for + * original set ("set"). Linkage via mnt_list + * undecided candidates ("candidates"). Subset of a list, +consisting of all its elements marked with a new flag (T_UMOUNT_CANDIDATE). +Initially all elements of the list will be marked that way; in the +end the list will become empty and no mounts will remain marked with +that flag. + * Reuse T_MARKED for "has been already seen by trim_ancestors()". + * anything in U that hadn't been in the original set - elements of +candidates will gradually be either discarded or moved there. In other +words, it's the candidates we have already decided to unmount. Its role +is reasonably close to the old "to_umount", so let's use that name. +Linkage via mnt_list. + +For gather_candidates() we'll need to maintain both candidates (S - +set) and intersection of S with set. Use T_UMOUNT_CANDIDATE for +all elements we encounter, putting the ones not already in the original +set into the list of candidates. When we are done, strip that flag from +all elements of the original set. That gives a cheap way to check +if element belongs to S (in gather_candidates) and to candidates +itself (at later stages). Call that predicate is_candidate(); it would +be m->mnt_t_flags & T_UMOUNT_CANDIDATE. + +All elements of the original set are marked with MNT_UMOUNT and we'll +need the same for elements added when joining the contents of to_umount +to set in the end. Let's set MNT_UMOUNT at the time we add an element +to to_umount; that's close to what the old 'umount_one' is doing, so +let's keep that name. It also gives us another predicate we need - +"belongs to union of set and to_umount"; will_be_unmounted() for now. + +Removals from the candidates list should strip both T_MARKED and +T_UMOUNT_CANDIDATE; call it remove_from_candidates_list(). diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index fd32a9a17bfb..486a91633474 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -515,8 +515,8 @@ As of kernel 2.6.22, the following members are defined: struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int); int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int); int (*fileattr_set)(struct mnt_idmap *idmap, - struct dentry *dentry, struct fileattr *fa); - int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa); + struct dentry *dentry, struct file_kattr *fa); + int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa); struct offset_ctx *(*get_offset_ctx)(struct inode *inode); }; @@ -758,8 +758,9 @@ process is more complicated and uses write_begin/write_end or dirty_folio to write data into the address_space, and writepages to writeback data to storage. -Adding and removing pages to/from an address_space is protected by the -inode's i_mutex. +Removing pages from an address_space requires holding the inode's i_rwsem +exclusively, while adding pages to the address_space requires holding the +inode's i_mapping->invalidate_lock exclusively. When data is written to a page, the PG_Dirty flag should be set. It typically remains set until writepages asks for it to be written. This @@ -822,10 +823,10 @@ cache in your filesystem. The following members are defined: int (*writepages)(struct address_space *, struct writeback_control *); bool (*dirty_folio)(struct address_space *, struct folio *); void (*readahead)(struct readahead_control *); - int (*write_begin)(struct file *, struct address_space *mapping, + int (*write_begin)(const struct kiocb *, struct address_space *mapping, loff_t pos, unsigned len, - struct page **pagep, void **fsdata); - int (*write_end)(struct file *, struct address_space *mapping, + struct page **pagep, void **fsdata); + int (*write_end)(const struct kiocb *, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, struct folio *folio, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); @@ -1071,12 +1072,14 @@ This describes how the VFS can manipulate an open file. As of kernel struct file_operations { struct module *owner; + fop_flags_t fop_flags; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); - int (*iopoll)(struct kiocb *kiocb, bool spin); + int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *, + unsigned int flags); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); @@ -1093,18 +1096,24 @@ This describes how the VFS can manipulate an open file. As of kernel int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); - int (*setlease)(struct file *, long, struct file_lock **, void **); + void (*splice_eof)(struct file *file); + int (*setlease)(struct file *, int, struct file_lease **, void **); long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len); void (*show_fdinfo)(struct seq_file *m, struct file *f); #ifndef CONFIG_MMU unsigned (*mmap_capabilities)(struct file *); #endif - ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int); + ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, + loff_t, size_t, unsigned int); loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); + int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags); + int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *, + unsigned int poll_flags); + int (*mmap_prepare)(struct vm_area_desc *); }; Again, all methods are called without any locks being held, unless @@ -1144,7 +1153,8 @@ otherwise noted. used on 64 bit kernels. ``mmap`` - called by the mmap(2) system call + called by the mmap(2) system call. Deprecated in favour of + ``mmap_prepare``. ``open`` called by the VFS when an inode should be opened. When the VFS @@ -1221,6 +1231,11 @@ otherwise noted. ``fadvise`` possibly called by the fadvise64() system call. +``mmap_prepare`` + Called by the mmap(2) system call. Allows a VFS to set up a + file-backed memory mapping, most notably establishing relevant + private state and VMA callbacks. + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special |