Documentation/filesystems/propagate_umount.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484

	Notes on propagate_umount()

Umount propagation starts with a set of mounts we are already going to
take out.  Ideally, we would like to add all downstream cognates to
that set - anything with the same mountpoint as one of the removed
mounts and with parent that would receive events from the parent of that
mount.  However, there are some constraints the resulting set must
satisfy.

It is convenient to define several properties of sets of mounts:

1) A set S of mounts is non-shifting if for any mount X belonging
to S all subtrees mounted strictly inside of X (i.e. not overmounting
the root of X) contain only elements of S.

2) A set S is non-revealing if all locked mounts that belong to S have
parents that also belong to S.

3) A set S is closed if it contains all children of its elements.

The set of mounts taken out by umount(2) must be non-shifting and
non-revealing; the first constraint is what allows to reparent
any remaining mounts and the second is what prevents the exposure
of any concealed mountpoints.

propagate_umount() takes the original set as an argument and tries to
extend that set.  The original set is a full subtree and its root is
unlocked; what matters is that it's closed and non-revealing.
Resulting set may not be closed; there might still be mounts outside
of that set, but only on top of stacks of root-overmounting elements
of set.  They can be reparented to the place where the bottom of
stack is attached to a mount that will survive.  NOTE: doing that
will violate a constraint on having no more than one mount with
the same parent/mountpoint pair; however, the caller (umount_tree())
will immediately remedy that - it may keep unmounted element attached
to parent, but only if the parent itself is unmounted.  Since all
conflicts created by reparenting have common parent *not* in the
set and one side of the conflict (bottom of the stack of overmounts)
is in the set, it will be resolved.  However, we rely upon umount_tree()
doing that pretty much immediately after the call of propagate_umount().

Algorithm is based on two statements:
	1) for any set S, there is a maximal non-shifting subset of S
and it can be calculated in O(#S) time.
	2) for any non-shifting set S, there is a maximal non-revealing
subset of S.  That subset is also non-shifting and it can be calculated
in O(#S) time.

		Finding candidates.

We are given a closed set U and we want to find all mounts that have
the same mountpoint as some mount m in U *and* whose parent receives
propagation from the parent of the same mount m.  Naive implementation
would be
	S = {}
	for each m in U
		add m to S
		p = parent(m)
		for each q in Propagation(p) - {p}
			child = look_up(q, mountpoint(m))
			if child
				add child to S
but that can lead to excessive work - there might be propagation among the
subtrees of U, in which case we'd end up examining the same candidates
many times.  Since propagation is transitive, the same will happen to
everything downstream of that candidate and it's not hard to construct
cases where the approach above leads to the time quadratic by the actual
number of candidates.

Note that if we run into a candidate we'd already seen, it must've been
added on an earlier iteration of the outer loop - all additions made
during one iteration of the outer loop have different parents.  So
if we find a child already added to the set, we know that everything
in Propagation(parent(child)) with the same mountpoint has been already
added.
	S = {}
	for each m in U
		if m in S
			continue
		add m to S
		p = parent(m)
		q = propagation_next(p, p)
		while q
			child = look_up(q, mountpoint(m))
			if child
				if child in S
					q = skip_them(q, p)
					continue;
				add child to S
			q = propagation_next(q, p)
where
skip_them(q, p)
	keep walking Propagation(p) from q until we find something
	not in Propagation(q)

would get rid of that problem, but we need a sane implementation of
skip_them().  That's not hard to do - split propagation_next() into
"down into mnt_slave_list" and "forward-and-up" parts, with the
skip_them() being "repeat the forward-and-up part until we get NULL
or something that isn't a peer of the one we are skipping".

Note that there can be no absolute roots among the extra candidates -
they all come from mount lookups.  Absolute root among the original
set is _currently_ impossible, but it might be worth protecting
against.

		Maximal non-shifting subsets.

Let's call a mount m in a set S forbidden in that set if there is a
subtree mounted strictly inside m and containing mounts that do not
belong to S.

The set is non-shifting when none of its elements are forbidden in it.

If mount m is forbidden in a set S, it is forbidden in any subset S' it
belongs to.  In other words, it can't belong to any of the non-shifting
subsets of S.  If we had a way to find a forbidden mount or show that
there's none, we could use it to find the maximal non-shifting subset
simply by finding and removing them until none remain.

Suppose mount m is forbidden in S; then any mounts forbidden in S - {m}
must have been forbidden in S itself.  Indeed, since m has descendents
that do not belong to S, any subtree that fits into S will fit into
S - {m} as well.

So in principle we could go through elements of S, checking if they
are forbidden in S and removing the ones that are.  Removals will
not invalidate the checks done for earlier mounts - if they were not
forbidden at the time we checked, they won't become forbidden later.
It's too costly to be practical, but there is a similar approach that
is linear by size of S.

Let's say that mount x in a set S is forbidden by mount y, if
	* both x and y belong to S.
	* there is a chain of mounts starting at x and leaving S
	  immediately after passing through y, with the first
	  mountpoint strictly inside x.
Note 1: x may be equal to y - that's the case when something not
belonging to S is mounted strictly inside x.
Note 2: if y does not belong to S, it can't forbid anything in S.
Note 3: if y has no children outside of S, it can't forbid anything in S.

It's easy to show that mount x is forbidden in S if and only if x is
forbidden in S by some mount y.  And it's easy to find all mounts in S
forbidden by a given mount.

Consider the following operation:
	Trim(S, m) = S - {x : x is forbidden by m in S}

Note that if m does not belong to S or has no children outside of S we
are guaranteed that Trim(S, m) is equal to S.

The following is true: if x is forbidden by y in Trim(S, m), it was
already forbidden by y in S.

Proof: Suppose x is forbidden by y in Trim(S, m).  Then there is a
chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1}
is the first element that doesn't belong to Trim(S, m) and the
mountpoint of x_1 is strictly inside x.  If mount r belongs to S, it must
have been removed by Trim(S, m), i.e. it was forbidden in S by m.
Then there was a mount chain from r to some child of m that stayed in
S all the way until m, but that's impossible since x belongs to Trim(S, m)
and prepending (x_0, ..., x_k) to that chain demonstrates that x is also
forbidden in S by m, and thus can't belong to Trim(S, m).
Therefore r can not belong to S and our chain demonstrates that
x is forbidden by y in S.  QED.

Corollary: no mount is forbidden by m in Trim(S, m).  Indeed, any
such mount would have been forbidden by m in S and thus would have been
in the part of S removed in Trim(S, m).

Corollary: no mount is forbidden by m in Trim(Trim(S, m), n).  Indeed,
any such would have to have been forbidden by m in Trim(S, m), which
is impossible.

Corollary: after
	S = Trim(S, x_1)
	S = Trim(S, x_2)
	...
	S = Trim(S, x_k)
no mount remaining in S will be forbidden by either of x_1,...,x_k.

The following will reduce S to its maximal non-shifting subset:
	visited = {}
	while S contains elements not belonging to visited
		let m be an arbitrary such element of S
		S = Trim(S, m)
		add m to visited

S never grows, so the number of elements of S not belonging to visited
decreases at least by one on each iteration.  When the loop terminates,
all mounts remaining in S belong to visited.  It's easy to see that at
the beginning of each iteration no mount remaining in S will be forbidden
by any element of visited.  In other words, no mount remaining in S will
be forbidden, i.e. final value of S will be non-shifting.  It will be
the maximal non-shifting subset, since we were removing only forbidden
elements.

	There are two difficulties in implementing the above in linear
time, both due to the fact that Trim() might need to remove more than one
element.  Naive implementation of Trim() is vulnerable to running into a
long chain of mounts, each mounted on top of parent's root.  Nothing in
that chain is forbidden, so nothing gets removed from it.  We need to
recognize such chains and avoid walking them again on subsequent calls of
Trim(), otherwise we will end up with worst-case time being quadratic by
the number of elements in S.  Another difficulty is in implementing the
outer loop - we need to iterate through all elements of a shrinking set.
That would be trivial if we never removed more than one element at a time
(linked list, with list_for_each_entry_safe for iterator), but we may
need to remove more than one entry, possibly including the ones we have
already visited.

	Let's start with naive algorithm for Trim():

Trim_one(m)
	found = false
	for each n in children(m)
		if n not in S
			found = true
			if (mountpoint(n) != root(m))
				remove m from S
				break
	if found
		Trim_ancestors(m)

Trim_ancestors(m)
	for (; parent(m) in S; m = parent(m)) {
		if (mountpoint(m) != root(parent(m)))
			remove parent(m) from S
	}

If m belongs to S, Trim_one(m) will replace S with Trim(S, m).
Proof:
	Consider the chains excluding elements from Trim(S, m).  The last
two elements in such chain are m and some child of m that does not belong
to S.  If m has no such children, Trim(S, m) is equal to S.
	m itself is removed if and only if the chain has exactly two
elements, i.e. when the last element does not overmount the root of m.
In other words, that happens when m has a child not in S that does not
overmount the root of m.
	All other elements to remove will be ancestors of m, such that
the entire descent chain from them to m is contained in S.  Let
(x_0, x_1, ..., x_k = m) be the longest such chain.  x_i needs to be
removed if and only if x_{i+1} does not overmount its root.  It's easy
to see that Trim_ancestors(m) will iterate through that chain from
x_k to x_1 and that it will remove exactly the elements that need to be
removed.

	Note that if the loop in Trim_ancestors() walks into an already
visited element, we are guaranteed that remaining iterations will see
only elements that had already been visited and remove none of them.
That's the weakness that makes it vulnerable to long chains of full
overmounts.

	It's easy to deal with, if we can afford setting marks on
elements of S; we would mark all elements already visited by
Trim_ancestors() and have it bail out as soon as it sees an already
marked element.

	The problems with iterating through the set can be dealt with in
several ways, depending upon the representation we choose for our set.
One useful observation is that we are given a closed subset in S - the
original set passed to propagate_umount().  Its elements can neither
forbid anything nor be forbidden by anything - all their descendents
belong to S, so they can not occur anywhere in any excluding chain.
In other words, the elements of that subset will remain in S until
the end and Trim_one(S, m) is a no-op for all m from that subset.

	That suggests keeping S as a disjoint union of a closed set U
('will be unmounted, no matter what') and the set of all elements of
S that do not belong to U.  That set ('candidates') is all we need
to iterate through.  Let's represent it as a subset in a cyclic list,
consisting of all list elements that are marked as candidates (initially -
all of them).  Then we could have Trim_ancestors() only remove the mark,
leaving the elements on the list.  Then Trim_one() would never remove
anything other than its argument from the containing list, allowing to
use list_for_each_entry_safe() as iterator.

	Assuming that representation we get the following:

	list_for_each_entry_safe(m, ..., Candidates, ...)
		Trim_one(m)
where
Trim_one(m)
	if (m is not marked as a candidate)
		strip the "seen by Trim_ancestors" mark from m
		remove m from the Candidates list
		return
		
	remove_this = false
	found = false
	for each n in children(m)
		if n not in S
			found = true
			if (mountpoint(n) != root(m))
				remove_this = true
				break
	if found
		Trim_ancestors(m)
	if remove_this
		strip the "seen by Trim_ancestors" mark from m
		strip the "candidate" mark from m
		remove m from the Candidate list

Trim_ancestors(m)
	for (p = parent(m); p is marked as candidate ; m = p, p = parent(p)) {
		if m is marked as seen by Trim_ancestors
			return
		mark m as seen by Trim_ancestors
		if (mountpoint(m) != root(p))
			strip the "candidate" mark from p
	}

	Terminating condition in the loop in Trim_ancestors() is correct,
since that that loop will never run into p belonging to U - p is always
an ancestor of argument of Trim_one() and since U is closed, the argument
of Trim_one() would also have to belong to U.  But Trim_one() is never
called for elements of U.  In other words, p belongs to S if and only
if it belongs to candidates.

	Time complexity:
* we get no more than O(#S) calls of Trim_one()
* the loop over children in Trim_one() never looks at the same child
twice through all the calls.
* iterations of that loop for children in S are no more than O(#S)
in the worst case
* at most two children that are not elements of S are considered per
call of Trim_one().
* the loop in Trim_ancestors() sets its mark once per iteration and
no element of S has is set more than once.

	In the end we may have some elements excluded from S by
Trim_ancestors() still stuck on the list.  We could do a separate
loop removing them from the list (also no worse than O(#S) time),
but it's easier to leave that until the next phase - there we will
iterate through the candidates anyway.

	The caller has already removed all elements of U from their parents'
lists of children, which means that checking if child belongs to S is
equivalent to checking if it's marked as a candidate; we'll never see
the elements of U in the loop over children in Trim_one().

	What's more, if we see that children(m) is empty and m is not
locked, we can immediately move m into the committed subset (remove
from the parent's list of children, etc.).  That's one fewer mount we'll
have to look into when we check the list of children of its parent *and*
when we get to building the non-revealing subset.

		Maximal non-revealing subsets

If S is not a non-revealing subset, there is a locked element x in S
such that parent of x is not in S.

Obviously, no non-revealing subset of S may contain x.  Removing such
elements one by one will obviously end with the maximal non-revealing
subset (possibly empty one).  Note that removal of an element will
require removal of all its locked children, etc.

If the set had been non-shifting, it will remain non-shifting after
such removals.
Proof: suppose S was non-shifting, x is a locked element of S, parent of x
is not in S and S - {x} is not non-shifting.  Then there is an element m
in S - {x} and a subtree mounted strictly inside m, such that m contains
an element not in in S - {x}.  Since S is non-shifting, everything in
that subtree must belong to S.  But that means that this subtree must
contain x somewhere *and* that parent of x either belongs that subtree
or is equal to m.  Either way it must belong to S.  Contradiction.

// same representation as for finding maximal non-shifting subsets:
// S is a disjoint union of a non-revealing set U (the ones we are committed
// to unmount) and a set of candidates, represented as a subset of list
// elements that have "is a candidate" mark on them.
// Elements of U are removed from their parents' lists of children.
// In the end candidates becomes empty and maximal non-revealing non-shifting
// subset of S is now in U
	while (Candidates list is non-empty)
		handle_locked(first(Candidates))

handle_locked(m)
	if m is not marked as a candidate
		strip the "seen by Trim_ancestors" mark from m
		remove m from the list
		return
	cutoff = m
	for (p = m; p in candidates; p = parent(p)) {
		strip the "seen by Trim_ancestors" mark from p
		strip the "candidate" mark from p
		remove p from the Candidates list
		if (!locked(p))
			cutoff = parent(p)
	}
	if p in U
		cutoff = p
	while m != cutoff
		remove m from children(parent(m))
		add m to U
		m = parent(m)

Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S.
* If it contains some elements of U, let x_k be the last one of those.
Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing.
* otherwise if all its elements are locked, then none of {x_0, ..., x_n}
may be elements of a non-revealing subset of S.
* otherwise let x_k be the first unlocked element of the chain.  Then none
of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of
S and union of U and {x_k, ..., x_n} is non-revealing.

handle_locked(m) finds which of these cases applies and adjusts Candidates
and U accordingly.  U remains non-revealing, union of Candidates and
U still contains any non-revealing subset of S and after the call of
handle_locked(m) m is guaranteed to be not in Candidates list.  So having
it called for each element of S would suffice to empty Candidates,
leaving U the maximal non-revealing subset of S.

However, handle_locked(m) is a no-op when m belongs to U, so it's enough
to have it called for elements of Candidates list until none remain.

Time complexity: number of calls of handle_locked() is limited by
#Candidates, each iteration of the first loop in handle_locked() removes
an element from the list, so their total number of executions is also
limited by #Candidates; number of iterations in the second loop is no
greater than the number of iterations of the first loop.


		Reparenting

After we'd calculated the final set, we still need to deal with
reparenting - if an element of the final set has a child not in it,
we need to reparent such child.

Such children can only be root-overmounting (otherwise the set wouldn't
be non-shifting) and their parents can not belong to the original set,
since the original is guaranteed to be closed.


		Putting all of that together

The plan is to
	* find all candidates
	* trim down to maximal non-shifting subset
	* trim down to maximal non-revealing subset
	* reparent anything that needs to be reparented
	* return the resulting set to the caller

For the 2nd and 3rd steps we want to separate the set into growing
non-revealing subset, initially containing the original set ("U" in
terms of the pseudocode above) and everything we are still not sure about
("candidates").  It means that for the output of the 1st step we'd like
the extra candidates separated from the stuff already in the original set.
For the 4th step we would like the additions to U separate from the
original set.

So let's go for
	* original set ("set").  Linkage via mnt_list
	* undecided candidates ("candidates").  Subset of a list,
consisting of all its elements marked with a new flag (T_UMOUNT_CANDIDATE).
Initially all elements of the list will be marked that way; in the
end the list will become empty and no mounts will remain marked with
that flag.
	* Reuse T_MARKED for "has been already seen by trim_ancestors()".
	* anything in U that hadn't been in the original set - elements of
candidates will gradually be either discarded or moved there.  In other
words, it's the candidates we have already decided to unmount.	Its role
is reasonably close to the old "to_umount", so let's use that name.
Linkage via mnt_list.

For gather_candidates() we'll need to maintain both candidates (S -
set) and intersection of S with set.  Use T_UMOUNT_CANDIDATE for
all elements we encounter, putting the ones not already in the original
set into the list of candidates.  When we are done, strip that flag from
all elements of the original set.  That gives a cheap way to check
if element belongs to S (in gather_candidates) and to candidates
itself (at later stages).  Call that predicate is_candidate(); it would
be m->mnt_t_flags & T_UMOUNT_CANDIDATE.

All elements of the original set are marked with MNT_UMOUNT and we'll
need the same for elements added when joining the contents of to_umount
to set in the end.  Let's set MNT_UMOUNT at the time we add an element
to to_umount; that's close to what the old 'umount_one' is doing, so
let's keep that name.  It also gives us another predicate we need -
"belongs to union of set and to_umount"; will_be_unmounted() for now.

Removals from the candidates list should strip both T_MARKED and
T_UMOUNT_CANDIDATE; call it remove_from_candidates_list().