summaryrefslogtreecommitdiff
path: root/Documentation/RCU/Design/Memory-Ordering
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/RCU/Design/Memory-Ordering')
-rw-r--r--Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html9
-rw-r--r--Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html704
-rw-r--r--Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst624
-rw-r--r--Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg2
-rw-r--r--Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg2
5 files changed, 626 insertions, 715 deletions
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html
deleted file mode 100644
index e5b42a798ff3..000000000000
--- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html
+++ /dev/null
@@ -1,9 +0,0 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
- "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <head><title>A Diagram of TREE_RCU's Grace-Period Memory Ordering</title>
- <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
-
-<p><img src="TreeRCU-gp.svg" alt="TreeRCU-gp.svg">
-
-</body></html>
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
deleted file mode 100644
index c64f8d26609f..000000000000
--- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
+++ /dev/null
@@ -1,704 +0,0 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
- "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title>
- <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
-
- <p>August 8, 2017</p>
- <p>This article was contributed by Paul E.&nbsp;McKenney</p>
-
-<h3>Introduction</h3>
-
-<p>This document gives a rough visual overview of how Tree RCU's
-grace-period memory ordering guarantee is provided.
-
-<ol>
-<li> <a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
- What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a>
-<li> <a href="#Tree RCU Grace Period Memory Ordering Building Blocks">
- Tree RCU Grace Period Memory Ordering Building Blocks</a>
-<li> <a href="#Tree RCU Grace Period Memory Ordering Components">
- Tree RCU Grace Period Memory Ordering Components</a>
-<li> <a href="#Putting It All Together">Putting It All Together</a>
-</ol>
-
-<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
-What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3>
-
-<p>RCU grace periods provide extremely strong memory-ordering guarantees
-for non-idle non-offline code.
-Any code that happens after the end of a given RCU grace period is guaranteed
-to see the effects of all accesses prior to the beginning of that grace
-period that are within RCU read-side critical sections.
-Similarly, any code that happens before the beginning of a given RCU grace
-period is guaranteed to see the effects of all accesses following the end
-of that grace period that are within RCU read-side critical sections.
-
-<p>Note well that RCU-sched read-side critical sections include any region
-of code for which preemption is disabled.
-Given that each individual machine instruction can be thought of as
-an extremely small region of preemption-disabled code, one can think of
-<tt>synchronize_rcu()</tt> as <tt>smp_mb()</tt> on steroids.
-
-<p>RCU updaters use this guarantee by splitting their updates into
-two phases, one of which is executed before the grace period and
-the other of which is executed after the grace period.
-In the most common use case, phase one removes an element from
-a linked RCU-protected data structure, and phase two frees that element.
-For this to work, any readers that have witnessed state prior to the
-phase-one update (in the common case, removal) must not witness state
-following the phase-two update (in the common case, freeing).
-
-<p>The RCU implementation provides this guarantee using a network
-of lock-based critical sections, memory barriers, and per-CPU
-processing, as is described in the following sections.
-
-<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks">
-Tree RCU Grace Period Memory Ordering Building Blocks</a></h3>
-
-<p>The workhorse for RCU's grace-period memory ordering is the
-critical section for the <tt>rcu_node</tt> structure's
-<tt>-&gt;lock</tt>.
-These critical sections use helper functions for lock acquisition, including
-<tt>raw_spin_lock_rcu_node()</tt>,
-<tt>raw_spin_lock_irq_rcu_node()</tt>, and
-<tt>raw_spin_lock_irqsave_rcu_node()</tt>.
-Their lock-release counterparts are
-<tt>raw_spin_unlock_rcu_node()</tt>,
-<tt>raw_spin_unlock_irq_rcu_node()</tt>, and
-<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>,
-respectively.
-For completeness, a
-<tt>raw_spin_trylock_rcu_node()</tt>
-is also provided.
-The key point is that the lock-acquisition functions, including
-<tt>raw_spin_trylock_rcu_node()</tt>, all invoke
-<tt>smp_mb__after_unlock_lock()</tt> immediately after successful
-acquisition of the lock.
-
-<p>Therefore, for any given <tt>rcu_node</tt> structure, any access
-happening before one of the above lock-release functions will be seen
-by all CPUs as happening before any access happening after a later
-one of the above lock-acquisition functions.
-Furthermore, any access happening before one of the
-above lock-release function on any given CPU will be seen by all
-CPUs as happening before any access happening after a later one
-of the above lock-acquisition functions executing on that same CPU,
-even if the lock-release and lock-acquisition functions are operating
-on different <tt>rcu_node</tt> structures.
-Tree RCU uses these two ordering guarantees to form an ordering
-network among all CPUs that were in any way involved in the grace
-period, including any CPUs that came online or went offline during
-the grace period in question.
-
-<p>The following litmus test exhibits the ordering effects of these
-lock-acquisition and lock-release functions:
-
-<pre>
- 1 int x, y, z;
- 2
- 3 void task0(void)
- 4 {
- 5 raw_spin_lock_rcu_node(rnp);
- 6 WRITE_ONCE(x, 1);
- 7 r1 = READ_ONCE(y);
- 8 raw_spin_unlock_rcu_node(rnp);
- 9 }
-10
-11 void task1(void)
-12 {
-13 raw_spin_lock_rcu_node(rnp);
-14 WRITE_ONCE(y, 1);
-15 r2 = READ_ONCE(z);
-16 raw_spin_unlock_rcu_node(rnp);
-17 }
-18
-19 void task2(void)
-20 {
-21 WRITE_ONCE(z, 1);
-22 smp_mb();
-23 r3 = READ_ONCE(x);
-24 }
-25
-26 WARN_ON(r1 == 0 &amp;&amp; r2 == 0 &amp;&amp; r3 == 0);
-</pre>
-
-<p>The <tt>WARN_ON()</tt> is evaluated at &ldquo;the end of time&rdquo;,
-after all changes have propagated throughout the system.
-Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the
-acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example
-on PowerPC.
-The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this
-<tt>WARN_ON()</tt> from triggering.
-
-<p>This approach must be extended to include idle CPUs, which need
-RCU's grace-period memory ordering guarantee to extend to any
-RCU read-side critical sections preceding and following the current
-idle sojourn.
-This case is handled by calls to the strongly ordered
-<tt>atomic_add_return()</tt> read-modify-write atomic operation that
-is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry
-time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time.
-The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and
-<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke
-an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- But what about CPUs that remain offline for the entire
- grace period?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- Such CPUs will be offline at the beginning of the grace period,
- so the grace period won't expect quiescent states from them.
- Races between grace-period start and CPU-hotplug operations
- are mediated by the CPU's leaf <tt>rcu_node</tt> structure's
- <tt>-&gt;lock</tt> as described above.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
-
-<p>The approach must be extended to handle one final case, that
-of waking a task blocked in <tt>synchronize_rcu()</tt>.
-This task might be affinitied to a CPU that is not yet aware that
-the grace period has ended, and thus might not yet be subject to
-the grace period's memory ordering.
-Therefore, there is an <tt>smp_mb()</tt> after the return from
-<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt>
-code path.
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- What? Where???
- I don't see any <tt>smp_mb()</tt> after the return from
- <tt>wait_for_completion()</tt>!!!
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- That would be because I spotted the need for that
- <tt>smp_mb()</tt> during the creation of this documentation,
- and it is therefore unlikely to hit mainline before v4.14.
- Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
- Jonathan Cameron for asking questions that sensitized me
- to the rather elaborate sequence of events that demonstrate
- the need for this memory barrier.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
-
-<p>Tree RCU's grace--period memory-ordering guarantees rely most
-heavily on the <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>
-field, so much so that it is necessary to abbreviate this pattern
-in the diagrams in the next section.
-For example, consider the <tt>rcu_prepare_for_idle()</tt> function
-shown below, which is one of several functions that enforce ordering
-of newly arrived RCU callbacks against future grace periods:
-
-<pre>
- 1 static void rcu_prepare_for_idle(void)
- 2 {
- 3 bool needwake;
- 4 struct rcu_data *rdp;
- 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&amp;rcu_dynticks);
- 6 struct rcu_node *rnp;
- 7 struct rcu_state *rsp;
- 8 int tne;
- 9
-10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
-11 rcu_is_nocb_cpu(smp_processor_id()))
-12 return;
-13 tne = READ_ONCE(tick_nohz_active);
-14 if (tne != rdtp-&gt;tick_nohz_enabled_snap) {
-15 if (rcu_cpu_has_callbacks(NULL))
-16 invoke_rcu_core();
-17 rdtp-&gt;tick_nohz_enabled_snap = tne;
-18 return;
-19 }
-20 if (!tne)
-21 return;
-22 if (rdtp-&gt;all_lazy &amp;&amp;
-23 rdtp-&gt;nonlazy_posted != rdtp-&gt;nonlazy_posted_snap) {
-24 rdtp-&gt;all_lazy = false;
-25 rdtp-&gt;nonlazy_posted_snap = rdtp-&gt;nonlazy_posted;
-26 invoke_rcu_core();
-27 return;
-28 }
-29 if (rdtp-&gt;last_accelerate == jiffies)
-30 return;
-31 rdtp-&gt;last_accelerate = jiffies;
-32 for_each_rcu_flavor(rsp) {
-33 rdp = this_cpu_ptr(rsp-&gt;rda);
-34 if (rcu_segcblist_pend_cbs(&amp;rdp-&gt;cblist))
-35 continue;
-36 rnp = rdp-&gt;mynode;
-37 raw_spin_lock_rcu_node(rnp);
-38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
-39 raw_spin_unlock_rcu_node(rnp);
-40 if (needwake)
-41 rcu_gp_kthread_wake(rsp);
-42 }
-43 }
-</pre>
-
-<p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really
-matters for this discussion are lines&nbsp;37&ndash;39.
-We will therefore abbreviate this function as follows:
-
-</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg">
-
-<p>The box represents the <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>
-critical section, with the double line on top representing the additional
-<tt>smp_mb__after_unlock_lock()</tt>.
-
-<h3><a name="Tree RCU Grace Period Memory Ordering Components">
-Tree RCU Grace Period Memory Ordering Components</a></h3>
-
-<p>Tree RCU's grace-period memory-ordering guarantee is provided by
-a number of RCU components:
-
-<ol>
-<li> <a href="#Callback Registry">Callback Registry</a>
-<li> <a href="#Grace-Period Initialization">Grace-Period Initialization</a>
-<li> <a href="#Self-Reported Quiescent States">
- Self-Reported Quiescent States</a>
-<li> <a href="#Dynamic Tick Interface">Dynamic Tick Interface</a>
-<li> <a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a>
-<li> <a href="Forcing Quiescent States">Forcing Quiescent States</a>
-<li> <a href="Grace-Period Cleanup">Grace-Period Cleanup</a>
-<li> <a href="Callback Invocation">Callback Invocation</a>
-</ol>
-
-<p>Each of the following section looks at the corresponding component
-in detail.
-
-<h4><a name="Callback Registry">Callback Registry</a></h4>
-
-<p>If RCU's grace-period guarantee is to mean anything at all, any
-access that happens before a given invocation of <tt>call_rcu()</tt>
-must also happen before the corresponding grace period.
-The implementation of this portion of RCU's grace period guarantee
-is shown in the following figure:
-
-</p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg">
-
-<p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state,
-it provides no ordering guarantees, either for itself or for
-phase one of the update (which again will usually be removal of
-an element from an RCU-protected data structure).
-It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list,
-which cannot become associated with a grace period until a later
-call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above.
-
-<p>One set of code paths shown on the left invokes
-<tt>rcu_accelerate_cbs()</tt> via
-<tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if
-the current CPU is inundated with queued <tt>rcu_head</tt> structures)
-or more likely from an <tt>RCU_SOFTIRQ</tt> handler.
-Another code path in the middle is taken only in kernels built with
-<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes
-<tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>.
-The final code path on the right is taken only in kernels built with
-<tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes
-<tt>rcu_accelerate_cbs()</tt> via
-<tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks</tt>,
-<tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>,
-which in turn is invoked on a surviving CPU after the outgoing
-CPU has been completely offlined.
-
-<p>There are a few other code paths within grace-period processing
-that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>.
-However, either way, all of the CPU's recently queued <tt>rcu_head</tt>
-structures are associated with a future grace-period number under
-the protection of the CPU's lead <tt>rcu_node</tt> structure's
-<tt>-&gt;lock</tt>.
-In all cases, there is full ordering against any prior critical section
-for that same <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>, and
-also full ordering against any of the current task's or CPU's prior critical
-sections for any <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>.
-
-<p>The next section will show how this ordering ensures that any
-accesses prior to the <tt>call_rcu()</tt> (particularly including phase
-one of the update)
-happen before the start of the corresponding grace period.
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- But what about <tt>synchronize_rcu()</tt>?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt>
- to <tt>wait_rcu_gp()</tt>, which invokes it.
- So either way, it eventually comes down to <tt>call_rcu()</tt>.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
-
-<h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4>
-
-<p>Grace-period initialization is carried out by
-the grace-period kernel thread, which makes several passes over the
-<tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function.
-This means that showing the full flow of ordering through the
-grace-period computation will require duplicating this tree.
-If you find this confusing, please note that the state of the
-<tt>rcu_node</tt> changes over time, just like Heraclitus's river.
-However, to keep the <tt>rcu_node</tt> river tractable, the
-grace-period kernel thread's traversals are presented in multiple
-parts, starting in this section with the various phases of
-grace-period initialization.
-
-<p>The first ordering-related grace-period initialization action is to
-advance the <tt>rcu_state</tt> structure's <tt>-&gt;gp_seq</tt>
-grace-period-number counter, as shown below:
-
-</p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
-
-<p>The actual increment is carried out using <tt>smp_store_release()</tt>,
-which helps reject false-positive RCU CPU stall detection.
-Note that only the root <tt>rcu_node</tt> structure is touched.
-
-<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks
-based on CPUs having come online or gone offline since the start of
-the previous grace period.
-In the common case where the number of online CPUs for this <tt>rcu_node</tt>
-structure has not transitioned to or from zero,
-this pass will scan only the leaf <tt>rcu_node</tt> structures.
-However, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
-structure has transitioned from zero,
-<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU.
-Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
-structure has transitioned to zero,
-<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU.
-The diagram below shows the path of ordering if the leftmost
-<tt>rcu_node</tt> structure onlines its first CPU and if the next
-<tt>rcu_node</tt> structure has no online CPUs
-(or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines
-its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs).
-
-</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
-
-<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt>
-tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's
-<tt>-&gt;gp_seq</tt> field to the newly advanced value from the
-<tt>rcu_state</tt> structure, as shown in the following diagram.
-
-</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
-
-<p>This change will also cause each CPU's next call to
-<tt>__note_gp_changes()</tt>
-to notice that a new grace period has started, as described in the next
-section.
-But because the grace-period kthread started the grace period at the
-root (with the advancing of the <tt>rcu_state</tt> structure's
-<tt>-&gt;gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt>
-structure's <tt>-&gt;gp_seq</tt> field, each CPU's observation of
-the start of the grace period will happen after the actual start
-of the grace period.
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- But what about the CPU that started the grace period?
- Why wouldn't it see the start of the grace period right when
- it started that grace period?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- In some deep philosophical and overly anthromorphized
- sense, yes, the CPU starting the grace period is immediately
- aware of having done so.
- However, if we instead assume that RCU is not self-aware,
- then even the CPU starting the grace period does not really
- become aware of the start of this grace period until its
- first call to <tt>__note_gp_changes()</tt>.
- On the other hand, this CPU potentially gets early notification
- because it invokes <tt>__note_gp_changes()</tt> during its
- last <tt>rcu_gp_init()</tt> pass through its leaf
- <tt>rcu_node</tt> structure.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
-
-<h4><a name="Self-Reported Quiescent States">
-Self-Reported Quiescent States</a></h4>
-
-<p>When all entities that might block the grace period have reported
-quiescent states (or as described in a later section, had quiescent
-states reported on their behalf), the grace period can end.
-Online non-idle CPUs report their own quiescent states, as shown
-in the following diagram:
-
-</p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%">
-
-<p>This is for the last CPU to report a quiescent state, which signals
-the end of the grace period.
-Earlier quiescent states would push up the <tt>rcu_node</tt> tree
-only until they encountered an <tt>rcu_node</tt> structure that
-is waiting for additional quiescent states.
-However, ordering is nevertheless preserved because some later quiescent
-state will acquire that <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>.
-
-<p>Any number of events can lead up to a CPU invoking
-<tt>note_gp_changes</tt> (or alternatively, directly invoking
-<tt>__note_gp_changes()</tt>), at which point that CPU will notice
-the start of a new grace period while holding its leaf
-<tt>rcu_node</tt> lock.
-Therefore, all execution shown in this diagram happens after the
-start of the grace period.
-In addition, this CPU will consider any RCU read-side critical
-section that started before the invocation of <tt>__note_gp_changes()</tt>
-to have started before the grace period, and thus a critical
-section that the grace period must wait on.
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- But a RCU read-side critical section might have started
- after the beginning of the grace period
- (the advancing of <tt>-&gt;gp_seq</tt> from earlier), so why should
- the grace period wait on such a critical section?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- It is indeed not necessary for the grace period to wait on such
- a critical section.
- However, it is permissible to wait on it.
- And it is furthermore important to wait on it, as this
- lazy approach is far more scalable than a &ldquo;big bang&rdquo;
- all-at-once grace-period start could possibly be.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
-
-<p>If the CPU does a context switch, a quiescent state will be
-noted by <tt>rcu_node_context_switch()</tt> on the left.
-On the other hand, if the CPU takes a scheduler-clock interrupt
-while executing in usermode, a quiescent state will be noted by
-<tt>rcu_sched_clock_irq()</tt> on the right.
-Either way, the passage through a quiescent state will be noted
-in a per-CPU variable.
-
-<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on
-this CPU (for example, after the next scheduler-clock
-interrupt), <tt>rcu_core()</tt> will invoke
-<tt>rcu_check_quiescent_state()</tt>, which will notice the
-recorded quiescent state, and invoke
-<tt>rcu_report_qs_rdp()</tt>.
-If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state
-really does apply to the current grace period, it invokes
-<tt>rcu_report_rnp()</tt> which traverses up the <tt>rcu_node</tt>
-tree as shown at the bottom of the diagram, clearing bits from
-each <tt>rcu_node</tt> structure's <tt>-&gt;qsmask</tt> field,
-and propagating up the tree when the result is zero.
-
-<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt>
-structure only if the current CPU is reporting the last quiescent
-state for the subtree headed by that <tt>rcu_node</tt> structure.
-A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt>
-structure, then there will be a later traversal by another CPU
-(or perhaps the same one) that proceeds upwards
-from that point, and the <tt>rcu_node</tt> <tt>-&gt;lock</tt>
-guarantees that the first CPU's quiescent state happens before the
-remainder of the second CPU's traversal.
-Applying this line of thought repeatedly shows that all CPUs'
-quiescent states happen before the last CPU traverses through
-the root <tt>rcu_node</tt> structure, the &ldquo;last CPU&rdquo;
-being the one that clears the last bit in the root <tt>rcu_node</tt>
-structure's <tt>-&gt;qsmask</tt> field.
-
-<h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4>
-
-<p>Due to energy-efficiency considerations, RCU is forbidden from
-disturbing idle CPUs.
-CPUs are therefore required to notify RCU when entering or leaving idle
-state, which they do via fully ordered value-returning atomic operations
-on a per-CPU variable.
-The ordering effects are as shown below:
-
-</p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%">
-
-<p>The RCU grace-period kernel thread samples the per-CPU idleness
-variable while holding the corresponding CPU's leaf <tt>rcu_node</tt>
-structure's <tt>-&gt;lock</tt>.
-This means that any RCU read-side critical sections that precede the
-idle period (the oval near the top of the diagram above) will happen
-before the end of the current grace period.
-Similarly, the beginning of the current grace period will happen before
-any RCU read-side critical sections that follow the
-idle period (the oval near the bottom of the diagram above).
-
-<p>Plumbing this into the full grace-period execution is described
-<a href="#Forcing Quiescent States">below</a>.
-
-<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4>
-
-<p>RCU is also forbidden from disturbing offline CPUs, which might well
-be powered off and removed from the system completely.
-CPUs are therefore required to notify RCU of their comings and goings
-as part of the corresponding CPU hotplug operations.
-The ordering effects are shown below:
-
-</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%">
-
-<p>Because CPU hotplug operations are much less frequent than idle transitions,
-they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt>
-structure's <tt>-&gt;lock</tt> and update this structure's
-<tt>-&gt;qsmaskinitnext</tt>.
-The RCU grace-period kernel thread samples this mask to detect CPUs
-having gone offline since the beginning of this grace period.
-
-<p>Plumbing this into the full grace-period execution is described
-<a href="#Forcing Quiescent States">below</a>.
-
-<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4>
-
-<p>As noted above, idle and offline CPUs cannot report their own
-quiescent states, and therefore the grace-period kernel thread
-must do the reporting on their behalf.
-This process is called &ldquo;forcing quiescent states&rdquo;, it is
-repeated every few jiffies, and its ordering effects are shown below:
-
-</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%">
-
-<p>Each pass of quiescent state forcing is guaranteed to traverse the
-leaf <tt>rcu_node</tt> structures, and if there are no new quiescent
-states due to recently idled and/or offlined CPUs, then only the
-leaves are traversed.
-However, if there is a newly offlined CPU as illustrated on the left
-or a newly idled CPU as illustrated on the right, the corresponding
-quiescent state will be driven up towards the root.
-As with self-reported quiescent states, the upwards driving stops
-once it reaches an <tt>rcu_node</tt> structure that has quiescent
-states outstanding from other CPUs.
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- The leftmost drive to root stopped before it reached
- the root <tt>rcu_node</tt> structure, which means that
- there are still CPUs subordinate to that structure on
- which the current grace period is waiting.
- Given that, how is it possible that the rightmost drive
- to root ended the grace period?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- Good analysis!
- It is in fact impossible in the absence of bugs in RCU.
- But this diagram is complex enough as it is, so simplicity
- overrode accuracy.
- You can think of it as poetic license, or you can think of
- it as misdirection that is resolved in the
- <a href="#Putting It All Together">stitched-together diagram</a>.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
-
-<h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4>
-
-<p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree
-breadth-first advancing all the <tt>-&gt;gp_seq</tt> fields, then it
-advances the <tt>rcu_state</tt> structure's <tt>-&gt;gp_seq</tt> field.
-The ordering effects are shown below:
-
-</p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%">
-
-<p>As indicated by the oval at the bottom of the diagram, once
-grace-period cleanup is complete, the next grace period can begin.
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- But when precisely does the grace period end?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- There is no useful single point at which the grace period
- can be said to end.
- The earliest reasonable candidate is as soon as the last
- CPU has reported its quiescent state, but it may be some
- milliseconds before RCU becomes aware of this.
- The latest reasonable candidate is once the <tt>rcu_state</tt>
- structure's <tt>-&gt;gp_seq</tt> field has been updated,
- but it is quite possible that some CPUs have already completed
- phase two of their updates by that time.
- In short, if you are going to work with RCU, you need to
- learn to embrace uncertainty.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
-
-
-<h4><a name="Callback Invocation">Callback Invocation</a></h4>
-
-<p>Once a given CPU's leaf <tt>rcu_node</tt> structure's
-<tt>-&gt;gp_seq</tt> field has been updated, that CPU can begin
-invoking its RCU callbacks that were waiting for this grace period
-to end.
-These callbacks are identified by <tt>rcu_advance_cbs()</tt>,
-which is usually invoked by <tt>__note_gp_changes()</tt>.
-As shown in the diagram below, this invocation can be triggered by
-the scheduling-clock interrupt (<tt>rcu_sched_clock_irq()</tt> on
-the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on
-the right, but only for kernels build with
-<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>).
-Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in
-<tt>rcu_do_batch()</tt> invoking the callbacks, which in turn
-allows those callbacks to carry out (either directly or indirectly
-via wakeup) the needed phase-two processing for each update.
-
-</p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%">
-
-<p>Please note that callback invocation can also be prompted by any
-number of corner-case code paths, for example, when a CPU notes that
-it has excessive numbers of callbacks queued.
-In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's
-<tt>-&gt;lock</tt> before invoking callbacks, which preserves the
-required ordering against the newly completed grace period.
-
-<p>However, if the callback function communicates to other CPUs,
-for example, doing a wakeup, then it is that function's responsibility
-to maintain ordering.
-For example, if the callback function wakes up a task that runs on
-some other CPU, proper ordering must in place in both the callback
-function and the task being awakened.
-To see why this is important, consider the top half of the
-<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram.
-The callback might be running on a CPU corresponding to the leftmost
-leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on
-a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure,
-and the grace-period kernel thread might not yet have reached the
-rightmost leaf.
-In this case, the grace period's memory ordering might not yet have
-reached that CPU, so again the callback function and the awakened
-task must supply proper ordering.
-
-<h3><a name="Putting It All Together">Putting It All Together</a></h3>
-
-<p>A stitched-together diagram is
-<a href="Tree-RCU-Diagram.html">here</a>.
-
-<h3><a name="Legal Statement">
-Legal Statement</a></h3>
-
-<p>This work represents the view of the author and does not necessarily
-represent the view of IBM.
-
-</p><p>Linux is a registered trademark of Linus Torvalds.
-
-</p><p>Other company, product, and service names may be trademarks or
-service marks of others.
-
-</body></html>
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
new file mode 100644
index 000000000000..1a8b129cfc04
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
@@ -0,0 +1,624 @@
+======================================================
+A Tour Through TREE_RCU's Grace-Period Memory Ordering
+======================================================
+
+August 8, 2017
+
+This article was contributed by Paul E.&nbsp;McKenney
+
+Introduction
+============
+
+This document gives a rough visual overview of how Tree RCU's
+grace-period memory ordering guarantee is provided.
+
+What Is Tree RCU's Grace Period Memory Ordering Guarantee?
+==========================================================
+
+RCU grace periods provide extremely strong memory-ordering guarantees
+for non-idle non-offline code.
+Any code that happens after the end of a given RCU grace period is guaranteed
+to see the effects of all accesses prior to the beginning of that grace
+period that are within RCU read-side critical sections.
+Similarly, any code that happens before the beginning of a given RCU grace
+period is guaranteed to see the effects of all accesses following the end
+of that grace period that are within RCU read-side critical sections.
+
+Note well that RCU-sched read-side critical sections include any region
+of code for which preemption is disabled.
+Given that each individual machine instruction can be thought of as
+an extremely small region of preemption-disabled code, one can think of
+``synchronize_rcu()`` as ``smp_mb()`` on steroids.
+
+RCU updaters use this guarantee by splitting their updates into
+two phases, one of which is executed before the grace period and
+the other of which is executed after the grace period.
+In the most common use case, phase one removes an element from
+a linked RCU-protected data structure, and phase two frees that element.
+For this to work, any readers that have witnessed state prior to the
+phase-one update (in the common case, removal) must not witness state
+following the phase-two update (in the common case, freeing).
+
+The RCU implementation provides this guarantee using a network
+of lock-based critical sections, memory barriers, and per-CPU
+processing, as is described in the following sections.
+
+Tree RCU Grace Period Memory Ordering Building Blocks
+=====================================================
+
+The workhorse for RCU's grace-period memory ordering is the
+critical section for the ``rcu_node`` structure's
+``-&gt;lock``. These critical sections use helper functions for lock
+acquisition, including ``raw_spin_lock_rcu_node()``,
+``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``.
+Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``,
+``raw_spin_unlock_irq_rcu_node()``, and
+``raw_spin_unlock_irqrestore_rcu_node()``, respectively.
+For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided.
+The key point is that the lock-acquisition functions, including
+``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()``
+immediately after successful acquisition of the lock.
+
+Therefore, for any given ``rcu_node`` structure, any access
+happening before one of the above lock-release functions will be seen
+by all CPUs as happening before any access happening after a later
+one of the above lock-acquisition functions.
+Furthermore, any access happening before one of the
+above lock-release function on any given CPU will be seen by all
+CPUs as happening before any access happening after a later one
+of the above lock-acquisition functions executing on that same CPU,
+even if the lock-release and lock-acquisition functions are operating
+on different ``rcu_node`` structures.
+Tree RCU uses these two ordering guarantees to form an ordering
+network among all CPUs that were in any way involved in the grace
+period, including any CPUs that came online or went offline during
+the grace period in question.
+
+The following litmus test exhibits the ordering effects of these
+lock-acquisition and lock-release functions::
+
+ 1 int x, y, z;
+ 2
+ 3 void task0(void)
+ 4 {
+ 5 raw_spin_lock_rcu_node(rnp);
+ 6 WRITE_ONCE(x, 1);
+ 7 r1 = READ_ONCE(y);
+ 8 raw_spin_unlock_rcu_node(rnp);
+ 9 }
+ 10
+ 11 void task1(void)
+ 12 {
+ 13 raw_spin_lock_rcu_node(rnp);
+ 14 WRITE_ONCE(y, 1);
+ 15 r2 = READ_ONCE(z);
+ 16 raw_spin_unlock_rcu_node(rnp);
+ 17 }
+ 18
+ 19 void task2(void)
+ 20 {
+ 21 WRITE_ONCE(z, 1);
+ 22 smp_mb();
+ 23 r3 = READ_ONCE(x);
+ 24 }
+ 25
+ 26 WARN_ON(r1 == 0 &amp;&amp; r2 == 0 &amp;&amp; r3 == 0);
+
+The ``WARN_ON()`` is evaluated at &ldquo;the end of time&rdquo;,
+after all changes have propagated throughout the system.
+Without the ``smp_mb__after_unlock_lock()`` provided by the
+acquisition functions, this ``WARN_ON()`` could trigger, for example
+on PowerPC.
+The ``smp_mb__after_unlock_lock()`` invocations prevent this
+``WARN_ON()`` from triggering.
+
+This approach must be extended to include idle CPUs, which need
+RCU's grace-period memory ordering guarantee to extend to any
+RCU read-side critical sections preceding and following the current
+idle sojourn.
+This case is handled by calls to the strongly ordered
+``atomic_add_return()`` read-modify-write atomic operation that
+is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry
+time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time.
+The grace-period kthread invokes ``rcu_dynticks_snap()`` and
+``rcu_dynticks_in_eqs_since()`` (both of which invoke
+an ``atomic_add_return()`` of zero) to detect idle CPUs.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But what about CPUs that remain offline for the entire grace period? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Such CPUs will be offline at the beginning of the grace period, so |
+| the grace period won't expect quiescent states from them. Races |
+| between grace-period start and CPU-hotplug operations are mediated |
+| by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described |
+| above. |
++-----------------------------------------------------------------------+
+
+The approach must be extended to handle one final case, that of waking a
+task blocked in ``synchronize_rcu()``. This task might be affinitied to
+a CPU that is not yet aware that the grace period has ended, and thus
+might not yet be subject to the grace period's memory ordering.
+Therefore, there is an ``smp_mb()`` after the return from
+``wait_for_completion()`` in the ``synchronize_rcu()`` code path.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| What? Where??? I don't see any ``smp_mb()`` after the return from |
+| ``wait_for_completion()``!!! |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| That would be because I spotted the need for that ``smp_mb()`` during |
+| the creation of this documentation, and it is therefore unlikely to |
+| hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter |
+| Zijlstra, and Jonathan Cameron for asking questions that sensitized |
+| me to the rather elaborate sequence of events that demonstrate the |
+| need for this memory barrier. |
++-----------------------------------------------------------------------+
+
+Tree RCU's grace--period memory-ordering guarantees rely most heavily on
+the ``rcu_node`` structure's ``->lock`` field, so much so that it is
+necessary to abbreviate this pattern in the diagrams in the next
+section. For example, consider the ``rcu_prepare_for_idle()`` function
+shown below, which is one of several functions that enforce ordering of
+newly arrived RCU callbacks against future grace periods:
+
+::
+
+ 1 static void rcu_prepare_for_idle(void)
+ 2 {
+ 3 bool needwake;
+ 4 struct rcu_data *rdp;
+ 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ 6 struct rcu_node *rnp;
+ 7 struct rcu_state *rsp;
+ 8 int tne;
+ 9
+ 10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
+ 11 rcu_is_nocb_cpu(smp_processor_id()))
+ 12 return;
+ 13 tne = READ_ONCE(tick_nohz_active);
+ 14 if (tne != rdtp->tick_nohz_enabled_snap) {
+ 15 if (rcu_cpu_has_callbacks(NULL))
+ 16 invoke_rcu_core();
+ 17 rdtp->tick_nohz_enabled_snap = tne;
+ 18 return;
+ 19 }
+ 20 if (!tne)
+ 21 return;
+ 22 if (rdtp->all_lazy &&
+ 23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
+ 24 rdtp->all_lazy = false;
+ 25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
+ 26 invoke_rcu_core();
+ 27 return;
+ 28 }
+ 29 if (rdtp->last_accelerate == jiffies)
+ 30 return;
+ 31 rdtp->last_accelerate = jiffies;
+ 32 for_each_rcu_flavor(rsp) {
+ 33 rdp = this_cpu_ptr(rsp->rda);
+ 34 if (rcu_segcblist_pend_cbs(&rdp->cblist))
+ 35 continue;
+ 36 rnp = rdp->mynode;
+ 37 raw_spin_lock_rcu_node(rnp);
+ 38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+ 39 raw_spin_unlock_rcu_node(rnp);
+ 40 if (needwake)
+ 41 rcu_gp_kthread_wake(rsp);
+ 42 }
+ 43 }
+
+But the only part of ``rcu_prepare_for_idle()`` that really matters for
+this discussion are lines 37–39. We will therefore abbreviate this
+function as follows:
+
+.. kernel-figure:: rcu_node-lock.svg
+
+The box represents the ``rcu_node`` structure's ``->lock`` critical
+section, with the double line on top representing the additional
+``smp_mb__after_unlock_lock()``.
+
+Tree RCU Grace Period Memory Ordering Components
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Tree RCU's grace-period memory-ordering guarantee is provided by a
+number of RCU components:
+
+#. `Callback Registry`_
+#. `Grace-Period Initialization`_
+#. `Self-Reported Quiescent States`_
+#. `Dynamic Tick Interface`_
+#. `CPU-Hotplug Interface`_
+#. `Forcing Quiescent States`_
+#. `Grace-Period Cleanup`_
+#. `Callback Invocation`_
+
+Each of the following section looks at the corresponding component in
+detail.
+
+Callback Registry
+^^^^^^^^^^^^^^^^^
+
+If RCU's grace-period guarantee is to mean anything at all, any access
+that happens before a given invocation of ``call_rcu()`` must also
+happen before the corresponding grace period. The implementation of this
+portion of RCU's grace period guarantee is shown in the following
+figure:
+
+.. kernel-figure:: TreeRCU-callback-registry.svg
+
+Because ``call_rcu()`` normally acts only on CPU-local state, it
+provides no ordering guarantees, either for itself or for phase one of
+the update (which again will usually be removal of an element from an
+RCU-protected data structure). It simply enqueues the ``rcu_head``
+structure on a per-CPU list, which cannot become associated with a grace
+period until a later call to ``rcu_accelerate_cbs()``, as shown in the
+diagram above.
+
+One set of code paths shown on the left invokes ``rcu_accelerate_cbs()``
+via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the
+current CPU is inundated with queued ``rcu_head`` structures) or more
+likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle
+is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which
+invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The
+final code path on the right is taken only in kernels built with
+``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via
+``rcu_advance_cbs()``, ``rcu_migrate_callbacks``,
+``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn
+is invoked on a surviving CPU after the outgoing CPU has been completely
+offlined.
+
+There are a few other code paths within grace-period processing that
+opportunistically invoke ``rcu_accelerate_cbs()``. However, either way,
+all of the CPU's recently queued ``rcu_head`` structures are associated
+with a future grace-period number under the protection of the CPU's lead
+``rcu_node`` structure's ``->lock``. In all cases, there is full
+ordering against any prior critical section for that same ``rcu_node``
+structure's ``->lock``, and also full ordering against any of the
+current task's or CPU's prior critical sections for any ``rcu_node``
+structure's ``->lock``.
+
+The next section will show how this ordering ensures that any accesses
+prior to the ``call_rcu()`` (particularly including phase one of the
+update) happen before the start of the corresponding grace period.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But what about ``synchronize_rcu()``? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, |
+| which invokes it. So either way, it eventually comes down to |
+| ``call_rcu()``. |
++-----------------------------------------------------------------------+
+
+Grace-Period Initialization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Grace-period initialization is carried out by the grace-period kernel
+thread, which makes several passes over the ``rcu_node`` tree within the
+``rcu_gp_init()`` function. This means that showing the full flow of
+ordering through the grace-period computation will require duplicating
+this tree. If you find this confusing, please note that the state of the
+``rcu_node`` changes over time, just like Heraclitus's river. However,
+to keep the ``rcu_node`` river tractable, the grace-period kernel
+thread's traversals are presented in multiple parts, starting in this
+section with the various phases of grace-period initialization.
+
+The first ordering-related grace-period initialization action is to
+advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number
+counter, as shown below:
+
+.. kernel-figure:: TreeRCU-gp-init-1.svg
+
+The actual increment is carried out using ``smp_store_release()``, which
+helps reject false-positive RCU CPU stall detection. Note that only the
+root ``rcu_node`` structure is touched.
+
+The first pass through the ``rcu_node`` tree updates bitmasks based on
+CPUs having come online or gone offline since the start of the previous
+grace period. In the common case where the number of online CPUs for
+this ``rcu_node`` structure has not transitioned to or from zero, this
+pass will scan only the leaf ``rcu_node`` structures. However, if the
+number of online CPUs for a given leaf ``rcu_node`` structure has
+transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the
+first incoming CPU. Similarly, if the number of online CPUs for a given
+leaf ``rcu_node`` structure has transitioned to zero,
+``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU.
+The diagram below shows the path of ordering if the leftmost
+``rcu_node`` structure onlines its first CPU and if the next
+``rcu_node`` structure has no online CPUs (or, alternatively if the
+leftmost ``rcu_node`` structure offlines its last CPU and if the next
+``rcu_node`` structure has no online CPUs).
+
+.. kernel-figure:: TreeRCU-gp-init-1.svg
+
+The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses
+breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field
+to the newly advanced value from the ``rcu_state`` structure, as shown
+in the following diagram.
+
+.. kernel-figure:: TreeRCU-gp-init-1.svg
+
+This change will also cause each CPU's next call to
+``__note_gp_changes()`` to notice that a new grace period has started,
+as described in the next section. But because the grace-period kthread
+started the grace period at the root (with the advancing of the
+``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf
+``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of
+the start of the grace period will happen after the actual start of the
+grace period.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But what about the CPU that started the grace period? Why wouldn't it |
+| see the start of the grace period right when it started that grace |
+| period? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| In some deep philosophical and overly anthromorphized sense, yes, the |
+| CPU starting the grace period is immediately aware of having done so. |
+| However, if we instead assume that RCU is not self-aware, then even |
+| the CPU starting the grace period does not really become aware of the |
+| start of this grace period until its first call to |
+| ``__note_gp_changes()``. On the other hand, this CPU potentially gets |
+| early notification because it invokes ``__note_gp_changes()`` during |
+| its last ``rcu_gp_init()`` pass through its leaf ``rcu_node`` |
+| structure. |
++-----------------------------------------------------------------------+
+
+Self-Reported Quiescent States
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When all entities that might block the grace period have reported
+quiescent states (or as described in a later section, had quiescent
+states reported on their behalf), the grace period can end. Online
+non-idle CPUs report their own quiescent states, as shown in the
+following diagram:
+
+.. kernel-figure:: TreeRCU-qs.svg
+
+This is for the last CPU to report a quiescent state, which signals the
+end of the grace period. Earlier quiescent states would push up the
+``rcu_node`` tree only until they encountered an ``rcu_node`` structure
+that is waiting for additional quiescent states. However, ordering is
+nevertheless preserved because some later quiescent state will acquire
+that ``rcu_node`` structure's ``->lock``.
+
+Any number of events can lead up to a CPU invoking ``note_gp_changes``
+(or alternatively, directly invoking ``__note_gp_changes()``), at which
+point that CPU will notice the start of a new grace period while holding
+its leaf ``rcu_node`` lock. Therefore, all execution shown in this
+diagram happens after the start of the grace period. In addition, this
+CPU will consider any RCU read-side critical section that started before
+the invocation of ``__note_gp_changes()`` to have started before the
+grace period, and thus a critical section that the grace period must
+wait on.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But a RCU read-side critical section might have started after the |
+| beginning of the grace period (the advancing of ``->gp_seq`` from |
+| earlier), so why should the grace period wait on such a critical |
+| section? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| It is indeed not necessary for the grace period to wait on such a |
+| critical section. However, it is permissible to wait on it. And it is |
+| furthermore important to wait on it, as this lazy approach is far |
+| more scalable than a “big bang” all-at-once grace-period start could |
+| possibly be. |
++-----------------------------------------------------------------------+
+
+If the CPU does a context switch, a quiescent state will be noted by
+``rcu_note_context_switch()`` on the left. On the other hand, if the CPU
+takes a scheduler-clock interrupt while executing in usermode, a
+quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right.
+Either way, the passage through a quiescent state will be noted in a
+per-CPU variable.
+
+The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for
+example, after the next scheduler-clock interrupt), ``rcu_core()`` will
+invoke ``rcu_check_quiescent_state()``, which will notice the recorded
+quiescent state, and invoke ``rcu_report_qs_rdp()``. If
+``rcu_report_qs_rdp()`` verifies that the quiescent state really does
+apply to the current grace period, it invokes ``rcu_report_rnp()`` which
+traverses up the ``rcu_node`` tree as shown at the bottom of the
+diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask``
+field, and propagating up the tree when the result is zero.
+
+Note that traversal passes upwards out of a given ``rcu_node`` structure
+only if the current CPU is reporting the last quiescent state for the
+subtree headed by that ``rcu_node`` structure. A key point is that if a
+CPU's traversal stops at a given ``rcu_node`` structure, then there will
+be a later traversal by another CPU (or perhaps the same one) that
+proceeds upwards from that point, and the ``rcu_node`` ``->lock``
+guarantees that the first CPU's quiescent state happens before the
+remainder of the second CPU's traversal. Applying this line of thought
+repeatedly shows that all CPUs' quiescent states happen before the last
+CPU traverses through the root ``rcu_node`` structure, the “last CPU”
+being the one that clears the last bit in the root ``rcu_node``
+structure's ``->qsmask`` field.
+
+Dynamic Tick Interface
+^^^^^^^^^^^^^^^^^^^^^^
+
+Due to energy-efficiency considerations, RCU is forbidden from
+disturbing idle CPUs. CPUs are therefore required to notify RCU when
+entering or leaving idle state, which they do via fully ordered
+value-returning atomic operations on a per-CPU variable. The ordering
+effects are as shown below:
+
+.. kernel-figure:: TreeRCU-dyntick.svg
+
+The RCU grace-period kernel thread samples the per-CPU idleness variable
+while holding the corresponding CPU's leaf ``rcu_node`` structure's
+``->lock``. This means that any RCU read-side critical sections that
+precede the idle period (the oval near the top of the diagram above)
+will happen before the end of the current grace period. Similarly, the
+beginning of the current grace period will happen before any RCU
+read-side critical sections that follow the idle period (the oval near
+the bottom of the diagram above).
+
+Plumbing this into the full grace-period execution is described
+`below <#Forcing%20Quiescent%20States>`__.
+
+CPU-Hotplug Interface
+^^^^^^^^^^^^^^^^^^^^^
+
+RCU is also forbidden from disturbing offline CPUs, which might well be
+powered off and removed from the system completely. CPUs are therefore
+required to notify RCU of their comings and goings as part of the
+corresponding CPU hotplug operations. The ordering effects are shown
+below:
+
+.. kernel-figure:: TreeRCU-hotplug.svg
+
+Because CPU hotplug operations are much less frequent than idle
+transitions, they are heavier weight, and thus acquire the CPU's leaf
+``rcu_node`` structure's ``->lock`` and update this structure's
+``->qsmaskinitnext``. The RCU grace-period kernel thread samples this
+mask to detect CPUs having gone offline since the beginning of this
+grace period.
+
+Plumbing this into the full grace-period execution is described
+`below <#Forcing%20Quiescent%20States>`__.
+
+Forcing Quiescent States
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+As noted above, idle and offline CPUs cannot report their own quiescent
+states, and therefore the grace-period kernel thread must do the
+reporting on their behalf. This process is called “forcing quiescent
+states”, it is repeated every few jiffies, and its ordering effects are
+shown below:
+
+.. kernel-figure:: TreeRCU-gp-fqs.svg
+
+Each pass of quiescent state forcing is guaranteed to traverse the leaf
+``rcu_node`` structures, and if there are no new quiescent states due to
+recently idled and/or offlined CPUs, then only the leaves are traversed.
+However, if there is a newly offlined CPU as illustrated on the left or
+a newly idled CPU as illustrated on the right, the corresponding
+quiescent state will be driven up towards the root. As with
+self-reported quiescent states, the upwards driving stops once it
+reaches an ``rcu_node`` structure that has quiescent states outstanding
+from other CPUs.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| The leftmost drive to root stopped before it reached the root |
+| ``rcu_node`` structure, which means that there are still CPUs |
+| subordinate to that structure on which the current grace period is |
+| waiting. Given that, how is it possible that the rightmost drive to |
+| root ended the grace period? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Good analysis! It is in fact impossible in the absence of bugs in |
+| RCU. But this diagram is complex enough as it is, so simplicity |
+| overrode accuracy. You can think of it as poetic license, or you can |
+| think of it as misdirection that is resolved in the |
+| `stitched-together diagram <#Putting%20It%20All%20Together>`__. |
++-----------------------------------------------------------------------+
+
+Grace-Period Cleanup
+^^^^^^^^^^^^^^^^^^^^
+
+Grace-period cleanup first scans the ``rcu_node`` tree breadth-first
+advancing all the ``->gp_seq`` fields, then it advances the
+``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are
+shown below:
+
+.. kernel-figure:: TreeRCU-gp-cleanup.svg
+
+As indicated by the oval at the bottom of the diagram, once grace-period
+cleanup is complete, the next grace period can begin.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But when precisely does the grace period end? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| There is no useful single point at which the grace period can be said |
+| to end. The earliest reasonable candidate is as soon as the last CPU |
+| has reported its quiescent state, but it may be some milliseconds |
+| before RCU becomes aware of this. The latest reasonable candidate is |
+| once the ``rcu_state`` structure's ``->gp_seq`` field has been |
+| updated, but it is quite possible that some CPUs have already |
+| completed phase two of their updates by that time. In short, if you |
+| are going to work with RCU, you need to learn to embrace uncertainty. |
++-----------------------------------------------------------------------+
+
+Callback Invocation
+^^^^^^^^^^^^^^^^^^^
+
+Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has
+been updated, that CPU can begin invoking its RCU callbacks that were
+waiting for this grace period to end. These callbacks are identified by
+``rcu_advance_cbs()``, which is usually invoked by
+``__note_gp_changes()``. As shown in the diagram below, this invocation
+can be triggered by the scheduling-clock interrupt
+(``rcu_sched_clock_irq()`` on the left) or by idle entry
+(``rcu_cleanup_after_idle()`` on the right, but only for kernels build
+with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is
+raised, which results in ``rcu_do_batch()`` invoking the callbacks,
+which in turn allows those callbacks to carry out (either directly or
+indirectly via wakeup) the needed phase-two processing for each update.
+
+.. kernel-figure:: TreeRCU-callback-invocation.svg
+
+Please note that callback invocation can also be prompted by any number
+of corner-case code paths, for example, when a CPU notes that it has
+excessive numbers of callbacks queued. In all cases, the CPU acquires
+its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks,
+which preserves the required ordering against the newly completed grace
+period.
+
+However, if the callback function communicates to other CPUs, for
+example, doing a wakeup, then it is that function's responsibility to
+maintain ordering. For example, if the callback function wakes up a task
+that runs on some other CPU, proper ordering must in place in both the
+callback function and the task being awakened. To see why this is
+important, consider the top half of the `grace-period
+cleanup <#Grace-Period%20Cleanup>`__ diagram. The callback might be
+running on a CPU corresponding to the leftmost leaf ``rcu_node``
+structure, and awaken a task that is to run on a CPU corresponding to
+the rightmost leaf ``rcu_node`` structure, and the grace-period kernel
+thread might not yet have reached the rightmost leaf. In this case, the
+grace period's memory ordering might not yet have reached that CPU, so
+again the callback function and the awakened task must supply proper
+ordering.
+
+Putting It All Together
+~~~~~~~~~~~~~~~~~~~~~~~
+
+A stitched-together diagram is here:
+
+.. kernel-figure:: TreeRCU-gp.svg
+
+Legal Statement
+~~~~~~~~~~~~~~~
+
+This work represents the view of the author and does not necessarily
+represent the view of IBM.
+
+Linux is a registered trademark of Linus Torvalds.
+
+Other company, product, and service names may be trademarks or service
+marks of others.
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
index 2bcd742d6e49..069f6f8371c2 100644
--- a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
@@ -3880,7 +3880,7 @@
font-style="normal"
y="-4418.6582"
x="3745.7725"
- xml:space="preserve">rcu_node_context_switch()</text>
+ xml:space="preserve">rcu_note_context_switch()</text>
</g>
<g
transform="translate(1881.1886,54048.57)"
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
index 779c9ac31a52..7d6c5f7e505c 100644
--- a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
@@ -753,7 +753,7 @@
font-style="normal"
y="-4418.6582"
x="3745.7725"
- xml:space="preserve">rcu_node_context_switch()</text>
+ xml:space="preserve">rcu_note_context_switch()</text>
</g>
<g
transform="translate(3131.2648,-585.6713)"