diff options
Diffstat (limited to 'Documentation/RCU/Design/Data-Structures')
-rw-r--r-- | Documentation/RCU/Design/Data-Structures/Data-Structures.html | 1391 | ||||
-rw-r--r-- | Documentation/RCU/Design/Data-Structures/Data-Structures.rst | 1163 |
2 files changed, 1163 insertions, 1391 deletions
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html deleted file mode 100644 index c30c1957c7e6..000000000000 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ /dev/null @@ -1,1391 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" - "http://www.w3.org/TR/html4/loose.dtd"> - <html> - <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title> - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> - - <p>December 18, 2016</p> - <p>This article was contributed by Paul E. McKenney</p> - -<h3>Introduction</h3> - -This document describes RCU's major data structures and their relationship -to each other. - -<ol> -<li> <a href="#Data-Structure Relationships"> - Data-Structure Relationships</a> -<li> <a href="#The rcu_state Structure"> - The <tt>rcu_state</tt> Structure</a> -<li> <a href="#The rcu_node Structure"> - The <tt>rcu_node</tt> Structure</a> -<li> <a href="#The rcu_segcblist Structure"> - The <tt>rcu_segcblist</tt> Structure</a> -<li> <a href="#The rcu_data Structure"> - The <tt>rcu_data</tt> Structure</a> -<li> <a href="#The rcu_head Structure"> - The <tt>rcu_head</tt> Structure</a> -<li> <a href="#RCU-Specific Fields in the task_struct Structure"> - RCU-Specific Fields in the <tt>task_struct</tt> Structure</a> -<li> <a href="#Accessor Functions"> - Accessor Functions</a> -</ol> - -<h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3> - -<p>RCU is for all intents and purposes a large state machine, and its -data structures maintain the state in such a way as to allow RCU readers -to execute extremely quickly, while also processing the RCU grace periods -requested by updaters in an efficient and extremely scalable fashion. -The efficiency and scalability of RCU updaters is provided primarily -by a combining tree, as shown below: - -</p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%"> - -</p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure -containing a tree of <tt>rcu_node</tt> structures. -Each leaf node of the <tt>rcu_node</tt> tree has up to 16 -<tt>rcu_data</tt> structures associated with it, so that there -are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures, -one for each possible CPU. -This structure is adjusted at boot time, if needed, to handle the -common case where <tt>nr_cpu_ids</tt> is much less than -<tt>NR_CPUs</tt>. -For example, a number of Linux distributions set <tt>NR_CPUs=4096</tt>, -which results in a three-level <tt>rcu_node</tt> tree. -If the actual hardware has only 16 CPUs, RCU will adjust itself -at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node. - -</p><p>The purpose of this combining tree is to allow per-CPU events -such as quiescent states, dyntick-idle transitions, -and CPU hotplug operations to be processed efficiently -and scalably. -Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures, -and other events are recorded by the leaf-level <tt>rcu_node</tt> -structures. -All of these events are combined at each level of the tree until finally -grace periods are completed at the tree's root <tt>rcu_node</tt> -structure. -A grace period can be completed at the root once every CPU -(or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task) -has passed through a quiescent state. -Once a grace period has completed, record of that fact is propagated -back down the tree. - -</p><p>As can be seen from the diagram, on a 64-bit system -a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout -of 64 at the root and a fanout of 16 at the leaves. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why isn't the fanout at the leaves also 64? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Because there are more types of events that affect the leaf-level - <tt>rcu_node</tt> structures than further up the tree. - Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of - 64, the contention on these structures' <tt>->structures</tt> - becomes excessive. - Experimentation on a wide variety of systems has shown that a fanout - of 16 works well for the leaves of the <tt>rcu_node</tt> tree. - </font> - - <p><font color="ffffff">Of course, further experience with - systems having hundreds or thousands of CPUs may demonstrate - that the fanout for the non-leaf <tt>rcu_node</tt> structures - must also be reduced. - Such reduction can be easily carried out when and if it proves - necessary. - In the meantime, if you are using such a system and running into - contention problems on the non-leaf <tt>rcu_node</tt> structures, - you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration - parameter to reduce the non-leaf fanout as needed. - </font> - - <p><font color="ffffff">Kernels built for systems with - strong NUMA characteristics might also need to adjust - <tt>CONFIG_RCU_FANOUT</tt> so that the domains of the - <tt>rcu_node</tt> structures align with hardware boundaries. - However, there has thus far been no need for this. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>If your system has more than 1,024 CPUs (or more than 512 CPUs on -a 32-bit system), then RCU will automatically add more levels to the -tree. -For example, if you are crazy enough to build a 64-bit system with 65,536 -CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows: - -</p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%"> - -</p><p>RCU currently permits up to a four-level tree, which on a 64-bit system -accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for -32-bit systems. -On the other hand, you can set both <tt>CONFIG_RCU_FANOUT</tt> and -<tt>CONFIG_RCU_FANOUT_LEAF</tt> to be as small as 2, which would result -in a 16-CPU test using a 4-level tree. -This can be useful for testing large-system capabilities on small test -machines. - -</p><p>This multi-level combining tree allows us to get most of the -performance and scalability -benefits of partitioning, even though RCU grace-period detection is -inherently a global operation. -The trick here is that only the last CPU to report a quiescent state -into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt> -structure at the next level up the tree. -This means that at the leaf-level <tt>rcu_node</tt> structure, only -one access out of sixteen will progress up the tree. -For the internal <tt>rcu_node</tt> structures, the situation is even -more extreme: Only one access out of sixty-four will progress up -the tree. -Because the vast majority of the CPUs do not progress up the tree, -the lock contention remains roughly constant up the tree. -No matter how many CPUs there are in the system, at most 64 quiescent-state -reports per grace period will progress all the way to the root -<tt>rcu_node</tt> structure, thus ensuring that the lock contention -on that root <tt>rcu_node</tt> structure remains acceptably low. - -</p><p>In effect, the combining tree acts like a big shock absorber, -keeping lock contention under control at all tree levels regardless -of the level of loading on the system. - -</p><p>RCU updaters wait for normal grace periods by registering -RCU callbacks, either directly via <tt>call_rcu()</tt> -or indirectly via <tt>synchronize_rcu()</tt> and friends. -RCU callbacks are represented by <tt>rcu_head</tt> structures, -which are queued on <tt>rcu_data</tt> structures while they are -waiting for a grace period to elapse, as shown in the following figure: - -</p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%"> - -</p><p>This figure shows how <tt>TREE_RCU</tt>'s and -<tt>PREEMPT_RCU</tt>'s major data structures are related. -Lesser data structures will be introduced with the algorithms that -make use of them. - -</p><p>Note that each of the data structures in the above figure has -its own synchronization: - -<p><ol> -<li> Each <tt>rcu_state</tt> structures has a lock and a mutex, - and some fields are protected by the corresponding root - <tt>rcu_node</tt> structure's lock. -<li> Each <tt>rcu_node</tt> structure has a spinlock. -<li> The fields in <tt>rcu_data</tt> are private to the corresponding - CPU, although a few can be read and written by other CPUs. -</ol> - -<p>It is important to note that different data structures can have -very different ideas about the state of RCU at any given time. -For but one example, awareness of the start or end of a given RCU -grace period propagates slowly through the data structures. -This slow propagation is absolutely necessary for RCU to have good -read-side performance. -If this balkanized implementation seems foreign to you, one useful -trick is to consider each instance of these data structures to be -a different person, each having the usual slightly different -view of reality. - -</p><p>The general role of each of these data structures is as -follows: - -</p><ol> -<li> <tt>rcu_state</tt>: - This structure forms the interconnection between the - <tt>rcu_node</tt> and <tt>rcu_data</tt> structures, - tracks grace periods, serves as short-term repository - for callbacks orphaned by CPU-hotplug events, - maintains <tt>rcu_barrier()</tt> state, - tracks expedited grace-period state, - and maintains state used to force quiescent states when - grace periods extend too long, -<li> <tt>rcu_node</tt>: This structure forms the combining - tree that propagates quiescent-state - information from the leaves to the root, and also propagates - grace-period information from the root to the leaves. - It provides local copies of the grace-period state in order - to allow this information to be accessed in a synchronized - manner without suffering the scalability limitations that - would otherwise be imposed by global locking. - In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists - of tasks that have blocked while in their current - RCU read-side critical section. - In <tt>CONFIG_PREEMPT_RCU</tt> with - <tt>CONFIG_RCU_BOOST</tt>, it manages the - per-<tt>rcu_node</tt> priority-boosting - kernel threads (kthreads) and state. - Finally, it records CPU-hotplug state in order to determine - which CPUs should be ignored during a given grace period. -<li> <tt>rcu_data</tt>: This per-CPU structure is the - focus of quiescent-state detection and RCU callback queuing. - It also tracks its relationship to the corresponding leaf - <tt>rcu_node</tt> structure to allow more-efficient - propagation of quiescent states up the <tt>rcu_node</tt> - combining tree. - Like the <tt>rcu_node</tt> structure, it provides a local - copy of the grace-period information to allow for-free - synchronized - access to this information from the corresponding CPU. - Finally, this structure records past dyntick-idle state - for the corresponding CPU and also tracks statistics. -<li> <tt>rcu_head</tt>: - This structure represents RCU callbacks, and is the - only structure allocated and managed by RCU users. - The <tt>rcu_head</tt> structure is normally embedded - within the RCU-protected data structure. -</ol> - -<p>If all you wanted from this article was a general notion of how -RCU's data structures are related, you are done. -Otherwise, each of the following sections give more details on -the <tt>rcu_state</tt>, <tt>rcu_node</tt> and <tt>rcu_data</tt> data -structures. - -<h3><a name="The rcu_state Structure"> -The <tt>rcu_state</tt> Structure</a></h3> - -<p>The <tt>rcu_state</tt> structure is the base structure that -represents the state of RCU in the system. -This structure forms the interconnection between the -<tt>rcu_node</tt> and <tt>rcu_data</tt> structures, -tracks grace periods, contains the lock used to -synchronize with CPU-hotplug events, -and maintains state used to force quiescent states when -grace periods extend too long, - -</p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed, -singly and in groups, in the following sections. -The more specialized fields are covered in the discussion of their -use. - -<h5>Relationship to rcu_node and rcu_data Structures</h5> - -This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> - 1 struct rcu_node node[NUM_RCU_NODES]; - 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; - 3 struct rcu_data __percpu *rda; -</pre> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Wait a minute! - You said that the <tt>rcu_node</tt> structures formed a tree, - but they are declared as a flat array! - What gives? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - The tree is laid out in the array. - The first node In the array is the head, the next set of nodes in the - array are children of the head node, and so on until the last set of - nodes in the array are the leaves. - </font> - - <p><font color="ffffff">See the following diagrams to see how - this works. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>The <tt>rcu_node</tt> tree is embedded into the -<tt>->node[]</tt> array as shown in the following figure: - -</p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%"> - -</p><p>One interesting consequence of this mapping is that a -breadth-first traversal of the tree is implemented as a simple -linear scan of the array, which is in fact what the -<tt>rcu_for_each_node_breadth_first()</tt> macro does. -This macro is used at the beginning and ends of grace periods. - -</p><p>Each entry of the <tt>->level</tt> array references -the first <tt>rcu_node</tt> structure on the corresponding level -of the tree, for example, as shown below: - -</p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%"> - -</p><p>The zero<sup>th</sup> element of the array references the root -<tt>rcu_node</tt> structure, the first element references the -first child of the root <tt>rcu_node</tt>, and finally the second -element references the first leaf <tt>rcu_node</tt> structure. - -</p><p>For whatever it is worth, if you draw the tree to be tree-shaped -rather than array-shaped, it is easy to draw a planar representation: - -</p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%"> - -</p><p>Finally, the <tt>->rda</tt> field references a per-CPU -pointer to the corresponding CPU's <tt>rcu_data</tt> structure. - -</p><p>All of these fields are constant once initialization is complete, -and therefore need no protection. - -<h5>Grace-Period Tracking</h5> - -<p>This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_seq; -</pre> - -<p>RCU grace periods are numbered, and -the <tt>->gp_seq</tt> field contains the current grace-period -sequence number. -The bottom two bits are the state of the current grace period, -which can be zero for not yet started or one for in progress. -In other words, if the bottom two bits of <tt>->gp_seq</tt> are -zero, then RCU is idle. -Any other value in the bottom two bits indicates that something is broken. -This field is protected by the root <tt>rcu_node</tt> structure's -<tt>->lock</tt> field. - -</p><p>There are <tt>->gp_seq</tt> fields -in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures -as well. -The fields in the <tt>rcu_state</tt> structure represent the -most current value, and those of the other structures are compared -in order to detect the beginnings and ends of grace periods in a distributed -fashion. -The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt> -(down the tree from the root to the leaves) to <tt>rcu_data</tt>. - -<h5>Miscellaneous</h5> - -<p>This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_max; - 2 char abbr; - 3 char *name; -</pre> - -<p>The <tt>->gp_max</tt> field tracks the duration of the longest -grace period in jiffies. -It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>. - -<p>The <tt>->name</tt> and <tt>->abbr</tt> fields distinguish -between preemptible RCU (“rcu_preempt” and “p”) -and non-preemptible RCU (“rcu_sched” and “s”). -These fields are used for diagnostic and tracing purposes. - -<h3><a name="The rcu_node Structure"> -The <tt>rcu_node</tt> Structure</a></h3> - -<p>The <tt>rcu_node</tt> structures form the combining -tree that propagates quiescent-state -information from the leaves to the root and also that propagates -grace-period information from the root down to the leaves. -They provides local copies of the grace-period state in order -to allow this information to be accessed in a synchronized -manner without suffering the scalability limitations that -would otherwise be imposed by global locking. -In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists -of tasks that have blocked while in their current -RCU read-side critical section. -In <tt>CONFIG_PREEMPT_RCU</tt> with -<tt>CONFIG_RCU_BOOST</tt>, they manage the -per-<tt>rcu_node</tt> priority-boosting -kernel threads (kthreads) and state. -Finally, they record CPU-hotplug state in order to determine -which CPUs should be ignored during a given grace period. - -</p><p>The <tt>rcu_node</tt> structure's fields are discussed, -singly and in groups, in the following sections. - -<h5>Connection to Combining Tree</h5> - -<p>This portion of the <tt>rcu_node</tt> structure is declared -as follows: - -<pre> - 1 struct rcu_node *parent; - 2 u8 level; - 3 u8 grpnum; - 4 unsigned long grpmask; - 5 int grplo; - 6 int grphi; -</pre> - -<p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt> -one level up in the tree, and is <tt>NULL</tt> for the root -<tt>rcu_node</tt>. -The RCU implementation makes heavy use of this field to push quiescent -states up the tree. -The <tt>->level</tt> field gives the level in the tree, with -the root being at level zero, its children at level one, and so on. -The <tt>->grpnum</tt> field gives this node's position within -the children of its parent, so this number can range between 0 and 31 -on 32-bit systems and between 0 and 63 on 64-bit systems. -The <tt>->level</tt> and <tt>->grpnum</tt> fields are -used only during initialization and for tracing. -The <tt>->grpmask</tt> field is the bitmask counterpart of -<tt>->grpnum</tt>, and therefore always has exactly one bit set. -This mask is used to clear the bit corresponding to this <tt>rcu_node</tt> -structure in its parent's bitmasks, which are described later. -Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields -contain the lowest and highest numbered CPU served by this -<tt>rcu_node</tt> structure, respectively. - -</p><p>All of these fields are constant, and thus do not require any -synchronization. - -<h5>Synchronization</h5> - -<p>This field of the <tt>rcu_node</tt> structure is declared -as follows: - -<pre> - 1 raw_spinlock_t lock; -</pre> - -<p>This field is used to protect the remaining fields in this structure, -unless otherwise stated. -That said, all of the fields in this structure can be accessed without -locking for tracing purposes. -Yes, this can result in confusing traces, but better some tracing confusion -than to be heisenbugged out of existence. - -<h5>Grace-Period Tracking</h5> - -<p>This portion of the <tt>rcu_node</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_seq; - 2 unsigned long gp_seq_needed; -</pre> - -<p>The <tt>rcu_node</tt> structures' <tt>->gp_seq</tt> fields are -the counterparts of the field of the same name in the <tt>rcu_state</tt> -structure. -They each may lag up to one step behind their <tt>rcu_state</tt> -counterpart. -If the bottom two bits of a given <tt>rcu_node</tt> structure's -<tt>->gp_seq</tt> field is zero, then this <tt>rcu_node</tt> -structure believes that RCU is idle. -</p><p>The <tt>>gp_seq</tt> field of each <tt>rcu_node</tt> -structure is updated at the beginning and the end -of each grace period. - -<p>The <tt>->gp_seq_needed</tt> fields record the -furthest-in-the-future grace period request seen by the corresponding -<tt>rcu_node</tt> structure. The request is considered fulfilled when -the value of the <tt>->gp_seq</tt> field equals or exceeds that of -the <tt>->gp_seq_needed</tt> field. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Suppose that this <tt>rcu_node</tt> structure doesn't see - a request for a very long time. - Won't wrapping of the <tt>->gp_seq</tt> field cause - problems? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - No, because if the <tt>->gp_seq_needed</tt> field lags behind the - <tt>->gp_seq</tt> field, the <tt>->gp_seq_needed</tt> field - will be updated at the end of the grace period. - Modulo-arithmetic comparisons therefore will always get the - correct answer, even with wrapping. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h5>Quiescent-State Tracking</h5> - -<p>These fields manage the propagation of quiescent states up the -combining tree. - -</p><p>This portion of the <tt>rcu_node</tt> structure has fields -as follows: - -<pre> - 1 unsigned long qsmask; - 2 unsigned long expmask; - 3 unsigned long qsmaskinit; - 4 unsigned long expmaskinit; -</pre> - -<p>The <tt>->qsmask</tt> field tracks which of this -<tt>rcu_node</tt> structure's children still need to report -quiescent states for the current normal grace period. -Such children will have a value of 1 in their corresponding bit. -Note that the leaf <tt>rcu_node</tt> structures should be -thought of as having <tt>rcu_data</tt> structures as their -children. -Similarly, the <tt>->expmask</tt> field tracks which -of this <tt>rcu_node</tt> structure's children still need to report -quiescent states for the current expedited grace period. -An expedited grace period has -the same conceptual properties as a normal grace period, but the -expedited implementation accepts extreme CPU overhead to obtain -much lower grace-period latency, for example, consuming a few -tens of microseconds worth of CPU time to reduce grace-period -duration from milliseconds to tens of microseconds. -The <tt>->qsmaskinit</tt> field tracks which of this -<tt>rcu_node</tt> structure's children cover for at least -one online CPU. -This mask is used to initialize <tt>->qsmask</tt>, -and <tt>->expmaskinit</tt> is used to initialize -<tt>->expmask</tt> and the beginning of the -normal and expedited grace periods, respectively. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why are these bitmasks protected by locking? - Come on, haven't you heard of atomic instructions??? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Lockless grace-period computation! Such a tantalizing possibility! - </font> - - <p><font color="ffffff">But consider the following sequence of events: - </font> - - <ol> - <li> <font color="ffffff">CPU 0 has been in dyntick-idle - mode for quite some time. - When it wakes up, it notices that the current RCU - grace period needs it to report in, so it sets a - flag where the scheduling clock interrupt will find it. - </font><p> - <li> <font color="ffffff">Meanwhile, CPU 1 is running - <tt>force_quiescent_state()</tt>, - and notices that CPU 0 has been in dyntick idle mode, - which qualifies as an extended quiescent state. - </font><p> - <li> <font color="ffffff">CPU 0's scheduling clock - interrupt fires in the - middle of an RCU read-side critical section, and notices - that the RCU core needs something, so commences RCU softirq - processing. - </font> - <p> - <li> <font color="ffffff">CPU 0's softirq handler - executes and is just about ready - to report its quiescent state up the <tt>rcu_node</tt> - tree. - </font><p> - <li> <font color="ffffff">But CPU 1 beats it to the punch, - completing the current - grace period and starting a new one. - </font><p> - <li> <font color="ffffff">CPU 0 now reports its quiescent - state for the wrong - grace period. - That grace period might now end before the RCU read-side - critical section. - If that happens, disaster will ensue. - </font> - </ol> - - <p><font color="ffffff">So the locking is absolutely required in - order to coordinate clearing of the bits with updating of the - grace-period sequence number in <tt>->gp_seq</tt>. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h5>Blocked-Task Management</h5> - -<p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the -midst of their RCU read-side critical sections, and these tasks -must be tracked explicitly. -The details of exactly why and how they are tracked will be covered -in a separate article on RCU read-side processing. -For now, it is enough to know that the <tt>rcu_node</tt> -structure tracks them. - -<pre> - 1 struct list_head blkd_tasks; - 2 struct list_head *gp_tasks; - 3 struct list_head *exp_tasks; - 4 bool wait_blkd_tasks; -</pre> - -<p>The <tt>->blkd_tasks</tt> field is a list header for -the list of blocked and preempted tasks. -As tasks undergo context switches within RCU read-side critical -sections, their <tt>task_struct</tt> structures are enqueued -(via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt> -field) onto the head of the <tt>->blkd_tasks</tt> list for the -leaf <tt>rcu_node</tt> structure corresponding to the CPU -on which the outgoing context switch executed. -As these tasks later exit their RCU read-side critical sections, -they remove themselves from the list. -This list is therefore in reverse time order, so that if one of the tasks -is blocking the current grace period, all subsequent tasks must -also be blocking that same grace period. -Therefore, a single pointer into this list suffices to track -all tasks blocking a given grace period. -That pointer is stored in <tt>->gp_tasks</tt> for normal -grace periods and in <tt>->exp_tasks</tt> for expedited -grace periods. -These last two fields are <tt>NULL</tt> if either there is -no grace period in flight or if there are no blocked tasks -preventing that grace period from completing. -If either of these two pointers is referencing a task that -removes itself from the <tt>->blkd_tasks</tt> list, -then that task must advance the pointer to the next task on -the list, or set the pointer to <tt>NULL</tt> if there -are no subsequent tasks on the list. - -</p><p>For example, suppose that tasks T1, T2, and T3 are -all hard-affinitied to the largest-numbered CPU in the system. -Then if task T1 blocked in an RCU read-side -critical section, then an expedited grace period started, -then task T2 blocked in an RCU read-side critical section, -then a normal grace period started, and finally task 3 blocked -in an RCU read-side critical section, then the state of the -last leaf <tt>rcu_node</tt> structure's blocked-task list -would be as shown below: - -</p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%"> - -</p><p>Task T1 is blocking both grace periods, task T2 is -blocking only the normal grace period, and task T3 is blocking -neither grace period. -Note that these tasks will not remove themselves from this list -immediately upon resuming execution. -They will instead remain on the list until they execute the outermost -<tt>rcu_read_unlock()</tt> that ends their RCU read-side critical -section. - -<p> -The <tt>->wait_blkd_tasks</tt> field indicates whether or not -the current grace period is waiting on a blocked task. - -<h5>Sizing the <tt>rcu_node</tt> Array</h5> - -<p>The <tt>rcu_node</tt> array is sized via a series of -C-preprocessor expressions as follows: - -<pre> - 1 #ifdef CONFIG_RCU_FANOUT - 2 #define RCU_FANOUT CONFIG_RCU_FANOUT - 3 #else - 4 # ifdef CONFIG_64BIT - 5 # define RCU_FANOUT 64 - 6 # else - 7 # define RCU_FANOUT 32 - 8 # endif - 9 #endif -10 -11 #ifdef CONFIG_RCU_FANOUT_LEAF -12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF -13 #else -14 # ifdef CONFIG_64BIT -15 # define RCU_FANOUT_LEAF 64 -16 # else -17 # define RCU_FANOUT_LEAF 32 -18 # endif -19 #endif -20 -21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) -22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) -23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) -24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) -25 -26 #if NR_CPUS <= RCU_FANOUT_1 -27 # define RCU_NUM_LVLS 1 -28 # define NUM_RCU_LVL_0 1 -29 # define NUM_RCU_NODES NUM_RCU_LVL_0 -30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } -31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } -32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } -33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } -34 #elif NR_CPUS <= RCU_FANOUT_2 -35 # define RCU_NUM_LVLS 2 -36 # define NUM_RCU_LVL_0 1 -37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) -39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } -40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } -41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } -42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } -43 #elif NR_CPUS <= RCU_FANOUT_3 -44 # define RCU_NUM_LVLS 3 -45 # define NUM_RCU_LVL_0 1 -46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) -47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) -49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } -50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } -51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } -52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } -53 #elif NR_CPUS <= RCU_FANOUT_4 -54 # define RCU_NUM_LVLS 4 -55 # define NUM_RCU_LVL_0 1 -56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) -57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) -58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) -60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } -61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } -62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } -63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } -64 #else -65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" -66 #endif -</pre> - -<p>The maximum number of levels in the <tt>rcu_node</tt> structure -is currently limited to four, as specified by lines 21-24 -and the structure of the subsequent “if” statement. -For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which -should be sufficient for the next few years at least. -For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which -should see us through the next decade or so. -This four-level tree also allows kernels built with -<tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs, -which might be useful in very large systems having eight CPUs per -socket (but please note that no one has yet shown any measurable -performance degradation due to misaligned socket and <tt>rcu_node</tt> -boundaries). -In addition, building kernels with a full four levels of <tt>rcu_node</tt> -tree permits better testing of RCU's combining-tree code. - -</p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children -are permitted at each non-leaf level of the <tt>rcu_node</tt> tree. -If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified, -it is set based on the word size of the system, which is also -the Kconfig default. - -</p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are -handled by each leaf <tt>rcu_node</tt> structure. -Experience has shown that allowing a given leaf <tt>rcu_node</tt> -structure to handle 64 CPUs, as permitted by the number of bits in -the <tt>->qsmask</tt> field on a 64-bit system, results in -excessive contention for the leaf <tt>rcu_node</tt> structures' -<tt>->lock</tt> fields. -The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore -limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>. -If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value -selected is based on the word size of the system, just as for -<tt>CONFIG_RCU_FANOUT</tt>. -Lines 11-19 perform this computation. - -</p><p>Lines 21-24 compute the maximum number of CPUs supported by -a single-level (which contains a single <tt>rcu_node</tt> structure), -two-level, three-level, and four-level <tt>rcu_node</tt> tree, -respectively, given the fanout specified by <tt>RCU_FANOUT</tt> -and <tt>RCU_FANOUT_LEAF</tt>. -These numbers of CPUs are retained in the -<tt>RCU_FANOUT_1</tt>, -<tt>RCU_FANOUT_2</tt>, -<tt>RCU_FANOUT_3</tt>, and -<tt>RCU_FANOUT_4</tt> -C-preprocessor variables, respectively. - -</p><p>These variables are used to control the C-preprocessor <tt>#if</tt> -statement spanning lines 26-66 that computes the number of -<tt>rcu_node</tt> structures required for each level of the tree, -as well as the number of levels required. -The number of levels is placed in the <tt>NUM_RCU_LVLS</tt> -C-preprocessor variable by lines 27, 35, 44, and 54. -The number of <tt>rcu_node</tt> structures for the topmost level -of the tree is always exactly one, and this value is unconditionally -placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55. -The rest of the levels (if any) of the <tt>rcu_node</tt> tree -are computed by dividing the maximum number of CPUs by the -fanout supported by the number of levels from the current level down, -rounding up. This computation is performed by lines 37, -46-47, and 56-58. -Lines 31-33, 40-42, 50-52, and 62-63 create initializers -for lockdep lock-class names. -Finally, lines 64-66 produce an error if the maximum number of -CPUs is too large for the specified fanout. - -<h3><a name="The rcu_segcblist Structure"> -The <tt>rcu_segcblist</tt> Structure</a></h3> - -The <tt>rcu_segcblist</tt> structure maintains a segmented list of -callbacks as follows: - -<pre> - 1 #define RCU_DONE_TAIL 0 - 2 #define RCU_WAIT_TAIL 1 - 3 #define RCU_NEXT_READY_TAIL 2 - 4 #define RCU_NEXT_TAIL 3 - 5 #define RCU_CBLIST_NSEGS 4 - 6 - 7 struct rcu_segcblist { - 8 struct rcu_head *head; - 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; -10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; -11 long len; -12 long len_lazy; -13 }; -</pre> - -<p> -The segments are as follows: - -<ol> -<li> <tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed. - These callbacks are ready to be invoked. -<li> <tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the - current grace period. - Note that different CPUs can have different ideas about which - grace period is current, hence the <tt>->gp_seq</tt> field. -<li> <tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next - grace period to start. -<li> <tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been - associated with a grace period. -</ol> - -<p> -The <tt>->head</tt> pointer references the first callback or -is <tt>NULL</tt> if the list contains no callbacks (which is -<i>not</i> the same as being empty). -Each element of the <tt>->tails[]</tt> array references the -<tt>->next</tt> pointer of the last callback in the corresponding -segment of the list, or the list's <tt>->head</tt> pointer if -that segment and all previous segments are empty. -If the corresponding segment is empty but some previous segment is -not empty, then the array element is identical to its predecessor. -Older callbacks are closer to the head of the list, and new callbacks -are added at the tail. -This relationship between the <tt>->head</tt> pointer, the -<tt>->tails[]</tt> array, and the callbacks is shown in this -diagram: - -</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%"> - -</p><p>In this figure, the <tt>->head</tt> pointer references the -first -RCU callback in the list. -The <tt>->tails[RCU_DONE_TAIL]</tt> array element references -the <tt>->head</tt> pointer itself, indicating that none -of the callbacks is ready to invoke. -The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback -CB 2's <tt>->next</tt> pointer, which indicates that -CB 1 and CB 2 are both waiting on the current grace period, -give or take possible disagreements about exactly which grace period -is the current one. -The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element -references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt> -does, which indicates that there are no callbacks waiting on the next -RCU grace period. -The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references -CB 4's <tt>->next</tt> pointer, indicating that all the -remaining RCU callbacks have not yet been assigned to an RCU grace -period. -Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element -always references the last RCU callback's <tt>->next</tt> pointer -unless the callback list is empty, in which case it references -the <tt>->head</tt> pointer. - -<p> -There is one additional important special case for the -<tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt> -when this list is <i>disabled</i>. -Lists are disabled when the corresponding CPU is offline or when -the corresponding CPU's callbacks are offloaded to a kthread, -both of which are described elsewhere. - -</p><p>CPUs advance their callbacks from the -<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the -<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments -as grace periods advance. - -</p><p>The <tt>->gp_seq[]</tt> array records grace-period -numbers corresponding to the list segments. -This is what allows different CPUs to have different ideas as to -which is the current grace period while still avoiding premature -invocation of their callbacks. -In particular, this allows CPUs that go idle for extended periods -to determine which of their callbacks are ready to be invoked after -reawakening. - -</p><p>The <tt>->len</tt> counter contains the number of -callbacks in <tt>->head</tt>, and the -<tt>->len_lazy</tt> contains the number of those callbacks that -are known to only free memory, and whose invocation can therefore -be safely deferred. - -<p><b>Important note</b>: It is the <tt>->len</tt> field that -determines whether or not there are callbacks associated with -this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt> -pointer. -The reason for this is that all the ready-to-invoke callbacks -(that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted -all at once at callback-invocation time (<tt>rcu_do_batch</tt>), due -to which <tt>->head</tt> may be set to NULL if there are no not-done -callbacks remaining in the <tt>rcu_segcblist</tt>. -If callback invocation must be postponed, for example, because a -high-priority process just woke up on this CPU, then the remaining -callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment and -<tt>->head</tt> once again points to the start of the segment. -In short, the head field can briefly be <tt>NULL</tt> even though the -CPU has callbacks present the entire time. -Therefore, it is not appropriate to test the <tt>->head</tt> pointer -for <tt>NULL</tt>. - -<p>In contrast, the <tt>->len</tt> and <tt>->len_lazy</tt> counts -are adjusted only after the corresponding callbacks have been invoked. -This means that the <tt>->len</tt> count is zero only if -the <tt>rcu_segcblist</tt> structure really is devoid of callbacks. -Of course, off-CPU sampling of the <tt>->len</tt> count requires -careful use of appropriate synchronization, for example, memory barriers. -This synchronization can be a bit subtle, particularly in the case -of <tt>rcu_barrier()</tt>. - -<h3><a name="The rcu_data Structure"> -The <tt>rcu_data</tt> Structure</a></h3> - -<p>The <tt>rcu_data</tt> maintains the per-CPU state for the RCU subsystem. -The fields in this structure may be accessed only from the corresponding -CPU (and from tracing) unless otherwise stated. -This structure is the -focus of quiescent-state detection and RCU callback queuing. -It also tracks its relationship to the corresponding leaf -<tt>rcu_node</tt> structure to allow more-efficient -propagation of quiescent states up the <tt>rcu_node</tt> -combining tree. -Like the <tt>rcu_node</tt> structure, it provides a local -copy of the grace-period information to allow for-free -synchronized -access to this information from the corresponding CPU. -Finally, this structure records past dyntick-idle state -for the corresponding CPU and also tracks statistics. - -</p><p>The <tt>rcu_data</tt> structure's fields are discussed, -singly and in groups, in the following sections. - -<h5>Connection to Other Data Structures</h5> - -<p>This portion of the <tt>rcu_data</tt> structure is declared -as follows: - -<pre> - 1 int cpu; - 2 struct rcu_node *mynode; - 3 unsigned long grpmask; - 4 bool beenonline; -</pre> - -<p>The <tt>->cpu</tt> field contains the number of the -corresponding CPU and the <tt>->mynode</tt> field references the -corresponding <tt>rcu_node</tt> structure. -The <tt>->mynode</tt> is used to propagate quiescent states -up the combining tree. -These two fields are constant and therefore do not require synchronization. - -<p>The <tt>->grpmask</tt> field indicates the bit in -the <tt>->mynode->qsmask</tt> corresponding to this -<tt>rcu_data</tt> structure, and is also used when propagating -quiescent states. -The <tt>->beenonline</tt> flag is set whenever the corresponding -CPU comes online, which means that the debugfs tracing need not dump -out any <tt>rcu_data</tt> structure for which this flag is not set. - -<h5>Quiescent-State and Grace-Period Tracking</h5> - -<p>This portion of the <tt>rcu_data</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_seq; - 2 unsigned long gp_seq_needed; - 3 bool cpu_no_qs; - 4 bool core_needs_qs; - 5 bool gpwrap; -</pre> - -<p>The <tt>->gp_seq</tt> field is the counterpart of the field of the same -name in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures. The -<tt>->gp_seq_needed</tt> field is the counterpart of the field of the same -name in the rcu_node</tt> structure. -They may each lag up to one behind their <tt>rcu_node</tt> -counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and -<tt>CONFIG_NO_HZ_FULL</tt> kernels can lag -arbitrarily far behind for CPUs in dyntick-idle mode (but these counters -will catch up upon exit from dyntick-idle mode). -If the lower two bits of a given <tt>rcu_data</tt> structure's -<tt>->gp_seq</tt> are zero, then this <tt>rcu_data</tt> -structure believes that RCU is idle. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - All this replication of the grace period numbers can only cause - massive confusion. - Why not just keep a global sequence number and be done with it??? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Because if there was only a single global sequence - numbers, there would need to be a single global lock to allow - safely accessing and updating it. - And if we are not going to have a single global lock, we need - to carefully manage the numbers on a per-node basis. - Recall from the answer to a previous Quick Quiz that the consequences - of applying a previously sampled quiescent state to the wrong - grace period are quite severe. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>The <tt>->cpu_no_qs</tt> flag indicates that the -CPU has not yet passed through a quiescent state, -while the <tt>->core_needs_qs</tt> flag indicates that the -RCU core needs a quiescent state from the corresponding CPU. -The <tt>->gpwrap</tt> field indicates that the corresponding -CPU has remained idle for so long that the -<tt>gp_seq</tt> counter is in danger of overflow, which -will cause the CPU to disregard the values of its counters on -its next exit from idle. - -<h5>RCU Callback Handling</h5> - -<p>In the absence of CPU-hotplug events, RCU callbacks are invoked by -the same CPU that registered them. -This is strictly a cache-locality optimization: callbacks can and -do get invoked on CPUs other than the one that registered them. -After all, if the CPU that registered a given callback has gone -offline before the callback can be invoked, there really is no other -choice. - -</p><p>This portion of the <tt>rcu_data</tt> structure is declared -as follows: - -<pre> - 1 struct rcu_segcblist cblist; - 2 long qlen_last_fqs_check; - 3 unsigned long n_cbs_invoked; - 4 unsigned long n_nocbs_invoked; - 5 unsigned long n_cbs_orphaned; - 6 unsigned long n_cbs_adopted; - 7 unsigned long n_force_qs_snap; - 8 long blimit; -</pre> - -<p>The <tt>->cblist</tt> structure is the segmented callback list -described earlier. -The CPU advances the callbacks in its <tt>rcu_data</tt> structure -whenever it notices that another RCU grace period has completed. -The CPU detects the completion of an RCU grace period by noticing -that the value of its <tt>rcu_data</tt> structure's -<tt>->gp_seq</tt> field differs from that of its leaf -<tt>rcu_node</tt> structure. -Recall that each <tt>rcu_node</tt> structure's -<tt>->gp_seq</tt> field is updated at the beginnings and ends of each -grace period. - -<p> -The <tt>->qlen_last_fqs_check</tt> and -<tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent -states from <tt>call_rcu()</tt> and friends when callback -lists grow excessively long. - -</p><p>The <tt>->n_cbs_invoked</tt>, -<tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt> -fields count the number of callbacks invoked, -sent to other CPUs when this CPU goes offline, -and received from other CPUs when those other CPUs go offline. -The <tt>->n_nocbs_invoked</tt> is used when the CPU's callbacks -are offloaded to a kthread. - -<p> -Finally, the <tt>->blimit</tt> counter is the maximum number of -RCU callbacks that may be invoked at a given time. - -<h5>Dyntick-Idle Handling</h5> - -<p>This portion of the <tt>rcu_data</tt> structure is declared -as follows: - -<pre> - 1 int dynticks_snap; - 2 unsigned long dynticks_fqs; -</pre> - -The <tt>->dynticks_snap</tt> field is used to take a snapshot -of the corresponding CPU's dyntick-idle state when forcing -quiescent states, and is therefore accessed from other CPUs. -Finally, the <tt>->dynticks_fqs</tt> field is used to -count the number of times this CPU is determined to be in -dyntick-idle state, and is used for tracing and debugging purposes. - -<p> -This portion of the rcu_data structure is declared as follows: - -<pre> - 1 long dynticks_nesting; - 2 long dynticks_nmi_nesting; - 3 atomic_t dynticks; - 4 bool rcu_need_heavy_qs; - 5 bool rcu_urgent_qs; -</pre> - -<p>These fields in the rcu_data structure maintain the per-CPU dyntick-idle -state for the corresponding CPU. -The fields may be accessed only from the corresponding CPU (and from tracing) -unless otherwise stated. - -<p>The <tt>->dynticks_nesting</tt> field counts the -nesting depth of process execution, so that in normal circumstances -this counter has value zero or one. -NMIs, irqs, and tracers are counted by the <tt>->dynticks_nmi_nesting</tt> -field. -Because NMIs cannot be masked, changes to this variable have to be -undertaken carefully using an algorithm provided by Andy Lutomirski. -The initial transition from idle adds one, and nested transitions -add two, so that a nesting level of five is represented by a -<tt>->dynticks_nmi_nesting</tt> value of nine. -This counter can therefore be thought of as counting the number -of reasons why this CPU cannot be permitted to enter dyntick-idle -mode, aside from process-level transitions. - -<p>However, it turns out that when running in non-idle kernel context, -the Linux kernel is fully capable of entering interrupt handlers that -never exit and perhaps also vice versa. -Therefore, whenever the <tt>->dynticks_nesting</tt> field is -incremented up from zero, the <tt>->dynticks_nmi_nesting</tt> field -is set to a large positive number, and whenever the -<tt>->dynticks_nesting</tt> field is decremented down to zero, -the the <tt>->dynticks_nmi_nesting</tt> field is set to zero. -Assuming that the number of misnested interrupts is not sufficient -to overflow the counter, this approach corrects the -<tt>->dynticks_nmi_nesting</tt> field every time the corresponding -CPU enters the idle loop from process context. - -</p><p>The <tt>->dynticks</tt> field counts the corresponding -CPU's transitions to and from either dyntick-idle or user mode, so -that this counter has an even value when the CPU is in dyntick-idle -mode or user mode and an odd value otherwise. The transitions to/from -user mode need to be counted for user mode adaptive-ticks support -(see timers/NO_HZ.txt). - -</p><p>The <tt>->rcu_need_heavy_qs</tt> field is used -to record the fact that the RCU core code would really like to -see a quiescent state from the corresponding CPU, so much so that -it is willing to call for heavy-weight dyntick-counter operations. -This flag is checked by RCU's context-switch and <tt>cond_resched()</tt> -code, which provide a momentary idle sojourn in response. - -</p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record -the fact that the RCU core code would really like to see a quiescent state from -the corresponding CPU, with the various other fields indicating just how badly -RCU wants this quiescent state. -This flag is checked by RCU's context-switch path -(<tt>rcu_note_context_switch</tt>) and the cond_resched code. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why not simply combine the <tt>->dynticks_nesting</tt> - and <tt>->dynticks_nmi_nesting</tt> counters into a - single counter that just counts the number of reasons that - the corresponding CPU is non-idle? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Because this would fail in the presence of interrupts whose - handlers never return and of handlers that manage to return - from a made-up interrupt. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>Additional fields are present for some special-purpose -builds, and are discussed separately. - -<h3><a name="The rcu_head Structure"> -The <tt>rcu_head</tt> Structure</a></h3> - -<p>Each <tt>rcu_head</tt> structure represents an RCU callback. -These structures are normally embedded within RCU-protected data -structures whose algorithms use asynchronous grace periods. -In contrast, when using algorithms that block waiting for RCU grace periods, -RCU users need not provide <tt>rcu_head</tt> structures. - -</p><p>The <tt>rcu_head</tt> structure has fields as follows: - -<pre> - 1 struct rcu_head *next; - 2 void (*func)(struct rcu_head *head); -</pre> - -<p>The <tt>->next</tt> field is used -to link the <tt>rcu_head</tt> structures together in the -lists within the <tt>rcu_data</tt> structures. -The <tt>->func</tt> field is a pointer to the function -to be called when the callback is ready to be invoked, and -this function is passed a pointer to the <tt>rcu_head</tt> -structure. -However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt> -field to record the offset of the <tt>rcu_head</tt> -structure within the enclosing RCU-protected data structure. - -</p><p>Both of these fields are used internally by RCU. -From the viewpoint of RCU users, this structure is an -opaque “cookie”. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Given that the callback function <tt>->func</tt> - is passed a pointer to the <tt>rcu_head</tt> structure, - how is that function supposed to find the beginning of the - enclosing RCU-protected data structure? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - In actual practice, there is a separate callback function per - type of RCU-protected data structure. - The callback function can therefore use the <tt>container_of()</tt> - macro in the Linux kernel (or other pointer-manipulation facilities - in other software environments) to find the beginning of the - enclosing structure. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="RCU-Specific Fields in the task_struct Structure"> -RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3> - -<p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some -additional fields in the <tt>task_struct</tt> structure: - -<pre> - 1 #ifdef CONFIG_PREEMPT_RCU - 2 int rcu_read_lock_nesting; - 3 union rcu_special rcu_read_unlock_special; - 4 struct list_head rcu_node_entry; - 5 struct rcu_node *rcu_blocked_node; - 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ - 7 #ifdef CONFIG_TASKS_RCU - 8 unsigned long rcu_tasks_nvcsw; - 9 bool rcu_tasks_holdout; -10 struct list_head rcu_tasks_holdout_list; -11 int rcu_tasks_idle_cpu; -12 #endif /* #ifdef CONFIG_TASKS_RCU */ -</pre> - -<p>The <tt>->rcu_read_lock_nesting</tt> field records the -nesting level for RCU read-side critical sections, and -the <tt>->rcu_read_unlock_special</tt> field is a bitmask -that records special conditions that require <tt>rcu_read_unlock()</tt> -to do additional work. -The <tt>->rcu_node_entry</tt> field is used to form lists of -tasks that have blocked within preemptible-RCU read-side critical -sections and the <tt>->rcu_blocked_node</tt> field references -the <tt>rcu_node</tt> structure whose list this task is a member of, -or <tt>NULL</tt> if it is not blocked within a preemptible-RCU -read-side critical section. - -<p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of -voluntary context switches that this task had undergone at the -beginning of the current tasks-RCU grace period, -<tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU -grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt> -is a list element enqueuing this task on the holdout list, -and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this -idle task is running, but only if the task is currently running, -that is, if the CPU is currently idle. - -<h3><a name="Accessor Functions"> -Accessor Functions</a></h3> - -<p>The following listing shows the -<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt> and -<tt>rcu_for_each_leaf_node()</tt> function and macros: - -<pre> - 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) - 2 { - 3 return &rsp->node[0]; - 4 } - 5 - 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ - 7 for ((rnp) = &(rsp)->node[0]; \ - 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) - 9 - 10 #define rcu_for_each_leaf_node(rsp, rnp) \ - 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ - 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) -</pre> - -<p>The <tt>rcu_get_root()</tt> simply returns a pointer to the -first element of the specified <tt>rcu_state</tt> structure's -<tt>->node[]</tt> array, which is the root <tt>rcu_node</tt> -structure. - -</p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt> -macro takes advantage of the layout of the <tt>rcu_node</tt> -structures in the <tt>rcu_state</tt> structure's -<tt>->node[]</tt> array, performing a breadth-first traversal by -simply traversing the array in order. -Similarly, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only -the last part of the array, thus traversing only the leaf -<tt>rcu_node</tt> structures. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - What does - <tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree - contains only a single node? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - In the single-node case, - <tt>rcu_for_each_leaf_node()</tt> traverses the single node. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="Summary"> -Summary</a></h3> - -So the state of RCU is represented by an <tt>rcu_state</tt> structure, -which contains a combining tree of <tt>rcu_node</tt> and -<tt>rcu_data</tt> structures. -Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle -state is tracked by dynticks-related fields in the <tt>rcu_data</tt> structure. - -If you made it this far, you are well prepared to read the code -walkthroughs in the other articles in this series. - -<h3><a name="Acknowledgments"> -Acknowledgments</a></h3> - -I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul -Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn -for helping me get this document into a more human-readable state. - -<h3><a name="Legal Statement"> -Legal Statement</a></h3> - -<p>This work represents the view of the author and does not necessarily -represent the view of IBM. - -</p><p>Linux is a registered trademark of Linus Torvalds. - -</p><p>Other company, product, and service names may be trademarks or -service marks of others. - -</body></html> diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst new file mode 100644 index 000000000000..4a48e20a46f2 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst @@ -0,0 +1,1163 @@ +=================================================== +A Tour Through TREE_RCU's Data Structures [LWN.net] +=================================================== + +December 18, 2016 + +This article was contributed by Paul E. McKenney + +Introduction +============ + +This document describes RCU's major data structures and their relationship +to each other. + +Data-Structure Relationships +============================ + +RCU is for all intents and purposes a large state machine, and its +data structures maintain the state in such a way as to allow RCU readers +to execute extremely quickly, while also processing the RCU grace periods +requested by updaters in an efficient and extremely scalable fashion. +The efficiency and scalability of RCU updaters is provided primarily +by a combining tree, as shown below: + +.. kernel-figure:: BigTreeClassicRCU.svg + +This diagram shows an enclosing ``rcu_state`` structure containing a tree +of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up +to 16 ``rcu_data`` structures associated with it, so that there are +``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU. +This structure is adjusted at boot time, if needed, to handle the common +case where ``nr_cpu_ids`` is much less than ``NR_CPUs``. +For example, a number of Linux distributions set ``NR_CPUs=4096``, +which results in a three-level ``rcu_node`` tree. +If the actual hardware has only 16 CPUs, RCU will adjust itself +at boot time, resulting in an ``rcu_node`` tree with only a single node. + +The purpose of this combining tree is to allow per-CPU events +such as quiescent states, dyntick-idle transitions, +and CPU hotplug operations to be processed efficiently +and scalably. +Quiescent states are recorded by the per-CPU ``rcu_data`` structures, +and other events are recorded by the leaf-level ``rcu_node`` +structures. +All of these events are combined at each level of the tree until finally +grace periods are completed at the tree's root ``rcu_node`` +structure. +A grace period can be completed at the root once every CPU +(or, in the case of ``CONFIG_PREEMPT_RCU``, task) +has passed through a quiescent state. +Once a grace period has completed, record of that fact is propagated +back down the tree. + +As can be seen from the diagram, on a 64-bit system +a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout +of 64 at the root and a fanout of 16 at the leaves. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why isn't the fanout at the leaves also 64? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because there are more types of events that affect the leaf-level | +| ``rcu_node`` structures than further up the tree. Therefore, if the | +| leaf ``rcu_node`` structures have fanout of 64, the contention on | +| these structures' ``->structures`` becomes excessive. Experimentation | +| on a wide variety of systems has shown that a fanout of 16 works well | +| for the leaves of the ``rcu_node`` tree. | +| | +| Of course, further experience with systems having hundreds or | +| thousands of CPUs may demonstrate that the fanout for the non-leaf | +| ``rcu_node`` structures must also be reduced. Such reduction can be | +| easily carried out when and if it proves necessary. In the meantime, | +| if you are using such a system and running into contention problems | +| on the non-leaf ``rcu_node`` structures, you may use the | +| ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the | +| non-leaf fanout as needed. | +| | +| Kernels built for systems with strong NUMA characteristics might | +| also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of | +| the ``rcu_node`` structures align with hardware boundaries. | +| However, there has thus far been no need for this. | ++-----------------------------------------------------------------------+ + +If your system has more than 1,024 CPUs (or more than 512 CPUs on a +32-bit system), then RCU will automatically add more levels to the tree. +For example, if you are crazy enough to build a 64-bit system with +65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows: + +.. kernel-figure:: HugeTreeClassicRCU.svg + +RCU currently permits up to a four-level tree, which on a 64-bit system +accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for +32-bit systems. On the other hand, you can set both +``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as +2, which would result in a 16-CPU test using a 4-level tree. This can be +useful for testing large-system capabilities on small test machines. + +This multi-level combining tree allows us to get most of the performance +and scalability benefits of partitioning, even though RCU grace-period +detection is inherently a global operation. The trick here is that only +the last CPU to report a quiescent state into a given ``rcu_node`` +structure need advance to the ``rcu_node`` structure at the next level +up the tree. This means that at the leaf-level ``rcu_node`` structure, +only one access out of sixteen will progress up the tree. For the +internal ``rcu_node`` structures, the situation is even more extreme: +Only one access out of sixty-four will progress up the tree. Because the +vast majority of the CPUs do not progress up the tree, the lock +contention remains roughly constant up the tree. No matter how many CPUs +there are in the system, at most 64 quiescent-state reports per grace +period will progress all the way to the root ``rcu_node`` structure, +thus ensuring that the lock contention on that root ``rcu_node`` +structure remains acceptably low. + +In effect, the combining tree acts like a big shock absorber, keeping +lock contention under control at all tree levels regardless of the level +of loading on the system. + +RCU updaters wait for normal grace periods by registering RCU callbacks, +either directly via ``call_rcu()`` or indirectly via +``synchronize_rcu()`` and friends. RCU callbacks are represented by +``rcu_head`` structures, which are queued on ``rcu_data`` structures +while they are waiting for a grace period to elapse, as shown in the +following figure: + +.. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg + +This figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data +structures are related. Lesser data structures will be introduced with +the algorithms that make use of them. + +Note that each of the data structures in the above figure has its own +synchronization: + +#. Each ``rcu_state`` structures has a lock and a mutex, and some fields + are protected by the corresponding root ``rcu_node`` structure's lock. +#. Each ``rcu_node`` structure has a spinlock. +#. The fields in ``rcu_data`` are private to the corresponding CPU, + although a few can be read and written by other CPUs. + +It is important to note that different data structures can have very +different ideas about the state of RCU at any given time. For but one +example, awareness of the start or end of a given RCU grace period +propagates slowly through the data structures. This slow propagation is +absolutely necessary for RCU to have good read-side performance. If this +balkanized implementation seems foreign to you, one useful trick is to +consider each instance of these data structures to be a different +person, each having the usual slightly different view of reality. + +The general role of each of these data structures is as follows: + +#. ``rcu_state``: This structure forms the interconnection between the + ``rcu_node`` and ``rcu_data`` structures, tracks grace periods, + serves as short-term repository for callbacks orphaned by CPU-hotplug + events, maintains ``rcu_barrier()`` state, tracks expedited + grace-period state, and maintains state used to force quiescent + states when grace periods extend too long, +#. ``rcu_node``: This structure forms the combining tree that propagates + quiescent-state information from the leaves to the root, and also + propagates grace-period information from the root to the leaves. It + provides local copies of the grace-period state in order to allow + this information to be accessed in a synchronized manner without + suffering the scalability limitations that would otherwise be imposed + by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the + lists of tasks that have blocked while in their current RCU read-side + critical section. In ``CONFIG_PREEMPT_RCU`` with + ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node`` + priority-boosting kernel threads (kthreads) and state. Finally, it + records CPU-hotplug state in order to determine which CPUs should be + ignored during a given grace period. +#. ``rcu_data``: This per-CPU structure is the focus of quiescent-state + detection and RCU callback queuing. It also tracks its relationship + to the corresponding leaf ``rcu_node`` structure to allow + more-efficient propagation of quiescent states up the ``rcu_node`` + combining tree. Like the ``rcu_node`` structure, it provides a local + copy of the grace-period information to allow for-free synchronized + access to this information from the corresponding CPU. Finally, this + structure records past dyntick-idle state for the corresponding CPU + and also tracks statistics. +#. ``rcu_head``: This structure represents RCU callbacks, and is the + only structure allocated and managed by RCU users. The ``rcu_head`` + structure is normally embedded within the RCU-protected data + structure. + +If all you wanted from this article was a general notion of how RCU's +data structures are related, you are done. Otherwise, each of the +following sections give more details on the ``rcu_state``, ``rcu_node`` +and ``rcu_data`` data structures. + +The ``rcu_state`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_state`` structure is the base structure that represents the +state of RCU in the system. This structure forms the interconnection +between the ``rcu_node`` and ``rcu_data`` structures, tracks grace +periods, contains the lock used to synchronize with CPU-hotplug events, +and maintains state used to force quiescent states when grace periods +extend too long, + +A few of the ``rcu_state`` structure's fields are discussed, singly and +in groups, in the following sections. The more specialized fields are +covered in the discussion of their use. + +Relationship to rcu_node and rcu_data Structures +'''''''''''''''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 struct rcu_node node[NUM_RCU_NODES]; + 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; + 3 struct rcu_data __percpu *rda; + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! You said that the ``rcu_node`` structures formed a | +| tree, but they are declared as a flat array! What gives? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| The tree is laid out in the array. The first node In the array is the | +| head, the next set of nodes in the array are children of the head | +| node, and so on until the last set of nodes in the array are the | +| leaves. | +| See the following diagrams to see how this works. | ++-----------------------------------------------------------------------+ + +The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown +in the following figure: + +.. kernel-figure:: TreeMapping.svg + +One interesting consequence of this mapping is that a breadth-first +traversal of the tree is implemented as a simple linear scan of the +array, which is in fact what the ``rcu_for_each_node_breadth_first()`` +macro does. This macro is used at the beginning and ends of grace +periods. + +Each entry of the ``->level`` array references the first ``rcu_node`` +structure on the corresponding level of the tree, for example, as shown +below: + +.. kernel-figure:: TreeMappingLevel.svg + +The zero\ :sup:`th` element of the array references the root +``rcu_node`` structure, the first element references the first child of +the root ``rcu_node``, and finally the second element references the +first leaf ``rcu_node`` structure. + +For whatever it is worth, if you draw the tree to be tree-shaped rather +than array-shaped, it is easy to draw a planar representation: + +.. kernel-figure:: TreeLevel.svg + +Finally, the ``->rda`` field references a per-CPU pointer to the +corresponding CPU's ``rcu_data`` structure. + +All of these fields are constant once initialization is complete, and +therefore need no protection. + +Grace-Period Tracking +''''''''''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + +RCU grace periods are numbered, and the ``->gp_seq`` field contains the +current grace-period sequence number. The bottom two bits are the state +of the current grace period, which can be zero for not yet started or +one for in progress. In other words, if the bottom two bits of +``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom +two bits indicates that something is broken. This field is protected by +the root ``rcu_node`` structure's ``->lock`` field. + +There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data`` +structures as well. The fields in the ``rcu_state`` structure represent +the most current value, and those of the other structures are compared +in order to detect the beginnings and ends of grace periods in a +distributed fashion. The values flow from ``rcu_state`` to ``rcu_node`` +(down the tree from the root to the leaves) to ``rcu_data``. + +Miscellaneous +''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 unsigned long gp_max; + 2 char abbr; + 3 char *name; + +The ``->gp_max`` field tracks the duration of the longest grace period +in jiffies. It is protected by the root ``rcu_node``'s ``->lock``. + +The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU +(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”). +These fields are used for diagnostic and tracing purposes. + +The ``rcu_node`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_node`` structures form the combining tree that propagates +quiescent-state information from the leaves to the root and also that +propagates grace-period information from the root down to the leaves. +They provides local copies of the grace-period state in order to allow +this information to be accessed in a synchronized manner without +suffering the scalability limitations that would otherwise be imposed by +global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists +of tasks that have blocked while in their current RCU read-side critical +section. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they +manage the per-\ ``rcu_node`` priority-boosting kernel threads +(kthreads) and state. Finally, they record CPU-hotplug state in order to +determine which CPUs should be ignored during a given grace period. + +The ``rcu_node`` structure's fields are discussed, singly and in groups, +in the following sections. + +Connection to Combining Tree +'''''''''''''''''''''''''''' + +This portion of the ``rcu_node`` structure is declared as follows: + +:: + + 1 struct rcu_node *parent; + 2 u8 level; + 3 u8 grpnum; + 4 unsigned long grpmask; + 5 int grplo; + 6 int grphi; + +The ``->parent`` pointer references the ``rcu_node`` one level up in the +tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation +makes heavy use of this field to push quiescent states up the tree. The +``->level`` field gives the level in the tree, with the root being at +level zero, its children at level one, and so on. The ``->grpnum`` field +gives this node's position within the children of its parent, so this +number can range between 0 and 31 on 32-bit systems and between 0 and 63 +on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only +during initialization and for tracing. The ``->grpmask`` field is the +bitmask counterpart of ``->grpnum``, and therefore always has exactly +one bit set. This mask is used to clear the bit corresponding to this +``rcu_node`` structure in its parent's bitmasks, which are described +later. Finally, the ``->grplo`` and ``->grphi`` fields contain the +lowest and highest numbered CPU served by this ``rcu_node`` structure, +respectively. + +All of these fields are constant, and thus do not require any +synchronization. + +Synchronization +''''''''''''''' + +This field of the ``rcu_node`` structure is declared as follows: + +:: + + 1 raw_spinlock_t lock; + +This field is used to protect the remaining fields in this structure, +unless otherwise stated. That said, all of the fields in this structure +can be accessed without locking for tracing purposes. Yes, this can +result in confusing traces, but better some tracing confusion than to be +heisenbugged out of existence. + +.. _grace-period-tracking-1: + +Grace-Period Tracking +''''''''''''''''''''' + +This portion of the ``rcu_node`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + 2 unsigned long gp_seq_needed; + +The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of +the field of the same name in the ``rcu_state`` structure. They each may +lag up to one step behind their ``rcu_state`` counterpart. If the bottom +two bits of a given ``rcu_node`` structure's ``->gp_seq`` field is zero, +then this ``rcu_node`` structure believes that RCU is idle. + +The ``>gp_seq`` field of each ``rcu_node`` structure is updated at the +beginning and the end of each grace period. + +The ``->gp_seq_needed`` fields record the furthest-in-the-future grace +period request seen by the corresponding ``rcu_node`` structure. The +request is considered fulfilled when the value of the ``->gp_seq`` field +equals or exceeds that of the ``->gp_seq_needed`` field. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Suppose that this ``rcu_node`` structure doesn't see a request for a | +| very long time. Won't wrapping of the ``->gp_seq`` field cause | +| problems? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, because if the ``->gp_seq_needed`` field lags behind the | +| ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at | +| the end of the grace period. Modulo-arithmetic comparisons therefore | +| will always get the correct answer, even with wrapping. | ++-----------------------------------------------------------------------+ + +Quiescent-State Tracking +'''''''''''''''''''''''' + +These fields manage the propagation of quiescent states up the combining +tree. + +This portion of the ``rcu_node`` structure has fields as follows: + +:: + + 1 unsigned long qsmask; + 2 unsigned long expmask; + 3 unsigned long qsmaskinit; + 4 unsigned long expmaskinit; + +The ``->qsmask`` field tracks which of this ``rcu_node`` structure's +children still need to report quiescent states for the current normal +grace period. Such children will have a value of 1 in their +corresponding bit. Note that the leaf ``rcu_node`` structures should be +thought of as having ``rcu_data`` structures as their children. +Similarly, the ``->expmask`` field tracks which of this ``rcu_node`` +structure's children still need to report quiescent states for the +current expedited grace period. An expedited grace period has the same +conceptual properties as a normal grace period, but the expedited +implementation accepts extreme CPU overhead to obtain much lower +grace-period latency, for example, consuming a few tens of microseconds +worth of CPU time to reduce grace-period duration from milliseconds to +tens of microseconds. The ``->qsmaskinit`` field tracks which of this +``rcu_node`` structure's children cover for at least one online CPU. +This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is +used to initialize ``->expmask`` and the beginning of the normal and +expedited grace periods, respectively. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why are these bitmasks protected by locking? Come on, haven't you | +| heard of atomic instructions??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Lockless grace-period computation! Such a tantalizing possibility! | +| But consider the following sequence of events: | +| | +| #. CPU 0 has been in dyntick-idle mode for quite some time. When it | +| wakes up, it notices that the current RCU grace period needs it to | +| report in, so it sets a flag where the scheduling clock interrupt | +| will find it. | +| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and | +| notices that CPU 0 has been in dyntick idle mode, which qualifies | +| as an extended quiescent state. | +| #. CPU 0's scheduling clock interrupt fires in the middle of an RCU | +| read-side critical section, and notices that the RCU core needs | +| something, so commences RCU softirq processing. | +| #. CPU 0's softirq handler executes and is just about ready to report | +| its quiescent state up the ``rcu_node`` tree. | +| #. But CPU 1 beats it to the punch, completing the current grace | +| period and starting a new one. | +| #. CPU 0 now reports its quiescent state for the wrong grace period. | +| That grace period might now end before the RCU read-side critical | +| section. If that happens, disaster will ensue. | +| | +| So the locking is absolutely required in order to coordinate clearing | +| of the bits with updating of the grace-period sequence number in | +| ``->gp_seq``. | ++-----------------------------------------------------------------------+ + +Blocked-Task Management +''''''''''''''''''''''' + +``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU +read-side critical sections, and these tasks must be tracked explicitly. +The details of exactly why and how they are tracked will be covered in a +separate article on RCU read-side processing. For now, it is enough to +know that the ``rcu_node`` structure tracks them. + +:: + + 1 struct list_head blkd_tasks; + 2 struct list_head *gp_tasks; + 3 struct list_head *exp_tasks; + 4 bool wait_blkd_tasks; + +The ``->blkd_tasks`` field is a list header for the list of blocked and +preempted tasks. As tasks undergo context switches within RCU read-side +critical sections, their ``task_struct`` structures are enqueued (via +the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the +``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding +to the CPU on which the outgoing context switch executed. As these tasks +later exit their RCU read-side critical sections, they remove themselves +from the list. This list is therefore in reverse time order, so that if +one of the tasks is blocking the current grace period, all subsequent +tasks must also be blocking that same grace period. Therefore, a single +pointer into this list suffices to track all tasks blocking a given +grace period. That pointer is stored in ``->gp_tasks`` for normal grace +periods and in ``->exp_tasks`` for expedited grace periods. These last +two fields are ``NULL`` if either there is no grace period in flight or +if there are no blocked tasks preventing that grace period from +completing. If either of these two pointers is referencing a task that +removes itself from the ``->blkd_tasks`` list, then that task must +advance the pointer to the next task on the list, or set the pointer to +``NULL`` if there are no subsequent tasks on the list. + +For example, suppose that tasks T1, T2, and T3 are all hard-affinitied +to the largest-numbered CPU in the system. Then if task T1 blocked in an +RCU read-side critical section, then an expedited grace period started, +then task T2 blocked in an RCU read-side critical section, then a normal +grace period started, and finally task 3 blocked in an RCU read-side +critical section, then the state of the last leaf ``rcu_node`` +structure's blocked-task list would be as shown below: + +.. kernel-figure:: blkd_task.svg + +Task T1 is blocking both grace periods, task T2 is blocking only the +normal grace period, and task T3 is blocking neither grace period. Note +that these tasks will not remove themselves from this list immediately +upon resuming execution. They will instead remain on the list until they +execute the outermost ``rcu_read_unlock()`` that ends their RCU +read-side critical section. + +The ``->wait_blkd_tasks`` field indicates whether or not the current +grace period is waiting on a blocked task. + +Sizing the ``rcu_node`` Array +''''''''''''''''''''''''''''' + +The ``rcu_node`` array is sized via a series of C-preprocessor +expressions as follows: + +:: + + 1 #ifdef CONFIG_RCU_FANOUT + 2 #define RCU_FANOUT CONFIG_RCU_FANOUT + 3 #else + 4 # ifdef CONFIG_64BIT + 5 # define RCU_FANOUT 64 + 6 # else + 7 # define RCU_FANOUT 32 + 8 # endif + 9 #endif + 10 + 11 #ifdef CONFIG_RCU_FANOUT_LEAF + 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF + 13 #else + 14 # ifdef CONFIG_64BIT + 15 # define RCU_FANOUT_LEAF 64 + 16 # else + 17 # define RCU_FANOUT_LEAF 32 + 18 # endif + 19 #endif + 20 + 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) + 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) + 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) + 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) + 25 + 26 #if NR_CPUS <= RCU_FANOUT_1 + 27 # define RCU_NUM_LVLS 1 + 28 # define NUM_RCU_LVL_0 1 + 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 + 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } + 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } + 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } + 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } + 34 #elif NR_CPUS <= RCU_FANOUT_2 + 35 # define RCU_NUM_LVLS 2 + 36 # define NUM_RCU_LVL_0 1 + 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) + 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } + 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } + 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } + 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } + 43 #elif NR_CPUS <= RCU_FANOUT_3 + 44 # define RCU_NUM_LVLS 3 + 45 # define NUM_RCU_LVL_0 1 + 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) + 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) + 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } + 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } + 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } + 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } + 53 #elif NR_CPUS <= RCU_FANOUT_4 + 54 # define RCU_NUM_LVLS 4 + 55 # define NUM_RCU_LVL_0 1 + 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) + 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) + 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) + 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } + 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } + 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } + 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } + 64 #else + 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" + 66 #endif + +The maximum number of levels in the ``rcu_node`` structure is currently +limited to four, as specified by lines 21-24 and the structure of the +subsequent “if” statement. For 32-bit systems, this allows +16*32*32*32=524,288 CPUs, which should be sufficient for the next few +years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is +allowed, which should see us through the next decade or so. This +four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8`` +to support up to 4096 CPUs, which might be useful in very large systems +having eight CPUs per socket (but please note that no one has yet shown +any measurable performance degradation due to misaligned socket and +``rcu_node`` boundaries). In addition, building kernels with a full four +levels of ``rcu_node`` tree permits better testing of RCU's +combining-tree code. + +The ``RCU_FANOUT`` symbol controls how many children are permitted at +each non-leaf level of the ``rcu_node`` tree. If the +``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based +on the word size of the system, which is also the Kconfig default. + +The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by +each leaf ``rcu_node`` structure. Experience has shown that allowing a +given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the +number of bits in the ``->qsmask`` field on a 64-bit system, results in +excessive contention for the leaf ``rcu_node`` structures' ``->lock`` +fields. The number of CPUs per leaf ``rcu_node`` structure is therefore +limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If +``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based +on the word size of the system, just as for ``CONFIG_RCU_FANOUT``. +Lines 11-19 perform this computation. + +Lines 21-24 compute the maximum number of CPUs supported by a +single-level (which contains a single ``rcu_node`` structure), +two-level, three-level, and four-level ``rcu_node`` tree, respectively, +given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``. +These numbers of CPUs are retained in the ``RCU_FANOUT_1``, +``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor +variables, respectively. + +These variables are used to control the C-preprocessor ``#if`` statement +spanning lines 26-66 that computes the number of ``rcu_node`` structures +required for each level of the tree, as well as the number of levels +required. The number of levels is placed in the ``NUM_RCU_LVLS`` +C-preprocessor variable by lines 27, 35, 44, and 54. The number of +``rcu_node`` structures for the topmost level of the tree is always +exactly one, and this value is unconditionally placed into +``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels +(if any) of the ``rcu_node`` tree are computed by dividing the maximum +number of CPUs by the fanout supported by the number of levels from the +current level down, rounding up. This computation is performed by +lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create +initializers for lockdep lock-class names. Finally, lines 64-66 produce +an error if the maximum number of CPUs is too large for the specified +fanout. + +The ``rcu_segcblist`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_segcblist`` structure maintains a segmented list of callbacks +as follows: + +:: + + 1 #define RCU_DONE_TAIL 0 + 2 #define RCU_WAIT_TAIL 1 + 3 #define RCU_NEXT_READY_TAIL 2 + 4 #define RCU_NEXT_TAIL 3 + 5 #define RCU_CBLIST_NSEGS 4 + 6 + 7 struct rcu_segcblist { + 8 struct rcu_head *head; + 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; + 10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; + 11 long len; + 12 long len_lazy; + 13 }; + +The segments are as follows: + +#. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These + callbacks are ready to be invoked. +#. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace + period. Note that different CPUs can have different ideas about which + grace period is current, hence the ``->gp_seq`` field. +#. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period + to start. +#. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a + grace period. + +The ``->head`` pointer references the first callback or is ``NULL`` if +the list contains no callbacks (which is *not* the same as being empty). +Each element of the ``->tails[]`` array references the ``->next`` +pointer of the last callback in the corresponding segment of the list, +or the list's ``->head`` pointer if that segment and all previous +segments are empty. If the corresponding segment is empty but some +previous segment is not empty, then the array element is identical to +its predecessor. Older callbacks are closer to the head of the list, and +new callbacks are added at the tail. This relationship between the +``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown +in this diagram: + +.. kernel-figure:: nxtlist.svg + +In this figure, the ``->head`` pointer references the first RCU callback +in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the +``->head`` pointer itself, indicating that none of the callbacks is +ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references +callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2 +are both waiting on the current grace period, give or take possible +disagreements about exactly which grace period is the current one. The +``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU +callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that +there are no callbacks waiting on the next RCU grace period. The +``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next`` +pointer, indicating that all the remaining RCU callbacks have not yet +been assigned to an RCU grace period. Note that the +``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU +callback's ``->next`` pointer unless the callback list is empty, in +which case it references the ``->head`` pointer. + +There is one additional important special case for the +``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this +list is *disabled*. Lists are disabled when the corresponding CPU is +offline or when the corresponding CPU's callbacks are offloaded to a +kthread, both of which are described elsewhere. + +CPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the +``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the +``RCU_DONE_TAIL`` list segments as grace periods advance. + +The ``->gp_seq[]`` array records grace-period numbers corresponding to +the list segments. This is what allows different CPUs to have different +ideas as to which is the current grace period while still avoiding +premature invocation of their callbacks. In particular, this allows CPUs +that go idle for extended periods to determine which of their callbacks +are ready to be invoked after reawakening. + +The ``->len`` counter contains the number of callbacks in ``->head``, +and the ``->len_lazy`` contains the number of those callbacks that are +known to only free memory, and whose invocation can therefore be safely +deferred. + +.. important:: + + It is the ``->len`` field that determines whether or + not there are callbacks associated with this ``rcu_segcblist`` + structure, *not* the ``->head`` pointer. The reason for this is that all + the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL`` + segment) are extracted all at once at callback-invocation time + (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there + are no not-done callbacks remaining in the ``rcu_segcblist``. If + callback invocation must be postponed, for example, because a + high-priority process just woke up on this CPU, then the remaining + callbacks are placed back on the ``RCU_DONE_TAIL`` segment and + ``->head`` once again points to the start of the segment. In short, the + head field can briefly be ``NULL`` even though the CPU has callbacks + present the entire time. Therefore, it is not appropriate to test the + ``->head`` pointer for ``NULL``. + +In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only +after the corresponding callbacks have been invoked. This means that the +``->len`` count is zero only if the ``rcu_segcblist`` structure really +is devoid of callbacks. Of course, off-CPU sampling of the ``->len`` +count requires careful use of appropriate synchronization, for example, +memory barriers. This synchronization can be a bit subtle, particularly +in the case of ``rcu_barrier()``. + +The ``rcu_data`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The +fields in this structure may be accessed only from the corresponding CPU +(and from tracing) unless otherwise stated. This structure is the focus +of quiescent-state detection and RCU callback queuing. It also tracks +its relationship to the corresponding leaf ``rcu_node`` structure to +allow more-efficient propagation of quiescent states up the ``rcu_node`` +combining tree. Like the ``rcu_node`` structure, it provides a local +copy of the grace-period information to allow for-free synchronized +access to this information from the corresponding CPU. Finally, this +structure records past dyntick-idle state for the corresponding CPU and +also tracks statistics. + +The ``rcu_data`` structure's fields are discussed, singly and in groups, +in the following sections. + +Connection to Other Data Structures +''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 int cpu; + 2 struct rcu_node *mynode; + 3 unsigned long grpmask; + 4 bool beenonline; + +The ``->cpu`` field contains the number of the corresponding CPU and the +``->mynode`` field references the corresponding ``rcu_node`` structure. +The ``->mynode`` is used to propagate quiescent states up the combining +tree. These two fields are constant and therefore do not require +synchronization. + +The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask`` +corresponding to this ``rcu_data`` structure, and is also used when +propagating quiescent states. The ``->beenonline`` flag is set whenever +the corresponding CPU comes online, which means that the debugfs tracing +need not dump out any ``rcu_data`` structure for which this flag is not +set. + +Quiescent-State and Grace-Period Tracking +''''''''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + 2 unsigned long gp_seq_needed; + 3 bool cpu_no_qs; + 4 bool core_needs_qs; + 5 bool gpwrap; + +The ``->gp_seq`` field is the counterpart of the field of the same name +in the ``rcu_state`` and ``rcu_node`` structures. The +``->gp_seq_needed`` field is the counterpart of the field of the same +name in the rcu_node structure. They may each lag up to one behind their +``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and +``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in +dyntick-idle mode (but these counters will catch up upon exit from +dyntick-idle mode). If the lower two bits of a given ``rcu_data`` +structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure +believes that RCU is idle. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| All this replication of the grace period numbers can only cause | +| massive confusion. Why not just keep a global sequence number and be | +| done with it??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because if there was only a single global sequence numbers, there | +| would need to be a single global lock to allow safely accessing and | +| updating it. And if we are not going to have a single global lock, we | +| need to carefully manage the numbers on a per-node basis. Recall from | +| the answer to a previous Quick Quiz that the consequences of applying | +| a previously sampled quiescent state to the wrong grace period are | +| quite severe. | ++-----------------------------------------------------------------------+ + +The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed +through a quiescent state, while the ``->core_needs_qs`` flag indicates +that the RCU core needs a quiescent state from the corresponding CPU. +The ``->gpwrap`` field indicates that the corresponding CPU has remained +idle for so long that the ``gp_seq`` counter is in danger of overflow, +which will cause the CPU to disregard the values of its counters on its +next exit from idle. + +RCU Callback Handling +''''''''''''''''''''' + +In the absence of CPU-hotplug events, RCU callbacks are invoked by the +same CPU that registered them. This is strictly a cache-locality +optimization: callbacks can and do get invoked on CPUs other than the +one that registered them. After all, if the CPU that registered a given +callback has gone offline before the callback can be invoked, there +really is no other choice. + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 struct rcu_segcblist cblist; + 2 long qlen_last_fqs_check; + 3 unsigned long n_cbs_invoked; + 4 unsigned long n_nocbs_invoked; + 5 unsigned long n_cbs_orphaned; + 6 unsigned long n_cbs_adopted; + 7 unsigned long n_force_qs_snap; + 8 long blimit; + +The ``->cblist`` structure is the segmented callback list described +earlier. The CPU advances the callbacks in its ``rcu_data`` structure +whenever it notices that another RCU grace period has completed. The CPU +detects the completion of an RCU grace period by noticing that the value +of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of +its leaf ``rcu_node`` structure. Recall that each ``rcu_node`` +structure's ``->gp_seq`` field is updated at the beginnings and ends of +each grace period. + +The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the +forcing of quiescent states from ``call_rcu()`` and friends when +callback lists grow excessively long. + +The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted`` +fields count the number of callbacks invoked, sent to other CPUs when +this CPU goes offline, and received from other CPUs when those other +CPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's +callbacks are offloaded to a kthread. + +Finally, the ``->blimit`` counter is the maximum number of RCU callbacks +that may be invoked at a given time. + +Dyntick-Idle Handling +''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 int dynticks_snap; + 2 unsigned long dynticks_fqs; + +The ``->dynticks_snap`` field is used to take a snapshot of the +corresponding CPU's dyntick-idle state when forcing quiescent states, +and is therefore accessed from other CPUs. Finally, the +``->dynticks_fqs`` field is used to count the number of times this CPU +is determined to be in dyntick-idle state, and is used for tracing and +debugging purposes. + +This portion of the rcu_data structure is declared as follows: + +:: + + 1 long dynticks_nesting; + 2 long dynticks_nmi_nesting; + 3 atomic_t dynticks; + 4 bool rcu_need_heavy_qs; + 5 bool rcu_urgent_qs; + +These fields in the rcu_data structure maintain the per-CPU dyntick-idle +state for the corresponding CPU. The fields may be accessed only from +the corresponding CPU (and from tracing) unless otherwise stated. + +The ``->dynticks_nesting`` field counts the nesting depth of process +execution, so that in normal circumstances this counter has value zero +or one. NMIs, irqs, and tracers are counted by the +``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes +to this variable have to be undertaken carefully using an algorithm +provided by Andy Lutomirski. The initial transition from idle adds one, +and nested transitions add two, so that a nesting level of five is +represented by a ``->dynticks_nmi_nesting`` value of nine. This counter +can therefore be thought of as counting the number of reasons why this +CPU cannot be permitted to enter dyntick-idle mode, aside from +process-level transitions. + +However, it turns out that when running in non-idle kernel context, the +Linux kernel is fully capable of entering interrupt handlers that never +exit and perhaps also vice versa. Therefore, whenever the +``->dynticks_nesting`` field is incremented up from zero, the +``->dynticks_nmi_nesting`` field is set to a large positive number, and +whenever the ``->dynticks_nesting`` field is decremented down to zero, +the the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that +the number of misnested interrupts is not sufficient to overflow the +counter, this approach corrects the ``->dynticks_nmi_nesting`` field +every time the corresponding CPU enters the idle loop from process +context. + +The ``->dynticks`` field counts the corresponding CPU's transitions to +and from either dyntick-idle or user mode, so that this counter has an +even value when the CPU is in dyntick-idle mode or user mode and an odd +value otherwise. The transitions to/from user mode need to be counted +for user mode adaptive-ticks support (see timers/NO_HZ.txt). + +The ``->rcu_need_heavy_qs`` field is used to record the fact that the +RCU core code would really like to see a quiescent state from the +corresponding CPU, so much so that it is willing to call for +heavy-weight dyntick-counter operations. This flag is checked by RCU's +context-switch and ``cond_resched()`` code, which provide a momentary +idle sojourn in response. + +Finally, the ``->rcu_urgent_qs`` field is used to record the fact that +the RCU core code would really like to see a quiescent state from the +corresponding CPU, with the various other fields indicating just how +badly RCU wants this quiescent state. This flag is checked by RCU's +context-switch path (``rcu_note_context_switch``) and the cond_resched +code. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why not simply combine the ``->dynticks_nesting`` and | +| ``->dynticks_nmi_nesting`` counters into a single counter that just | +| counts the number of reasons that the corresponding CPU is non-idle? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because this would fail in the presence of interrupts whose handlers | +| never return and of handlers that manage to return from a made-up | +| interrupt. | ++-----------------------------------------------------------------------+ + +Additional fields are present for some special-purpose builds, and are +discussed separately. + +The ``rcu_head`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Each ``rcu_head`` structure represents an RCU callback. These structures +are normally embedded within RCU-protected data structures whose +algorithms use asynchronous grace periods. In contrast, when using +algorithms that block waiting for RCU grace periods, RCU users need not +provide ``rcu_head`` structures. + +The ``rcu_head`` structure has fields as follows: + +:: + + 1 struct rcu_head *next; + 2 void (*func)(struct rcu_head *head); + +The ``->next`` field is used to link the ``rcu_head`` structures +together in the lists within the ``rcu_data`` structures. The ``->func`` +field is a pointer to the function to be called when the callback is +ready to be invoked, and this function is passed a pointer to the +``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func`` +field to record the offset of the ``rcu_head`` structure within the +enclosing RCU-protected data structure. + +Both of these fields are used internally by RCU. From the viewpoint of +RCU users, this structure is an opaque “cookie”. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Given that the callback function ``->func`` is passed a pointer to | +| the ``rcu_head`` structure, how is that function supposed to find the | +| beginning of the enclosing RCU-protected data structure? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In actual practice, there is a separate callback function per type of | +| RCU-protected data structure. The callback function can therefore use | +| the ``container_of()`` macro in the Linux kernel (or other | +| pointer-manipulation facilities in other software environments) to | +| find the beginning of the enclosing structure. | ++-----------------------------------------------------------------------+ + +RCU-Specific Fields in the ``task_struct`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in +the ``task_struct`` structure: + +:: + + 1 #ifdef CONFIG_PREEMPT_RCU + 2 int rcu_read_lock_nesting; + 3 union rcu_special rcu_read_unlock_special; + 4 struct list_head rcu_node_entry; + 5 struct rcu_node *rcu_blocked_node; + 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ + 7 #ifdef CONFIG_TASKS_RCU + 8 unsigned long rcu_tasks_nvcsw; + 9 bool rcu_tasks_holdout; + 10 struct list_head rcu_tasks_holdout_list; + 11 int rcu_tasks_idle_cpu; + 12 #endif /* #ifdef CONFIG_TASKS_RCU */ + +The ``->rcu_read_lock_nesting`` field records the nesting level for RCU +read-side critical sections, and the ``->rcu_read_unlock_special`` field +is a bitmask that records special conditions that require +``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry`` +field is used to form lists of tasks that have blocked within +preemptible-RCU read-side critical sections and the +``->rcu_blocked_node`` field references the ``rcu_node`` structure whose +list this task is a member of, or ``NULL`` if it is not blocked within a +preemptible-RCU read-side critical section. + +The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context +switches that this task had undergone at the beginning of the current +tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current +tasks-RCU grace period is waiting on this task, +``->rcu_tasks_holdout_list`` is a list element enqueuing this task on +the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this +idle task is running, but only if the task is currently running, that +is, if the CPU is currently idle. + +Accessor Functions +~~~~~~~~~~~~~~~~~~ + +The following listing shows the ``rcu_get_root()``, +``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()`` +function and macros: + +:: + + 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) + 2 { + 3 return &rsp->node[0]; + 4 } + 5 + 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ + 7 for ((rnp) = &(rsp)->node[0]; \ + 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) + 9 + 10 #define rcu_for_each_leaf_node(rsp, rnp) \ + 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ + 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) + +The ``rcu_get_root()`` simply returns a pointer to the first element of +the specified ``rcu_state`` structure's ``->node[]`` array, which is the +root ``rcu_node`` structure. + +As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes +advantage of the layout of the ``rcu_node`` structures in the +``rcu_state`` structure's ``->node[]`` array, performing a breadth-first +traversal by simply traversing the array in order. Similarly, the +``rcu_for_each_leaf_node()`` macro traverses only the last part of the +array, thus traversing only the leaf ``rcu_node`` structures. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree | +| contains only a single node? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In the single-node case, ``rcu_for_each_leaf_node()`` traverses the | +| single node. | ++-----------------------------------------------------------------------+ + +Summary +~~~~~~~ + +So the state of RCU is represented by an ``rcu_state`` structure, which +contains a combining tree of ``rcu_node`` and ``rcu_data`` structures. +Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state +is tracked by dynticks-related fields in the ``rcu_data`` structure. If +you made it this far, you are well prepared to read the code +walkthroughs in the other articles in this series. + +Acknowledgments +~~~~~~~~~~~~~~~ + +I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul +Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for +helping me get this document into a more human-readable state. + +Legal Statement +~~~~~~~~~~~~~~~ + +This work represents the view of the author and does not necessarily +represent the view of IBM. + +Linux is a registered trademark of Linus Torvalds. + +Other company, product, and service names may be trademarks or service +marks of others. |