From f1387d770527b11c5467ed6b6b3d9c3e5aa12dd4 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 15 Jan 2017 15:18:22 -0800 Subject: doc: Synchronous RCU grace periods are now legal throughout boot This commit updates the "Early Boot" section of the RCU requirements to describe how synchronous RCU grace periods are now legal throughout the boot process. Signed-off-by: Paul E. McKenney --- .../RCU/Design/Requirements/Requirements.html | 81 +++++++++++++--------- 1 file changed, 47 insertions(+), 34 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 21593496aca6..999b3ed3444e 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -2154,7 +2154,8 @@ as will rcu_assign_pointer().

Although call_rcu() may be invoked at any time during boot, callbacks are not guaranteed to be invoked until after -the scheduler is fully up and running. +all of RCU's kthreads have been spawned, which occurs at +early_initcall() time. This delay in callback invocation is due to the fact that RCU does not invoke callbacks until it is fully initialized, and this full initialization cannot occur until after the scheduler has initialized itself to the @@ -2167,8 +2168,10 @@ on what operations those callbacks could invoke. Perhaps surprisingly, synchronize_rcu(), synchronize_rcu_bh() (discussed below), -and -synchronize_sched() +synchronize_sched(), +synchronize_rcu_expedited(), +synchronize_rcu_bh_expedited(), and +synchronize_sched_expedited() will all operate normally during very early boot, the reason being that there is only one CPU and preemption is disabled. @@ -2178,45 +2181,55 @@ state and thus a grace period, so the early-boot implementation can be a no-op.

-Both synchronize_rcu_bh() and synchronize_sched() -continue to operate normally through the remainder of boot, courtesy -of the fact that preemption is disabled across their RCU read-side -critical sections and also courtesy of the fact that there is still -only one CPU. -However, once the scheduler starts initializing, preemption is enabled. -There is still only a single CPU, but the fact that preemption is enabled -means that the no-op implementation of synchronize_rcu() no -longer works in CONFIG_PREEMPT=y kernels. -Therefore, as soon as the scheduler starts initializing, the early-boot -fastpath is disabled. -This means that synchronize_rcu() switches to its runtime -mode of operation where it posts callbacks, which in turn means that -any call to synchronize_rcu() will block until the corresponding -callback is invoked. -Unfortunately, the callback cannot be invoked until RCU's runtime -grace-period machinery is up and running, which cannot happen until -the scheduler has initialized itself sufficiently to allow RCU's -kthreads to be spawned. -Therefore, invoking synchronize_rcu() during scheduler -initialization can result in deadlock. +However, once the scheduler has spawned its first kthread, this early +boot trick fails for synchronize_rcu() (as well as for +synchronize_rcu_expedited()) in CONFIG_PREEMPT=y +kernels. +The reason is that an RCU read-side critical section might be preempted, +which means that a subsequent synchronize_rcu() really does have +to wait for something, as opposed to simply returning immediately. +Unfortunately, synchronize_rcu() can't do this until all of +its kthreads are spawned, which doesn't happen until +early_initcall() time. +But this is no excuse: RCU is nevertheless required to correctly handle +synchronous grace periods during this time period, which it currently does. +Once all of its kthreads are up and running, RCU starts running +normally.
 
Quick Quiz:
- So what happens with synchronize_rcu() during - scheduler initialization for CONFIG_PREEMPT=n - kernels? + How can RCU possibly handle grace periods before all of its + kthreads have been spawned???
Answer:
- In CONFIG_PREEMPT=n kernel, synchronize_rcu() - maps directly to synchronize_sched(). - Therefore, synchronize_rcu() works normally throughout - boot in CONFIG_PREEMPT=n kernels. - However, your code must also work in CONFIG_PREEMPT=y kernels, - so it is still necessary to avoid invoking synchronize_rcu() - during scheduler initialization. + Very carefully! + +

During the “dead zone” between the time that the + scheduler spawns the first task and the time that all of RCU's + kthreads have been spawned, all synchronous grace periods are + handled by the expedited grace-period mechanism. + At runtime, this expedited mechanism relies on workqueues, but + during the dead zone the requesting task itself drives the + desired expedited grace period. + Because dead-zone execution takes place within task context, + everything works. + Once the dead zone ends, expedited grace periods go back to + using workqueues, as is required to avoid problems that would + otherwise occur when a user task received a POSIX signal while + driving an expedited grace period. + +

And yes, this does mean that it is unhelpful to send POSIX + signals to random tasks between the time that the scheduler + spawns its first kthread and the time that RCU's kthreads + have all been spawned. + If there ever turns out to be a good reason for sending POSIX + signals during that time, appropriate adjustments will be made. + (If it turns out that POSIX signals are sent during this time for + no good reason, other adjustments will be made, appropriate + or otherwise.)

 
-- cgit From b4553f0cfea5ab5a02967e482bcafe7db6407afd Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 15 Jan 2017 16:12:09 -0800 Subject: doc: Add mid-boot operation to expedited grace periods This commit adds a description of how expedited grace periods operate during the mid-boot "dead zone", which starts when the scheduler spawns the first kthread and ends when all of RCU's kthreads have been spawned. In short, before mid-boot, synchronous grace periods can be a no-op. After the end of mid-boot, workqueues may be used. During mid-boot, the requesting task drives the expedited grace period. For more detail, see https://lwn.net/Articles/716148/. Signed-off-by: Paul E. McKenney --- .../Expedited-Grace-Periods.html | 47 ++++++++++++++++++++-- 1 file changed, 44 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html index 7a3194c5559a..e5d0bbd0230b 100644 --- a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html @@ -284,6 +284,7 @@ Expedited Grace Period Refinements Funnel locking and wait/wakeup.

  • Use of Workqueues.
  • Stall warnings. +
  • Mid-boot operation.

    Idle-CPU Checks

    @@ -524,7 +525,7 @@ their grace periods and carrying out their wakeups. In earlier implementations, the task requesting the expedited grace period also drove it to completion. This straightforward approach had the disadvantage of needing to -account for signals sent to user tasks, +account for POSIX signals sent to user tasks, so more recent implementations use the Linux kernel's workqueues. @@ -533,8 +534,8 @@ The requesting task still does counter snapshotting and funnel-lock processing, but the task reaching the top of the funnel lock does a schedule_work() (from _synchronize_rcu_expedited()) so that a workqueue kthread does the actual grace-period processing. -Because workqueue kthreads do not accept signals, grace-period-wait -processing need not allow for signals. +Because workqueue kthreads do not accept POSIX signals, grace-period-wait +processing need not allow for POSIX signals. In addition, this approach allows wakeups for the previous expedited grace period to be overlapped with processing for the next expedited @@ -586,6 +587,46 @@ blocking the current grace period are printed. Each stall warning results in another pass through the loop, but the second and subsequent passes use longer stall times. +

    Mid-boot operation

    + +

    +The use of workqueues has the advantage that the expedited +grace-period code need not worry about POSIX signals. +Unfortunately, it has the +corresponding disadvantage that workqueues cannot be used until +they are initialized, which does not happen until some time after +the scheduler spawns the first task. +Given that there are parts of the kernel that really do want to +execute grace periods during this mid-boot “dead zone”, +expedited grace periods must do something else during this time. + +

    +What they do is to fall back to the old practice of requiring that the +requesting task drive the expedited grace period, as was the case +before the use of workqueues. +However, the requesting task is only required to drive the grace period +during the mid-boot dead zone. +Before mid-boot, a synchronous grace period is a no-op. +Some time after mid-boot, workqueues are used. + +

    +Non-expedited non-SRCU synchronous grace periods must also operate +normally during mid-boot. +This is handled by causing non-expedited grace periods to take the +expedited code path during mid-boot. + +

    +The current code assumes that there are no POSIX signals during +the mid-boot dead zone. +However, if an overwhelming need for POSIX signals somehow arises, +appropriate adjustments can be made to the expedited stall-warning code. +One such adjustment would reinstate the pre-workqueue stall-warning +checks, but only during the mid-boot dead zone. + +

    +With this refinement, synchronous grace periods can now be used from +task context pretty much any time during the life of the kernel. +
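
Editorial illustration, not part of the patch: the three boot phases described in the added text can be modeled as a small dispatch function. This is a minimal userspace sketch under stated assumptions. The `sketch_` enum and function names are hypothetical and do not appear in the kernel, which tracks boot progress by other means.

```c
#include <assert.h>

/* Hypothetical boot phases, named for illustration only. */
enum sketch_boot_phase {
    SKETCH_EARLY_BOOT,  /* one CPU, no tasks yet spawned by the scheduler */
    SKETCH_MID_BOOT,    /* "dead zone": first task spawned, RCU kthreads
                         * and workqueues not yet available */
    SKETCH_RUNTIME,     /* all of RCU's kthreads have been spawned */
};

/* Hypothetical labels for how a synchronous grace period is handled. */
enum sketch_gp_path {
    SKETCH_GP_NOOP,               /* nothing to wait for */
    SKETCH_GP_REQUESTER_DRIVES,   /* requesting task drives an expedited GP */
    SKETCH_GP_WORKQUEUE,          /* expedited GP driven by a workqueue kthread */
    SKETCH_GP_NORMAL,             /* normal (non-expedited) GP machinery */
};

/*
 * Model of the text above: before mid-boot a synchronous grace period is
 * a no-op, during mid-boot every synchronous grace period (expedited or
 * not) takes the expedited path with the requester driving it, and after
 * mid-boot expedited grace periods use workqueues.
 */
static enum sketch_gp_path
sketch_sync_gp_path(enum sketch_boot_phase phase, int expedited)
{
    assert(phase >= SKETCH_EARLY_BOOT && phase <= SKETCH_RUNTIME);

    switch (phase) {
    case SKETCH_EARLY_BOOT:
        return SKETCH_GP_NOOP;
    case SKETCH_MID_BOOT:
        return SKETCH_GP_REQUESTER_DRIVES;
    default:
        return expedited ? SKETCH_GP_WORKQUEUE : SKETCH_GP_NORMAL;
    }
}
```

The point of the model is that the `expedited` argument only matters at runtime: in the dead zone, non-expedited requests are routed down the expedited path as well.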

    Summary

    -- cgit From 8e2a439753b1b708c6aa58249ab3cab8015597b1 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 8 Feb 2017 14:30:15 -0800 Subject: doc: Update stallwarn.txt to make causes more prominent This commit rearranges the Documentation/RCU/stallwarn.txt file to put the list of issues that can cause RCU CPU stall warnings near the beginning of the document. Signed-off-by: Paul E. McKenney --- Documentation/RCU/stallwarn.txt | 190 +++++++++++++++++++++------------------- 1 file changed, 100 insertions(+), 90 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index e93d04133fe7..96a3d81837e1 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -1,9 +1,102 @@ Using RCU's CPU Stall Detector -The rcu_cpu_stall_suppress module parameter enables RCU's CPU stall -detector, which detects conditions that unduly delay RCU grace periods. -This module parameter enables CPU stall detection by default, but -may be overridden via boot-time parameter or at runtime via sysfs. +This document first discusses what sorts of issues RCU's CPU stall +detector can locate, and then discusses kernel parameters and Kconfig +options that can be used to fine-tune the detector's operation. Finally, +this document explains the stall detector's "splat" format. + + +What Causes RCU CPU Stall Warnings? + +So your kernel printed an RCU CPU stall warning. The next question is +"What caused it?" The following problems can result in RCU CPU stall +warnings: + +o A CPU looping in an RCU read-side critical section. + +o A CPU looping with interrupts disabled. + +o A CPU looping with preemption disabled. This condition can + result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh + stalls. + +o A CPU looping with bottom halves disabled. This condition can + result in RCU-sched and RCU-bh stalls. 
+ +o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the + kernel without invoking schedule(). Note that cond_resched() + does not necessarily prevent RCU CPU stall warnings. Therefore, + if the looping in the kernel is really expected and desirable + behavior, you might need to replace some of the cond_resched() + calls with calls to cond_resched_rcu_qs(). + +o Booting Linux using a console connection that is too slow to + keep up with the boot-time console-message rate. For example, + a 115Kbaud serial console can be -way- too slow to keep up + with boot-time message rates, and will frequently result in + RCU CPU stall warning messages. Especially if you have added + debug printk()s. + +o Anything that prevents RCU's grace-period kthreads from running. + This can result in the "All QSes seen" console-log message. + This message will include information on when the kthread last + ran and how often it should be expected to run. + +o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might + happen to preempt a low-priority task in the middle of an RCU + read-side critical section. This is especially damaging if + that low-priority task is not permitted to run on any other CPU, + in which case the next RCU grace period can never complete, which + will eventually cause the system to run out of memory and hang. + While the system is in the process of running itself out of + memory, you might see stall-warning messages. + +o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that + is running at a higher priority than the RCU softirq threads. + This will prevent RCU callbacks from ever being invoked, + and in a CONFIG_PREEMPT_RCU kernel will further prevent + RCU grace periods from ever completing. Either way, the + system will eventually run out of memory and hang. In the + CONFIG_PREEMPT_RCU case, you might see stall-warning + messages. 
+ +o A hardware or software issue shuts off the scheduler-clock + interrupt on a CPU that is not in dyntick-idle mode. This + problem really has happened, and seems to be most likely to + result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels. + +o A bug in the RCU implementation. + +o A hardware failure. This is quite unlikely, but has occurred + at least once in real life. A CPU failed in a running system, + becoming unresponsive, but not causing an immediate crash. + This resulted in a series of RCU CPU stall warnings, eventually + leading to the realization that the CPU had failed. + +The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall +warnings. Note that SRCU does -not- have CPU stall warnings. Please note +that RCU only detects CPU stalls when there is a grace period in progress. +No grace period, no CPU stall warnings. + +To diagnose the cause of the stall, inspect the stack traces. +The offending function will usually be near the top of the stack. +If you have a series of stall warnings from a single extended stall, +comparing the stack traces can often help determine where the stall +is occurring, which will usually be in the function nearest the top of +that portion of the stack which remains the same from trace to trace. +If you can reliably trigger the stall, ftrace can be quite helpful. + +RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE +and with RCU's event tracing. For information on RCU's event tracing, +see include/trace/events/rcu.h. + + +Fine-Tuning the RCU CPU Stall Detector + +The rcupdate.rcu_cpu_stall_suppress module parameter disables RCU's +CPU stall detector, which detects conditions that unduly delay RCU grace +periods. CPU stall detection is enabled by default, but this default +may be overridden via boot-time parameter or at runtime via sysfs. 
The stall detector's idea of what constitutes "unduly delayed" is controlled by a set of kernel configuration variables and cpp macros: @@ -56,6 +149,9 @@ rcupdate.rcu_task_stall_timeout And continues with the output of sched_show_task() for each task stalling the current RCU-tasks grace period. + +Interpreting RCU's CPU Stall-Detector "Splats" + For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling, it will print a message similar to the following: @@ -178,89 +274,3 @@ grace period is in flight. It is entirely possible to see stall warnings from normal and from expedited grace periods at about the same time from the same run. - - -What Causes RCU CPU Stall Warnings? - -So your kernel printed an RCU CPU stall warning. The next question is -"What caused it?" The following problems can result in RCU CPU stall -warnings: - -o A CPU looping in an RCU read-side critical section. - -o A CPU looping with interrupts disabled. This condition can - result in RCU-sched and RCU-bh stalls. - -o A CPU looping with preemption disabled. This condition can - result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh - stalls. - -o A CPU looping with bottom halves disabled. This condition can - result in RCU-sched and RCU-bh stalls. - -o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the - kernel without invoking schedule(). Note that cond_resched() - does not necessarily prevent RCU CPU stall warnings. Therefore, - if the looping in the kernel is really expected and desirable - behavior, you might need to replace some of the cond_resched() - calls with calls to cond_resched_rcu_qs(). - -o Booting Linux using a console connection that is too slow to - keep up with the boot-time console-message rate. For example, - a 115Kbaud serial console can be -way- too slow to keep up - with boot-time message rates, and will frequently result in - RCU CPU stall warning messages. Especially if you have added - debug printk()s. 
- -o Anything that prevents RCU's grace-period kthreads from running. - This can result in the "All QSes seen" console-log message. - This message will include information on when the kthread last - ran and how often it should be expected to run. - -o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might - happen to preempt a low-priority task in the middle of an RCU - read-side critical section. This is especially damaging if - that low-priority task is not permitted to run on any other CPU, - in which case the next RCU grace period can never complete, which - will eventually cause the system to run out of memory and hang. - While the system is in the process of running itself out of - memory, you might see stall-warning messages. - -o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that - is running at a higher priority than the RCU softirq threads. - This will prevent RCU callbacks from ever being invoked, - and in a CONFIG_PREEMPT_RCU kernel will further prevent - RCU grace periods from ever completing. Either way, the - system will eventually run out of memory and hang. In the - CONFIG_PREEMPT_RCU case, you might see stall-warning - messages. - -o A hardware or software issue shuts off the scheduler-clock - interrupt on a CPU that is not in dyntick-idle mode. This - problem really has happened, and seems to be most likely to - result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels. - -o A bug in the RCU implementation. - -o A hardware failure. This is quite unlikely, but has occurred - at least once in real life. A CPU failed in a running system, - becoming unresponsive, but not causing an immediate crash. - This resulted in a series of RCU CPU stall warnings, eventually - leading the realization that the CPU had failed. - -The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall -warning. Note that SRCU does -not- have CPU stall warnings. 
Please note -that RCU only detects CPU stalls when there is a grace period in progress. -No grace period, no CPU stall warnings. - -To diagnose the cause of the stall, inspect the stack traces. -The offending function will usually be near the top of the stack. -If you have a series of stall warnings from a single extended stall, -comparing the stack traces can often help determine where the stall -is occurring, which will usually be in the function nearest the top of -that portion of the stack which remains the same from trace to trace. -If you can reliably trigger the stall, ftrace can be quite helpful. - -RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE -and with RCU's event tracing. For information on RCU's event tracing, -see include/trace/events/rcu.h. -- cgit From aa123a748ea552b18f0d4add823c29ddbddaf7b4 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 11 Apr 2017 09:17:08 -0700 Subject: doc: Update RCU data-structure documentation for rcu_segcblist The rcu_segcblist data structure, which contains segmented lists of RCU callbacks, was recently added. This commit updates the documentation accordingly. Signed-off-by: Paul E. McKenney --- .../Design/Data-Structures/Data-Structures.html | 207 ++++++++++++++------- .../RCU/Design/Data-Structures/nxtlist.svg | 34 ++-- 2 files changed, 156 insertions(+), 85 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html index d583c653a703..2ab38ee420c5 100644 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html @@ -19,6 +19,8 @@ to each other. The rcu_state Structure
  • The rcu_node Structure +
  • + The rcu_segcblist Structure
  • The rcu_data Structure
  • @@ -841,6 +843,134 @@ for lockdep lock-class names. Finally, lines 64-66 produce an error if the maximum number of CPUs is too large for the specified fanout. +

    +The rcu_segcblist Structure

    + +The rcu_segcblist structure maintains a segmented list of +callbacks as follows: + +
    + 1 #define RCU_DONE_TAIL        0
    + 2 #define RCU_WAIT_TAIL        1
    + 3 #define RCU_NEXT_READY_TAIL  2
    + 4 #define RCU_NEXT_TAIL        3
    + 5 #define RCU_CBLIST_NSEGS     4
    + 6
    + 7 struct rcu_segcblist {
    + 8   struct rcu_head *head;
    + 9   struct rcu_head **tails[RCU_CBLIST_NSEGS];
    +10   unsigned long gp_seq[RCU_CBLIST_NSEGS];
    +11   long len;
    +12   long len_lazy;
    +13 };
    +
    + +

    +The segments are as follows: + +

      +
    1. RCU_DONE_TAIL: Callbacks whose grace periods have elapsed. + These callbacks are ready to be invoked. +
    2. RCU_WAIT_TAIL: Callbacks that are waiting for the + current grace period. + Note that different CPUs can have different ideas about which + grace period is current, hence the ->gp_seq field. +
    3. RCU_NEXT_READY_TAIL: Callbacks waiting for the next + grace period to start. +
    4. RCU_NEXT_TAIL: Callbacks that have not yet been + associated with a grace period. +
    + +

    +The ->head pointer references the first callback or +is NULL if the list contains no callbacks (which is +not the same as being empty). +Each element of the ->tails[] array references the +->next pointer of the last callback in the corresponding +segment of the list, or the list's ->head pointer if +that segment and all previous segments are empty. +If the corresponding segment is empty but some previous segment is +not empty, then the array element is identical to its predecessor. +Older callbacks are closer to the head of the list, and new callbacks +are added at the tail. +This relationship between the ->head pointer, the +->tails[] array, and the callbacks is shown in this +diagram: + +

    nxtlist.svg + +

    In this figure, the ->head pointer references the +first +RCU callback in the list. +The ->tails[RCU_DONE_TAIL] array element references +the ->head pointer itself, indicating that none +of the callbacks is ready to invoke. +The ->tails[RCU_WAIT_TAIL] array element references callback +CB 2's ->next pointer, which indicates that +CB 1 and CB 2 are both waiting on the current grace period, +give or take possible disagreements about exactly which grace period +is the current one. +The ->tails[RCU_NEXT_READY_TAIL] array element +references the same RCU callback that ->tails[RCU_WAIT_TAIL] +does, which indicates that there are no callbacks waiting on the next +RCU grace period. +The ->tails[RCU_NEXT_TAIL] array element references +CB 4's ->next pointer, indicating that all the +remaining RCU callbacks have not yet been assigned to an RCU grace +period. +Note that the ->tails[RCU_NEXT_TAIL] array element +always references the last RCU callback's ->next pointer +unless the callback list is empty, in which case it references +the ->head pointer. + +

    +There is one additional important special case for the +->tails[RCU_NEXT_TAIL] array element: It can be NULL +when this list is disabled. +Lists are disabled when the corresponding CPU is offline or when +the corresponding CPU's callbacks are offloaded to a kthread, +both of which are described elsewhere. + +

    CPUs advance their callbacks from the +RCU_NEXT_TAIL to the RCU_NEXT_READY_TAIL to the +RCU_WAIT_TAIL to the RCU_DONE_TAIL list segments +as grace periods advance. + +
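
Editorial illustration, not part of the patch: the relationship between ->head, ->tails[], and the four segments can be sketched in userspace C. The structure layout and RCU_*_TAIL constants follow the listing above (minus ->len_lazy). Every `sketch_` function is a hypothetical helper, and `sketch_advance_one_gp()` is a deliberate simplification that ignores ->gp_seq[] and assumes not-yet-assigned callbacks are assigned to the next grace period.

```c
#include <assert.h>
#include <stddef.h>

#define RCU_DONE_TAIL        0
#define RCU_WAIT_TAIL        1
#define RCU_NEXT_READY_TAIL  2
#define RCU_NEXT_TAIL        3
#define RCU_CBLIST_NSEGS     4

struct rcu_head {
    struct rcu_head *next;
};

struct rcu_segcblist {
    struct rcu_head *head;
    struct rcu_head **tails[RCU_CBLIST_NSEGS];
    unsigned long gp_seq[RCU_CBLIST_NSEGS];
    long len;
};

/* Empty list: every ->tails[] element references ->head. */
static void sketch_init(struct rcu_segcblist *l)
{
    l->head = NULL;
    for (int i = 0; i < RCU_CBLIST_NSEGS; i++) {
        l->tails[i] = &l->head;
        l->gp_seq[i] = 0;
    }
    l->len = 0;
}

/* New callbacks are added at the tail, in the RCU_NEXT_TAIL segment. */
static void sketch_enqueue(struct rcu_segcblist *l, struct rcu_head *rhp)
{
    assert(l->tails[RCU_NEXT_TAIL] != NULL);  /* NULL means "disabled" */
    rhp->next = NULL;
    *l->tails[RCU_NEXT_TAIL] = rhp;
    l->tails[RCU_NEXT_TAIL] = &rhp->next;
    l->len++;
}

/* Count the callbacks in one segment by walking from the previous tail. */
static long sketch_seg_count(struct rcu_segcblist *l, int seg)
{
    struct rcu_head **pp = seg == RCU_DONE_TAIL ? &l->head
                                                : l->tails[seg - 1];
    long n = 0;

    while (pp != l->tails[seg]) {
        n++;
        pp = &(*pp)->next;
    }
    return n;
}

/* Simplified model of one grace period ending: each segment's callbacks
 * move one step toward RCU_DONE_TAIL. */
static void sketch_advance_one_gp(struct rcu_segcblist *l)
{
    for (int i = RCU_DONE_TAIL; i < RCU_NEXT_TAIL; i++)
        l->tails[i] = l->tails[i + 1];
}

/* Build the state shown in the figure: CB 1 and CB 2 waiting on the
 * current grace period, CB 3 and CB 4 not yet assigned to one. */
static void sketch_build_figure(struct rcu_segcblist *l, struct rcu_head cb[4])
{
    sketch_init(l);
    for (int i = 0; i < 4; i++)
        sketch_enqueue(l, &cb[i]);
    l->tails[RCU_WAIT_TAIL] = &cb[1].next;
    l->tails[RCU_NEXT_READY_TAIL] = &cb[1].next;
}
```

Because an empty segment's tail is identical to its predecessor's, `sketch_seg_count()` needs no special case for empty segments: the walk simply terminates immediately.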

    The ->gp_seq[] array records grace-period +numbers corresponding to the list segments. +This is what allows different CPUs to have different ideas as to +which is the current grace period while still avoiding premature +invocation of their callbacks. +In particular, this allows CPUs that go idle for extended periods +to determine which of their callbacks are ready to be invoked after +reawakening. + +

    The ->len counter contains the number of +callbacks in ->head, and the +->len_lazy contains the number of those callbacks that +are known to only free memory, and whose invocation can therefore +be safely deferred. + +

    Important note: It is the ->len field that +determines whether or not there are callbacks associated with +this rcu_segcblist structure, not the ->head +pointer. +The reason for this is that all the ready-to-invoke callbacks +(that is, those in the RCU_DONE_TAIL segment) are extracted +all at once at callback-invocation time. +If callback invocation must be postponed, for example, because a +high-priority process just woke up on this CPU, then the remaining +callbacks are placed back on the RCU_DONE_TAIL segment. +Either way, the ->len and ->len_lazy counts +are adjusted after the corresponding callbacks have been invoked, and so +again it is the ->len count that accurately reflects whether +or not there are callbacks associated with this rcu_segcblist +structure. +Of course, off-CPU sampling of the ->len count requires +the use of appropriate synchronization, for example, memory barriers. +This synchronization can be a bit subtle, particularly in the case +of rcu_barrier(). +
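
Editorial illustration, not part of the patch: the reason ->len rather than ->head is authoritative can be seen in a userspace model of callback extraction. This is a sketch only, with a trimmed-down structure (no ->gp_seq[] or ->len_lazy). All `sketch_` names are hypothetical, and the synchronization needed for off-CPU sampling of ->len is not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define RCU_DONE_TAIL     0
#define RCU_NEXT_TAIL     3
#define RCU_CBLIST_NSEGS  4

struct rcu_head {
    struct rcu_head *next;
    void (*func)(struct rcu_head *rhp);
};

struct rcu_segcblist {
    struct rcu_head *head;
    struct rcu_head **tails[RCU_CBLIST_NSEGS];
    long len;
};

static long sketch_invoked;  /* number of callbacks run so far */

static void sketch_count_cb(struct rcu_head *rhp)
{
    (void)rhp;
    sketch_invoked++;
}

static void sketch_init(struct rcu_segcblist *l)
{
    l->head = NULL;
    for (int i = 0; i < RCU_CBLIST_NSEGS; i++)
        l->tails[i] = &l->head;
    l->len = 0;
}

static void sketch_enqueue(struct rcu_segcblist *l, struct rcu_head *rhp,
                           void (*func)(struct rcu_head *rhp))
{
    rhp->next = NULL;
    rhp->func = func;
    *l->tails[RCU_NEXT_TAIL] = rhp;
    l->tails[RCU_NEXT_TAIL] = &rhp->next;
    l->len++;
}

/* Pretend that grace periods have elapsed for every queued callback. */
static void sketch_mark_all_done(struct rcu_segcblist *l)
{
    for (int i = RCU_DONE_TAIL; i < RCU_NEXT_TAIL; i++)
        l->tails[i] = l->tails[RCU_NEXT_TAIL];
}

/* Detach the whole RCU_DONE_TAIL segment at once. Note that ->len is
 * deliberately left alone: the callbacks have not yet been invoked. */
static struct rcu_head *sketch_extract_done(struct rcu_segcblist *l)
{
    struct rcu_head **done_tail = l->tails[RCU_DONE_TAIL];
    struct rcu_head *done = l->head;

    if (done_tail == &l->head)
        return NULL;            /* nothing is ready to invoke */
    l->head = *done_tail;       /* first not-yet-done callback, or NULL */
    *done_tail = NULL;          /* terminate the detached list */
    for (int i = 0; i < RCU_CBLIST_NSEGS; i++)
        if (l->tails[i] == done_tail)
            l->tails[i] = &l->head;
    return done;
}

/* Invoke the detached callbacks, and only then adjust ->len. */
static long sketch_invoke(struct rcu_segcblist *l, struct rcu_head *done)
{
    long n = 0;

    while (done) {
        struct rcu_head *next = done->next;

        done->func(done);
        done = next;
        n++;
    }
    assert(l->len >= n);
    l->len -= n;
    return n;
}
```

Between `sketch_extract_done()` and `sketch_invoke()`, ->head can be NULL while ->len is still nonzero, which is exactly the window in which testing ->head would give the wrong answer.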

    The rcu_data Structure

    @@ -983,62 +1113,18 @@ choice. as follows:
    - 1 struct rcu_head *nxtlist;
    - 2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
    - 3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
    - 4 long qlen_lazy;
    - 5 long qlen;
    - 6 long qlen_last_fqs_check;
    + 1 struct rcu_segcblist cblist;
    + 2 long qlen_last_fqs_check;
    + 3 unsigned long n_cbs_invoked;
    + 4 unsigned long n_nocbs_invoked;
    + 5 unsigned long n_cbs_orphaned;
    + 6 unsigned long n_cbs_adopted;
      7 unsigned long n_force_qs_snap;
    - 8 unsigned long n_cbs_invoked;
    - 9 unsigned long n_cbs_orphaned;
    -10 unsigned long n_cbs_adopted;
    -11 long blimit;
    + 8 long blimit;
     
    -

    The ->nxtlist pointer and the -->nxttail[] array form a four-segment list with -older callbacks near the head and newer ones near the tail. -Each segment contains callbacks with the corresponding relationship -to the current grace period. -The pointer out of the end of each of the four segments is referenced -by the element of the ->nxttail[] array indexed by -RCU_DONE_TAIL (for callbacks handled by a prior grace period), -RCU_WAIT_TAIL (for callbacks waiting on the current grace period), -RCU_NEXT_READY_TAIL (for callbacks that will wait on the next -grace period), and -RCU_NEXT_TAIL (for callbacks that are not yet associated -with a specific grace period) -respectively, as shown in the following figure. - -

    nxtlist.svg - -

    In this figure, the ->nxtlist pointer references the -first -RCU callback in the list. -The ->nxttail[RCU_DONE_TAIL] array element references -the ->nxtlist pointer itself, indicating that none -of the callbacks is ready to invoke. -The ->nxttail[RCU_WAIT_TAIL] array element references callback -CB 2's ->next pointer, which indicates that -CB 1 and CB 2 are both waiting on the current grace period. -The ->nxttail[RCU_NEXT_READY_TAIL] array element -references the same RCU callback that ->nxttail[RCU_WAIT_TAIL] -does, which indicates that there are no callbacks waiting on the next -RCU grace period. -The ->nxttail[RCU_NEXT_TAIL] array element references -CB 4's ->next pointer, indicating that all the -remaining RCU callbacks have not yet been assigned to an RCU grace -period. -Note that the ->nxttail[RCU_NEXT_TAIL] array element -always references the last RCU callback's ->next pointer -unless the callback list is empty, in which case it references -the ->nxtlist pointer. - -

    CPUs advance their callbacks from the -RCU_NEXT_TAIL to the RCU_NEXT_READY_TAIL to the -RCU_WAIT_TAIL to the RCU_DONE_TAIL list segments -as grace periods advance. +

    The ->cblist structure is the segmented callback list +described earlier. The CPU advances the callbacks in its rcu_data structure whenever it notices that another RCU grace period has completed. The CPU detects the completion of an RCU grace period by noticing @@ -1049,16 +1135,7 @@ Recall that each rcu_node structure's ->completed field is updated at the end of each grace period. -

    The ->nxtcompleted[] array records grace-period -numbers corresponding to the list segments. -This allows CPUs that go idle for extended periods to determine -which of their callbacks are ready to be invoked after reawakening. - -

    The ->qlen counter contains the number of -callbacks in ->nxtlist, and the -->qlen_lazy contains the number of those callbacks that -are known to only free memory, and whose invocation can therefore -be safely deferred. +

    The ->qlen_last_fqs_check and ->n_force_qs_snap coordinate the forcing of quiescent states from call_rcu() and friends when callback @@ -1069,6 +1146,10 @@ lists grow excessively long. fields count the number of callbacks invoked, sent to other CPUs when this CPU goes offline, and received from other CPUs when those other CPUs go offline. +The ->n_nocbs_invoked is used when the CPU's callbacks +are offloaded to a kthread. + +

    Finally, the ->blimit counter is the maximum number of RCU callbacks that may be invoked at a given time. diff --git a/Documentation/RCU/Design/Data-Structures/nxtlist.svg b/Documentation/RCU/Design/Data-Structures/nxtlist.svg index abc4cc73a097..0223e79c38e0 100644 --- a/Documentation/RCU/Design/Data-Structures/nxtlist.svg +++ b/Documentation/RCU/Design/Data-Structures/nxtlist.svg @@ -19,7 +19,7 @@ id="svg2" version="1.1" inkscape:version="0.48.4 r9939" - sodipodi:docname="nxtlist.fig"> + sodipodi:docname="segcblist.svg"> @@ -28,7 +28,7 @@ image/svg+xml - + @@ -241,61 +241,51 @@ xml:space="preserve" x="225" y="675" - fill="#000000" - font-family="Courier" font-style="normal" font-weight="bold" font-size="324" - text-anchor="start" - id="text64">nxtlist + id="text64" + style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">->head nxttail[RCU_DONE_TAIL] + id="text66" + style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">->tails[RCU_DONE_TAIL] nxttail[RCU_WAIT_TAIL] + id="text68" + style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">->tails[RCU_WAIT_TAIL] nxttail[RCU_NEXT_READY_TAIL] + id="text70" + style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">->tails[RCU_NEXT_READY_TAIL] nxttail[RCU_NEXT_TAIL] + id="text72" + style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">->tails[RCU_NEXT_TAIL] Date: Tue, 11 Apr 2017 10:17:23 -0700 Subject: doc: Update requirements based on recent changes These changes include lighter-weight expedited grace periods, the fact that expedited grace periods and rcu_barrier() no longer block CPU hotplug, some HTML font fixups, noting that rcu_barrier() need not wait for a grace period (even if callbacks are posted), the fact that SRCU read-side critical 
sections can be used from offline CPUs, and the fact that SRCU now maintains per-CPU callback lists. Signed-off-by: Paul E. McKenney --- .../RCU/Design/Requirements/Requirements.html | 120 ++++++++++++++++----- 1 file changed, 94 insertions(+), 26 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 999b3ed3444e..f60adf112663 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -659,8 +659,9 @@ systems with more than one CPU: In other words, a given instance of synchronize_rcu() can avoid waiting on a given RCU read-side critical section only if it can prove that synchronize_rcu() started first. + -

    +

    A related question is “When rcu_read_lock() doesn't generate any code, why does it matter how it relates to a grace period?” @@ -675,8 +676,9 @@ systems with more than one CPU: within the critical section, in which case none of the accesses within the critical section may observe the effects of any access following the grace period. + -

    +

    As of late 2016, mathematical models of RCU take this viewpoint, for example, see slides 62 and 63 of the @@ -1616,8 +1618,8 @@ CPUs should at least make reasonable forward progress. In return for its shorter latencies, synchronize_rcu_expedited() is permitted to impose modest degradation of real-time latency on non-idle online CPUs. -That said, it will likely be necessary to take further steps to reduce this -degradation, hopefully to roughly that of a scheduling-clock interrupt. +Here, “modest” means roughly the same latency +degradation as a scheduling-clock interrupt.

    There are a number of situations where even @@ -1913,12 +1915,9 @@ This requirement is another factor driving batching of grace periods, but it is also the driving force behind the checks for large numbers of queued RCU callbacks in the call_rcu() code path. Finally, high update rates should not delay RCU read-side critical -sections, although some read-side delays can occur when using +sections, although some small read-side delays can occur when using synchronize_rcu_expedited(), courtesy of this function's use -of try_stop_cpus(). -(In the future, synchronize_rcu_expedited() will be -converted to use lighter-weight inter-processor interrupts (IPIs), -but this will still disturb readers, though to a much smaller degree.) +of smp_call_function_single().

    Although all three of these corner cases were understood in the early @@ -2192,7 +2191,7 @@ Unfortunately, synchronize_rcu() can't do this until all of its kthreads are spawned, which doesn't happen until some time during early_initcalls() time. But this is no excuse: RCU is nevertheless required to correctly handle -synchronous grace periods during this time period, which it currently does. +synchronous grace periods during this time period. Once all of its kthreads are up and running, RCU starts running normally. @@ -2206,8 +2205,10 @@ normally. Answer: Very carefully! + -

    During the “dead zone” between the time that the +

    + During the “dead zone” between the time that the scheduler spawns the first task and the time that all of RCU's kthreads have been spawned, all synchronous grace periods are handled by the expedited grace-period mechanism. @@ -2220,8 +2221,10 @@ normally. using workqueues, as is required to avoid problems that would otherwise occur when a user task received a POSIX signal while driving an expedited grace period. + -

    And yes, this does mean that it is unhelpful to send POSIX +

    + And yes, this does mean that it is unhelpful to send POSIX signals to random tasks between the time that the scheduler spawns its first kthread and the time that RCU's kthreads have all been spawned. @@ -2308,12 +2311,61 @@ situation, and Dipankar Sarma incorporated rcu_barrier() into RCU. The need for rcu_barrier() for module unloading became apparent later. +

    +Important note: The rcu_barrier() function is not, +repeat, not, obligated to wait for a grace period. +It is instead only required to wait for RCU callbacks that have +already been posted. +Therefore, if there are no RCU callbacks posted anywhere in the system, +rcu_barrier() is within its rights to return immediately. +Even if there are callbacks posted, rcu_barrier() does not +necessarily need to wait for a grace period. + + + + + + + + +
     
    Quick Quiz:
    + Wait a minute! + Each RCU callback must wait for a grace period to complete, + and rcu_barrier() must wait for each pre-existing + callback to be invoked. + Doesn't rcu_barrier() therefore need to wait for + a full grace period if there is even one callback posted anywhere + in the system? +
    Answer:
    + Absolutely not!!! + + +

    + Yes, each RCU callback must wait for a grace period to complete, + but it might well be partly (or even completely) finished waiting + by the time rcu_barrier() is invoked. + In that case, rcu_barrier() need only wait for the + remaining portion of the grace period to elapse. + So even if there are quite a few callbacks posted, + rcu_barrier() might well return quite quickly. + + +

    + So if you need to wait for a grace period as well as for all + pre-existing callbacks, you will need to invoke both + synchronize_rcu() and rcu_barrier(). + If latency is a concern, you can always use workqueues + to invoke them concurrently. +

     
    +

    Hotplug CPU

    The Linux kernel supports CPU hotplug, which means that CPUs can come and go. -It is of course illegal to use any RCU API member from an offline CPU. +It is of course illegal to use any RCU API member from an offline CPU, +with the exception of SRCU read-side +critical sections. This requirement was present from day one in DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug implementation is “interesting.” @@ -2323,19 +2375,18 @@ The Linux-kernel CPU-hotplug implementation has notifiers that are used to allow the various kernel subsystems (including RCU) to respond appropriately to a given CPU-hotplug operation. Most RCU operations may be invoked from CPU-hotplug notifiers, -including even normal synchronous grace-period operations -such as synchronize_rcu(). -However, expedited grace-period operations such as -synchronize_rcu_expedited() are not supported, -due to the fact that current implementations block CPU-hotplug -operations, which could result in deadlock. +including even synchronous grace-period operations such as +synchronize_rcu() and synchronize_rcu_expedited().

    -In addition, all-callback-wait operations such as +However, all-callback-wait operations such as rcu_barrier() are also not supported, due to the fact that there are phases of CPU-hotplug operations where the outgoing CPU's callbacks will not be invoked until after the CPU-hotplug operation ends, which could also result in deadlock. +Furthermore, rcu_barrier() blocks CPU-hotplug operations +during its execution, which results in another type of deadlock +when invoked from a CPU-hotplug notifier.

    Scheduler and RCU

    @@ -2876,6 +2927,27 @@ It also motivates the smp_mb__after_srcu_read_unlock() API, which, in combination with srcu_read_unlock(), guarantees a full memory barrier. +

    +Also unlike other RCU flavors, SRCU's callbacks-wait function +srcu_barrier() may be invoked from CPU-hotplug notifiers, +though this is not necessarily a good idea. +The reason that this is possible is that SRCU is insensitive +to whether or not a CPU is online, which means that srcu_barrier() +need not exclude CPU-hotplug operations. + +

    +As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating +a locking bottleneck present in prior kernel versions. +Although this will allow users to put much heavier stress on +call_srcu(), it is important to note that SRCU does not +yet take any special steps to deal with callback flooding. +So if you are posting (say) 10,000 SRCU callbacks per second per CPU, +you are probably totally OK, but if you intend to post (say) 1,000,000 +SRCU callbacks per second per CPU, please run some tests first. +SRCU just might need a few adjustments to deal with that sort of load. +Of course, your mileage may vary based on the speed of your CPUs and +the size of your memory. +

    The SRCU API @@ -3034,8 +3106,8 @@ to do some redesign to avoid this scalability problem.

    RCU disables CPU hotplug in a few places, perhaps most notably in the -expedited grace-period and rcu_barrier() operations. -If there is a strong reason to use expedited grace periods in CPU-hotplug +rcu_barrier() operations. +If there is a strong reason to use rcu_barrier() in CPU-hotplug notifiers, it will be necessary to avoid disabling CPU hotplug. This would introduce some complexity, so there had better be a very good reason. @@ -3109,9 +3181,5 @@ Andy Lutomirski for their help in rendering this article human readable, and to Michelle Rankin for her support of this effort. Other contributions are acknowledged in the Linux kernel's git archive. -The cartoon is copyright (c) 2013 by Melissa Broussard, -and is provided -under the terms of the Creative Commons Attribution-Share Alike 3.0 -United States license. -- cgit From 066bb1c84aa430d15f36070471cbfe8976631cce Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 7 Mar 2017 07:30:58 -0800 Subject: doc: Update rcu_assign_pointer() definition in whatisRCU.txt The rcu_assign_pointer() macro has changed over time, and the version in Documentation/RCU/whatisRCU.txt has not kept up. This commit brings it into 2017, albeit in a simplified fashion. Reported-by: Andrea Parri Signed-off-by: Paul E. McKenney --- Documentation/RCU/whatisRCU.txt | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 5cbd8b2395b8..6b0337008f9c 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -587,20 +587,21 @@ It is extremely simple: write_unlock(&rcu_gp_mutex); } -[You can ignore rcu_assign_pointer() and rcu_dereference() without -missing much. But here they are anyway. And whatever you do, don't -forget about them when submitting patches making use of RCU!] 
- - #define rcu_assign_pointer(p, v) ({ \ - smp_wmb(); \ - (p) = (v); \ - }) - - #define rcu_dereference(p) ({ \ - typeof(p) _________p1 = p; \ - smp_read_barrier_depends(); \ - (_________p1); \ - }) +[You can ignore rcu_assign_pointer() and rcu_dereference() without missing +much. But here are simplified versions anyway. And whatever you do, +don't forget about them when submitting patches making use of RCU!] + + #define rcu_assign_pointer(p, v) \ + ({ \ + smp_store_release(&(p), (v)); \ + }) + + #define rcu_dereference(p) \ + ({ \ + typeof(p) _________p1 = p; \ + smp_read_barrier_depends(); \ + (_________p1); \ + }) The rcu_read_lock() and rcu_read_unlock() primitive read-acquire -- cgit From 93728af0a1f63e13d6f7f56a434965b05b8b2abd Mon Sep 17 00:00:00 2001 From: Michalis Kokologiannakis Date: Mon, 20 Mar 2017 22:38:35 +0100 Subject: doc: Update the comparisons rule in rcu_dereference.txt When an RCU-protected pointer is fetched but never dereferenced rcu_access_pointer() should be used in place of rcu_dereference(). This commit explicitly records this very fact in Documentation/ RCU/rcu_dereference.txt, in order to prevent the usage of rcu_dereference() in comparisons. Signed-off-by: Michalis Kokologiannakis Signed-off-by: Paul E. McKenney --- Documentation/RCU/rcu_dereference.txt | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'Documentation') diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt index c0bf2441a2ba..b2a613f16d74 100644 --- a/Documentation/RCU/rcu_dereference.txt +++ b/Documentation/RCU/rcu_dereference.txt @@ -138,6 +138,15 @@ o Be very careful about comparing pointers obtained from This sort of comparison occurs frequently when scanning RCU-protected circular linked lists. + Note that if checks for being within an RCU read-side + critical section are not required and the pointer is never + dereferenced, rcu_access_pointer() should be used in place + of rcu_dereference(). 
The rcu_access_pointer() primitive + does not require an enclosing read-side critical section, + and also omits the smp_read_barrier_depends() included in + rcu_dereference(), which in turn should provide a small + performance gain in some CPUs (e.g., the DEC Alpha). + o The comparison is against a pointer that references memory that was initialized "a long time ago." The reason this is safe is that even if misordering occurs, the -- cgit From d3d3a3ccc4a8f1f254fb6788081f35bebe374174 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 28 Mar 2017 19:57:45 -0700 Subject: doc: Emphasize that "toy" RCU requires recursive rwlock Reported-by: "yangzc@uit.com.cn" Signed-off-by: Paul E. McKenney --- Documentation/RCU/whatisRCU.txt | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 6b0337008f9c..8c131a1c62ea 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -562,7 +562,9 @@ This section presents a "toy" RCU implementation that is based on familiar locking primitives. Its overhead makes it a non-starter for real-life use, as does its lack of scalability. It is also unsuitable for realtime use, since it allows scheduling latency to "bleed" from -one read-side critical section to another. +one read-side critical section to another. It also assumes recursive +reader-writer locks: If you try this with non-recursive locks, and +you allow nested rcu_read_lock() calls, you can deadlock. However, it is probably the easiest implementation to relate to, so is a good starting point. -- cgit From b26cfc48e3e03126c183f1f3960e6d69460bb852 Mon Sep 17 00:00:00 2001 From: pierre Kuo Date: Fri, 7 Apr 2017 14:37:36 +0800 Subject: doc: Update control-dependencies section of memory-barriers.txt In the following example, if MAX is defined to be 1, then the compiler knows (Q % MAX) is equal to zero. 
The compiler can therefore throw away the "then" branch (and the "if"), retaining only the "else" branch. q = READ_ONCE(a); if (q % MAX) { WRITE_ONCE(b, 1); do_something(); } else { WRITE_ONCE(b, 2); do_something_else(); } It is therefore necessary to modify the example like this: q = READ_ONCE(a); - WRITE_ONCE(b, 1); + WRITE_ONCE(b, 2); do_something_else(); Signed-off-by: pierre Kuo Acked-by: Will Deacon Signed-off-by: Paul E. McKenney --- Documentation/memory-barriers.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index d2b0a8d81258..08329cb857ed 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -768,7 +768,7 @@ equal to zero, in which case the compiler is within its rights to transform the above code into the following: q = READ_ONCE(a); - WRITE_ONCE(b, 1); + WRITE_ONCE(b, 2); do_something_else(); Given this transformation, the CPU is not required to respect the ordering -- cgit From abb06b99484a9f5af05c7147c289faf835f68e8e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 26 Jan 2017 13:45:38 -0800 Subject: rcu: Pull rcu_sched_qs_mask into rcu_dynticks structure The rcu_sched_qs_mask variable is yet another isolated per-CPU variable, so this commit pulls it into the pre-existing rcu_dynticks per-CPU structure. Signed-off-by: Paul E. 
McKenney --- Documentation/RCU/Design/Data-Structures/Data-Structures.html | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html index d583c653a703..bf7f266e8888 100644 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html @@ -1104,6 +1104,7 @@ Its fields are as follows: 1 int dynticks_nesting; 2 int dynticks_nmi_nesting; 3 atomic_t dynticks; + 4 int rcu_sched_qs_mask;

    The ->dynticks_nesting field counts the @@ -1117,11 +1118,17 @@ NMIs are counted by the ->dynticks_nmi_nesting field, except that NMIs that interrupt non-dyntick-idle execution are not counted. -

    Finally, the ->dynticks field counts the corresponding +

    The ->dynticks field counts the corresponding CPU's transitions to and from dyntick-idle mode, so that this counter has an even value when the CPU is in dyntick-idle mode and an odd value otherwise. +
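    The even/odd convention described above for ->dynticks can be illustrated with a small user-space model. This is only a sketch: struct dynticks_model and every dynticks_model_*() helper are hypothetical names invented for illustration, C11 atomics stand in for the kernel's atomic_t, and the snapshot comparison is an assumption about how such a counter might be consumed rather than a copy of the kernel's code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the per-CPU ->dynticks counter. */
struct dynticks_model {
	atomic_uint dynticks;	/* even: dyntick-idle, odd: not idle */
};

/* Start out non-idle, so the counter begins at an odd value. */
static void dynticks_model_init(struct dynticks_model *dm)
{
	atomic_init(&dm->dynticks, 1);
}

/* Each transition into idle bumps the counter from odd to even. */
static void dynticks_model_idle_enter(struct dynticks_model *dm)
{
	atomic_fetch_add(&dm->dynticks, 1);
}

/* Each transition out of idle bumps the counter from even to odd. */
static void dynticks_model_idle_exit(struct dynticks_model *dm)
{
	atomic_fetch_add(&dm->dynticks, 1);
}

/* Take a snapshot of the counter for later comparison. */
static unsigned int dynticks_model_snap(struct dynticks_model *dm)
{
	return atomic_load(&dm->dynticks);
}

/* An even snapshot means the modeled CPU was in dyntick-idle mode. */
static bool dynticks_model_in_idle(unsigned int snap)
{
	return (snap & 1) == 0;
}

/*
 * True if the modeled CPU was idle when the snapshot was taken, or if
 * the counter has changed since, which implies at least one idle entry.
 */
static bool dynticks_model_qs_since(struct dynticks_model *dm,
				    unsigned int snap)
{
	return dynticks_model_in_idle(snap) || dynticks_model_snap(dm) != snap;
}
```

    In this model, an observer that took an odd snapshot and later sees any different value can conclude that the modeled CPU passed through idle in the meantime, without having to interrupt it.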

    Finally, the ->rcu_sched_qs_mask field is used +to record the fact that the RCU core code would really like to +see a quiescent state from the corresponding CPU. +This flag is checked by RCU's context-switch and cond_resched() +code, which provide a momentary idle sojourn in response. + -- cgit From 9577df9a3122af08fff84b8a1a60dccf524a3891 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 26 Jan 2017 16:18:07 -0800 Subject: rcu: Pull rcu_qs_ctr into rcu_dynticks structure The rcu_qs_ctr variable is yet another isolated per-CPU variable, so this commit pulls it into the pre-existing rcu_dynticks per-CPU structure. Signed-off-by: Paul E. McKenney --- .../RCU/Design/Data-Structures/Data-Structures.html | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html index bf7f266e8888..3d0311657533 100644 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html @@ -1105,6 +1105,7 @@ Its fields are as follows: 2 int dynticks_nmi_nesting; 3 atomic_t dynticks; 4 int rcu_sched_qs_mask; + 5 unsigned long rcu_qs_ctr;

    The ->dynticks_nesting field counts the @@ -1123,12 +1124,19 @@ CPU's transitions to and from dyntick-idle mode, so that this counter has an even value when the CPU is in dyntick-idle mode and an odd value otherwise. -

    Finally, the ->rcu_sched_qs_mask field is used +

    The ->rcu_sched_qs_mask field is used to record the fact that the RCU core code would really like to -see a quiescent state from the corresponding CPU. +see a quiescent state from the corresponding CPU, so much so that +it is willing to call for heavy-weight dyntick-counter operations. This flag is checked by RCU's context-switch and cond_resched() code, which provide a momentary idle sojourn in response. +

    Finally the ->rcu_qs_ctr field is used to record +quiescent states from cond_resched(). +Because cond_resched() can execute quite frequently, this +must be quite lightweight, as in a non-atomic increment of this +per-CPU field. +

     
    Quick Quiz:
    -- cgit From 0f9be8cabbc343218dd2807af7308656be113045 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 27 Jan 2017 13:17:02 -0800 Subject: rcu: Eliminate flavor scan in rcu_momentary_dyntick_idle() The rcu_momentary_dyntick_idle() function scans the RCU flavors, checking that one of them still needs a quiescent state before doing an expensive atomic operation on the ->dynticks counter. However, this check reduces overhead only after a rare race condition, and increases complexity. This commit therefore removes the scan and the mechanism enabling the scan. Signed-off-by: Paul E. McKenney --- Documentation/RCU/Design/Data-Structures/Data-Structures.html | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html index 3d0311657533..e4bf20a68fa3 100644 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html @@ -1104,7 +1104,7 @@ Its fields are as follows: 1 int dynticks_nesting; 2 int dynticks_nmi_nesting; 3 atomic_t dynticks; - 4 int rcu_sched_qs_mask; + 4 bool rcu_need_heavy_qs; 5 unsigned long rcu_qs_ctr; @@ -1124,7 +1124,7 @@ CPU's transitions to and from dyntick-idle mode, so that this counter has an even value when the CPU is in dyntick-idle mode and an odd value otherwise. -

    The ->rcu_sched_qs_mask field is used +

    The ->rcu_need_heavy_qs field is used to record the fact that the RCU core code would really like to see a quiescent state from the corresponding CPU, so much so that it is willing to call for heavy-weight dyntick-counter operations. -- cgit From 9226b10d78ffe7895549045fe388dc5e73b87eac Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 27 Jan 2017 14:17:50 -0800 Subject: rcu: Place guard on rcu_all_qs() and rcu_note_context_switch() actions The rcu_all_qs() and rcu_note_context_switch() do a series of checks, taking various actions to supply RCU with quiescent states, depending on the outcomes of the various checks. This is a bit much for scheduling fastpaths, so this commit creates a separate ->rcu_urgent_qs field in the rcu_dynticks structure that acts as a global guard for these checks. Thus, in the common case, rcu_all_qs() and rcu_note_context_switch() check the ->rcu_urgent_qs field, find it false, and simply return. Signed-off-by: Paul E. McKenney Cc: Peter Zijlstra --- Documentation/RCU/Design/Data-Structures/Data-Structures.html | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html index e4bf20a68fa3..4dec89097559 100644 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html @@ -1106,6 +1106,7 @@ Its fields are as follows: 3 atomic_t dynticks; 4 bool rcu_need_heavy_qs; 5 unsigned long rcu_qs_ctr; + 6 bool rcu_urgent_qs;

    The ->dynticks_nesting field counts the @@ -1131,12 +1132,20 @@ it is willing to call for heavy-weight dyntick-counter operations. This flag is checked by RCU's context-switch and cond_resched() code, which provide a momentary idle sojourn in response. -

    Finally the ->rcu_qs_ctr field is used to record +

    The ->rcu_qs_ctr field is used to record quiescent states from cond_resched(). Because cond_resched() can execute quite frequently, this must be quite lightweight, as in a non-atomic increment of this per-CPU field. +

    Finally, the ->rcu_urgent_qs field is used to record +the fact that the RCU core code would really like to see a quiescent +state from the corresponding CPU, with the various other fields indicating +just how badly RCU wants this quiescent state. +This flag is checked by RCU's context-switch and cond_resched() +code, which, if nothing else, non-atomically increment ->rcu_qs_ctr +in response. +
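    The guard role of ->rcu_urgent_qs described above can be sketched as a user-space model. The field names ->rcu_need_heavy_qs, ->rcu_qs_ctr, and ->rcu_urgent_qs come from the text; struct rcu_dynticks_model, model_all_qs(), and the heavy_qs_count field are hypothetical, and clearing the flags on the slow path is an assumption made for illustration, not a statement about the kernel's actual code.

```c
#include <stdbool.h>

/* Hypothetical user-space model of three rcu_dynticks fields. */
struct rcu_dynticks_model {
	bool rcu_need_heavy_qs;		/* RCU wants a heavy-weight QS */
	unsigned long rcu_qs_ctr;	/* lightweight QS counter */
	bool rcu_urgent_qs;		/* guard checked on the fastpath */
	unsigned long heavy_qs_count;	/* hypothetical: counts heavy QSes */
};

/*
 * Model of a scheduling-fastpath hook.  In the common case the guard
 * is false and the function returns after a single check.
 */
static void model_all_qs(struct rcu_dynticks_model *rdp)
{
	if (!rdp->rcu_urgent_qs)
		return;			/* common case: nothing to do */

	rdp->rcu_urgent_qs = false;	/* assumption: request is consumed */

	if (rdp->rcu_need_heavy_qs) {
		rdp->rcu_need_heavy_qs = false;
		/* Stands in for the momentary idle sojourn. */
		rdp->heavy_qs_count++;
	}

	/* If nothing else, record a lightweight quiescent state. */
	rdp->rcu_qs_ctr++;		/* non-atomic, per-CPU in the text */
}
```

    The design point being modeled is that the other fields are consulted only after the single guard has been found set, which keeps the common-case cost to one load and one branch.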

     
    Quick Quiz:
    -- cgit From 5f0d5a3ae7cff0d7fa943c199c3a2e44f23e1fac Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 18 Jan 2017 02:53:44 -0800 Subject: mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU A group of Linux kernel hackers reported chasing a bug that resulted from their assumption that SLAB_DESTROY_BY_RCU provided an existence guarantee, that is, that no block from such a slab would be reallocated during an RCU read-side critical section. Of course, that is not the case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire slab of blocks. However, there is a phrase for this, namely "type safety". This commit therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order to avoid future instances of this sort of confusion. Signed-off-by: Paul E. McKenney Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Acked-by: Johannes Weiner Acked-by: Vlastimil Babka [ paulmck: Add comments mentioning the old name, as requested by Eric Dumazet, in order to help people familiar with the old name find the new one. 
] Acked-by: David Rientjes --- Documentation/RCU/00-INDEX | 2 +- Documentation/RCU/rculist_nulls.txt | 6 +++--- Documentation/RCU/whatisRCU.txt | 3 ++- 3 files changed, 6 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX index f773a264ae02..1672573b037a 100644 --- a/Documentation/RCU/00-INDEX +++ b/Documentation/RCU/00-INDEX @@ -17,7 +17,7 @@ rcu_dereference.txt rcubarrier.txt - RCU and Unloadable Modules rculist_nulls.txt - - RCU list primitives for use with SLAB_DESTROY_BY_RCU + - RCU list primitives for use with SLAB_TYPESAFE_BY_RCU rcuref.txt - Reference-count design for elements of lists/arrays protected by RCU rcu.txt diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt index 18f9651ff23d..8151f0195f76 100644 --- a/Documentation/RCU/rculist_nulls.txt +++ b/Documentation/RCU/rculist_nulls.txt @@ -1,5 +1,5 @@ Using hlist_nulls to protect read-mostly linked lists and -objects using SLAB_DESTROY_BY_RCU allocations. +objects using SLAB_TYPESAFE_BY_RCU allocations. Please read the basics in Documentation/RCU/listRCU.txt @@ -7,7 +7,7 @@ Using special makers (called 'nulls') is a convenient way to solve following problem : A typical RCU linked list managing objects which are -allocated with SLAB_DESTROY_BY_RCU kmem_cache can +allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can use following algos : 1) Lookup algo @@ -96,7 +96,7 @@ unlock_chain(); // typically a spin_unlock() 3) Remove algo -------------- Nothing special here, we can use a standard RCU hlist deletion. 
-But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused +But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused very very fast (before the end of RCU grace period) if (put_last_reference_on(obj) { diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 5cbd8b2395b8..91c912e86915 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -925,7 +925,8 @@ d. Do you need RCU grace periods to complete even in the face e. Is your workload too update-intensive for normal use of RCU, but inappropriate for other synchronization mechanisms? - If so, consider SLAB_DESTROY_BY_RCU. But please be careful! + If so, consider SLAB_TYPESAFE_BY_RCU (which was originally + named SLAB_DESTROY_BY_RCU). But please be careful! f. Do you need read-side critical sections that are respected even though they are in the middle of the idle loop, during -- cgit From 22607d66bbc3e81140d3bcf08894f4378eb36428 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 25 Apr 2017 14:03:11 -0700 Subject: srcu: Specify auto-expedite holdoff time On small systems, in the absence of readers, expedited SRCU grace periods can complete in less than a microsecond. This means that an eight-CPU system can have all CPUs doing synchronize_srcu() in a tight loop and almost always expedite. This might actually be desirable in some situations, but in general it is a good way to needlessly burn CPU cycles. And in those situations where it is desirable, your friend is the function synchronize_srcu_expedited(). For other situations, this commit adds a kernel parameter that specifies a holdoff between completing the last SRCU grace period and auto-expediting the next. If the next grace period starts before the holdoff expires, auto-expediting is disabled. The holdoff is 50 microseconds by default, and can be tuned to the desired number of nanoseconds. A value of zero disables auto-expediting. Signed-off-by: Paul E. 
McKenney Tested-by: Mike Galbraith --- Documentation/admin-guide/kernel-parameters.txt | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index facc20a3f962..4a4b9266c4de 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3779,6 +3779,14 @@ spia_pedr= spia_peddr= + srcutree.exp_holdoff [KNL] + Specifies how many nanoseconds must elapse + since the end of the last SRCU grace period for + a given srcu_struct until the next normal SRCU + grace period will be considered for automatic + expediting. Set to zero to disable automatic + expediting. + stacktrace [FTRACE] Enabled the stack tracer on boot up. -- cgit
     
    Quick Quiz: