14 files changed, 667 insertions, 180 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 95b30fa25d56..62e847bcdcdd 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2080,6 +2080,8 @@ Some of the relevant points of interest are as follows:
 <li>	<a href="#Scheduler and RCU">Scheduler and RCU</a>.
 <li>	<a href="#Tracing and RCU">Tracing and RCU</a>.
 <li>	<a href="#Energy Efficiency">Energy Efficiency</a>.
+<li>	<a href="#Scheduling-Clock Interrupts and RCU">
+	Scheduling-Clock Interrupts and RCU</a>.
 <li>	<a href="#Memory Efficiency">Memory Efficiency</a>.
 <li>	<a href="#Performance, Scalability, Response Time, and Reliability">
 	Performance, Scalability, Response Time, and Reliability</a>.
@@ -2532,6 +2534,134 @@ I learned of many of these requirements via angry phone calls:
 Flaming me on the Linux-kernel mailing list was apparently not
 sufficient to fully vent their ire at RCU's energy-efficiency bugs!
 
+<h3><a name="Scheduling-Clock Interrupts and RCU">
+Scheduling-Clock Interrupts and RCU</a></h3>
+
+<p>
+The kernel transitions between in-kernel non-idle execution, userspace
+execution, and the idle loop.
+Depending on kernel configuration, RCU handles these states differently:
+
+<table border=3>
+<tr><th><tt>HZ</tt> Kconfig</th>
+	<th>In-Kernel</th>
+		<th>Usermode</th>
+			<th>Idle</th></tr>
+<tr><th align="left"><tt>HZ_PERIODIC</tt></th>
+	<td>Can rely on scheduling-clock interrupt.</td>
+		<td>Can rely on scheduling-clock interrupt and its
+		    detection of interrupt from usermode.</td>
+			<td>Can rely on RCU's dyntick-idle detection.</td></tr>
+<tr><th align="left"><tt>NO_HZ_IDLE</tt></th>
+	<td>Can rely on scheduling-clock interrupt.</td>
+		<td>Can rely on scheduling-clock interrupt and its
+		    detection of interrupt from usermode.</td>
+			<td>Can rely on RCU's dyntick-idle detection.</td></tr>
+<tr><th align="left"><tt>NO_HZ_FULL</tt></th>
+	<td>Can only sometimes rely on scheduling-clock interrupt.
+	    In other cases, it is necessary to bound kernel execution
+	    times and/or use IPIs.</td>
+		<td>Can rely on RCU's dyntick-idle detection.</td>
+			<td>Can rely on RCU's dyntick-idle detection.</td></tr>
+</table>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the
+	scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt>
+	and <tt>NO_HZ_IDLE</tt> do?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	Because, as a performance optimization, <tt>NO_HZ_FULL</tt>
+	does not necessarily re-enable the scheduling-clock interrupt
+	on entry to each and every system call.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+However, RCU must be reliably informed as to whether any given
+CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>,
+also whether that CPU is executing in usermode, as discussed
+<a href="#Energy Efficiency">earlier</a>.
+It also requires that the scheduling-clock interrupt be enabled when
+RCU needs it to be:
+
+<ol>
+<li>	If a CPU is either idle or executing in usermode, and RCU believes
+	it is non-idle, the scheduling-clock tick had better be running.
+	Otherwise, you will get RCU CPU stall warnings.  Or at best,
+	very long (11-second) grace periods, with a pointless IPI waking
+	the CPU from time to time.
+<li>	If a CPU is in a portion of the kernel that executes RCU read-side
+	critical sections, and RCU believes this CPU to be idle, you will get
+	random memory corruption.  <b>DON'T DO THIS!!!</b>
+
+	<br>This is one reason to test with lockdep, which will complain
+	about this sort of thing.
+<li>	If a CPU is in a portion of the kernel that is absolutely
+	positively no-joking guaranteed to never execute any RCU read-side
+	critical sections, and RCU believes this CPU to be idle,
+	no problem.  This sort of thing is used by some architectures
+	for light-weight exception handlers, which can then avoid the
+	overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt>
+	at exception entry and exit, respectively.
+	Some go further and avoid the entireties of <tt>irq_enter()</tt>
+	and <tt>irq_exit()</tt>.
+
+	<br>Just make very sure you are running some of your tests with
+	<tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths
+	was in fact joking about not doing RCU read-side critical sections.
+<li>	If a CPU is executing in the kernel with the scheduling-clock
+	interrupt disabled and RCU believes this CPU to be non-idle,
+	and if the CPU goes idle (from an RCU perspective) every few
+	jiffies, no problem.  It is usually OK for there to be the
+	occasional gap between idle periods of up to a second or so.
+
+	<br>If the gap grows too long, you get RCU CPU stall warnings.
+<li>	If a CPU is either idle or executing in usermode, and RCU believes
+	it to be idle, of course no problem.
+<li>	If a CPU is executing in the kernel, the kernel code
+	path is passing through quiescent states at a reasonable
+	frequency (preferably about once per few jiffies, but the
+	occasional excursion to a second or so is usually OK) and the
+	scheduling-clock interrupt is enabled, of course no problem.
+
+	<br>If the gap between a successive pair of quiescent states grows
+	too long, you get RCU CPU stall warnings.
+</ol>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	But what if my driver has a hardware interrupt handler
+	that can run for many seconds?
+	I cannot invoke <tt>schedule()</tt> from a hardware
+	interrupt handler, after all!
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt>
+	every so often.
+	But given that long-running interrupt handlers can cause
+	other problems, not least for response time, shouldn't you
+	work to keep your interrupt handler's runtime within reasonable
+	bounds?
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+But as long as RCU is properly informed of kernel state transitions between
+in-kernel execution, usermode execution, and idle, and as long as the
+scheduling-clock interrupt is enabled when RCU needs it to be, you
+can rest assured that the bugs you encounter will be in some other
+part of RCU or some other part of the kernel!
+
 <h3><a name="Memory Efficiency">Memory Efficiency</a></h3>
 
 <p>
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 6beda556faf3..49747717d905 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -23,6 +23,14 @@ over a rather long period of time, but improvements are always welcome!
 	Yet another exception is where the low real-time latency of RCU's
 	read-side primitives is critically important.
 
+	One final exception is where RCU readers are used to prevent
+	the ABA problem (https://en.wikipedia.org/wiki/ABA_problem)
+	for lockless updates.  This does result in the mildly
+	counter-intuitive situation where rcu_read_lock() and
+	rcu_read_unlock() are used to protect updates; however, this
+	approach provides the same potential simplifications that garbage
+	collectors do.
+
 1.	Does the update code have proper mutual exclusion?
 
 	RCU does allow -readers- to run (almost) naked, but -writers- must
@@ -40,7 +48,9 @@ over a rather long period of time, but improvements are always welcome!
 	explain how this single task does not become a major bottleneck on
 	big multiprocessor machines (for example, if the task is updating
 	information relating to itself that other tasks can read, there
-	by definition can be no bottleneck).
+	by definition can be no bottleneck).  Note that the definition
+	of "large" has changed significantly:  Eight CPUs was "large"
+	in the year 2000, but a hundred CPUs was unremarkable in 2017.
 
 2.	Do the RCU read-side critical sections make proper use of
 	rcu_read_lock() and friends?  These primitives are needed
@@ -55,6 +65,12 @@ over a rather long period of time, but improvements are always welcome!
 	Disabling of preemption can serve as rcu_read_lock_sched(), but
 	is less readable.
 
+	Letting RCU-protected pointers "leak" out of an RCU read-side
+	critical section is every bit as bad as letting them leak out
+	from under a lock.  Unless, of course, you have arranged some
+	other means of protection, such as a lock or a reference count
+	-before- letting them out of the RCU read-side critical section.
+
 3.	Does the update code tolerate concurrent accesses?
 
 	The whole point of RCU is to permit readers to run without
@@ -78,10 +94,10 @@ over a rather long period of time, but improvements are always welcome!
 
 		This works quite well, also.
 
-	c.	Make updates appear atomic to readers.  For example,
+	c.	Make updates appear atomic to readers.	For example,
 		pointer updates to properly aligned fields will
 		appear atomic, as will individual atomic primitives.
-		Sequences of perations performed under a lock will -not-
+		Sequences of operations performed under a lock will -not-
 		appear to be atomic to RCU readers, nor will sequences
 		of multiple atomic primitives.
 
@@ -168,8 +184,8 @@ over a rather long period of time, but improvements are always welcome!
 
 5.	If call_rcu(), or a related primitive such as call_rcu_bh(),
 	call_rcu_sched(), or call_srcu() is used, the callback function
-	must be written to be called from softirq context.  In particular,
-	it cannot block.
+	will be called from softirq context.  In particular, it cannot
+	block.
 
 6.	Since synchronize_rcu() can block, it cannot be called from
 	any sort of irq context.  The same rule applies for
@@ -178,11 +194,14 @@ over a rather long period of time, but improvements are always welcome!
 	synchronize_sched_expedite(), and synchronize_srcu_expedited().
 
 	The expedited forms of these primitives have the same semantics
-	as the non-expedited forms, but expediting is both expensive
-	and unfriendly to real-time workloads.	Use of the expedited
-	primitives should be restricted to rare configuration-change
-	operations that would not normally be undertaken while a real-time
-	workload is running.
+	as the non-expedited forms, but expediting is both expensive and
+	(with the exception of synchronize_srcu_expedited()) unfriendly
+	to real-time workloads.  Use of the expedited primitives should
+	be restricted to rare configuration-change operations that would
+	not normally be undertaken while a real-time workload is running.
+	However, real-time workloads can use the rcupdate.rcu_normal kernel
+	boot parameter to completely disable expedited grace periods,
+	though this might have performance implications.
 
 	In particular, if you find yourself invoking one of the expedited
 	primitives repeatedly in a loop, please do everyone a favor:
@@ -193,11 +212,6 @@ over a rather long period of time, but improvements are always welcome!
 	of the system, especially to real-time workloads running on
 	the rest of the system.
 
-	In addition, it is illegal to call the expedited forms from
-	a CPU-hotplug notifier, or while holding a lock that is acquired
-	by a CPU-hotplug notifier.  Failing to observe this restriction
-	will result in deadlock.
-
 7.	If the updater uses call_rcu() or synchronize_rcu(), then the
 	corresponding readers must use rcu_read_lock() and
 	rcu_read_unlock().  If the updater uses call_rcu_bh() or
@@ -321,7 +335,7 @@ over a rather long period of time, but improvements are always welcome!
 	Similarly, disabling preemption is not an acceptable substitute
 	for rcu_read_lock().  Code that attempts to use preemption
 	disabling where it should be using rcu_read_lock() will break
-	in real-time kernel builds.
+	in CONFIG_PREEMPT=y kernel builds.
 
 	If you want to wait for interrupt handlers, NMI handlers, and
 	code under the influence of preempt_disable(), you instead
@@ -356,23 +370,22 @@ over a rather long period of time, but improvements are always welcome!
 	not the case, a self-spawning RCU callback would prevent the
 	victim CPU from ever going offline.)
 
-14.	SRCU (srcu_read_lock(), srcu_read_unlock(), srcu_dereference(),
-	synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu())
-	may only be invoked from process context.  Unlike other forms of
-	RCU, it -is- permissible to block in an SRCU read-side critical
-	section (demarked by srcu_read_lock() and srcu_read_unlock()),
-	hence the "SRCU": "sleepable RCU".  Please note that if you
-	don't need to sleep in read-side critical sections, you should be
-	using RCU rather than SRCU, because RCU is almost always faster
-	and easier to use than is SRCU.
-
-	Also unlike other forms of RCU, explicit initialization
-	and cleanup is required via init_srcu_struct() and
-	cleanup_srcu_struct().	These are passed a "struct srcu_struct"
-	that defines the scope of a given SRCU domain.	Once initialized,
-	the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock()
-	synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu().
-	A given synchronize_srcu() waits only for SRCU read-side critical
+14.	Unlike other forms of RCU, it -is- permissible to block in an
+	SRCU read-side critical section (demarked by srcu_read_lock()
+	and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
+	Please note that if you don't need to sleep in read-side critical
+	sections, you should be using RCU rather than SRCU, because RCU
+	is almost always faster and easier to use than is SRCU.
+
+	Also unlike other forms of RCU, explicit initialization and
+	cleanup is required either at build time via DEFINE_SRCU()
+	or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct()
+	and cleanup_srcu_struct().  These last two are passed a
+	"struct srcu_struct" that defines the scope of a given
+	SRCU domain.  Once initialized, the srcu_struct is passed
+	to srcu_read_lock(), srcu_read_unlock(), synchronize_srcu(),
+	synchronize_srcu_expedited(), and call_srcu().	A given
+	synchronize_srcu() waits only for SRCU read-side critical
 	sections governed by srcu_read_lock() and srcu_read_unlock()
 	calls that have been passed the same srcu_struct.  This property
 	is what makes sleeping read-side critical sections tolerable --
@@ -390,10 +403,16 @@ over a rather long period of time, but improvements are always welcome!
 	Therefore, SRCU should be used in preference to rw_semaphore
 	only in extremely read-intensive situations, or in situations
 	requiring SRCU's read-side deadlock immunity or low read-side
-	realtime latency.
+	realtime latency.  You should also consider percpu_rw_semaphore
+	when you need lightweight readers.
 
-	Note that, rcu_assign_pointer() relates to SRCU just as it does
-	to other forms of RCU.
+	SRCU's expedited primitive (synchronize_srcu_expedited())
+	never sends IPIs to other CPUs, so it is easier on
+	real-time workloads than is synchronize_rcu_expedited(),
+	synchronize_rcu_bh_expedited() or synchronize_sched_expedited().
+
+	Note that rcu_dereference() and rcu_assign_pointer() relate to
+	SRCU just as they do to other forms of RCU.
 
 15.	The whole point of call_rcu(), synchronize_rcu(), and friends
 	is to wait until all pre-existing readers have finished before
@@ -435,3 +454,33 @@ over a rather long period of time, but improvements are always welcome!
 
 	These debugging aids can help you find problems that are
 	otherwise extremely difficult to spot.
+
+18.	If you register a callback using call_rcu(), call_rcu_bh(),
+	call_rcu_sched(), or call_srcu(), and pass in a function defined
+	within a loadable module, then it is necessary to wait for
+	all pending callbacks to be invoked after the last invocation
+	and before unloading that module.  Note that it is absolutely
+	-not- sufficient to wait for a grace period!  The current (say)
+	synchronize_rcu() implementation waits only for all previous
+	callbacks registered on the CPU that synchronize_rcu() is running
+	on, but it is -not- guaranteed to wait for callbacks registered
+	on other CPUs.
+
+	You instead need to use one of the barrier functions:
+
+	o	call_rcu() -> rcu_barrier()
+	o	call_rcu_bh() -> rcu_barrier_bh()
+	o	call_rcu_sched() -> rcu_barrier_sched()
+	o	call_srcu() -> srcu_barrier()
+
+	However, these barrier functions are absolutely -not- guaranteed
+	to wait for a grace period.  In fact, if there are no call_rcu()
+	callbacks waiting anywhere in the system, rcu_barrier() is within
+	its rights to return immediately.
+
+	So if you need to wait for both an RCU grace period and for
+	all pre-existing call_rcu() callbacks, you will need to execute
+	both rcu_barrier() and synchronize_rcu(), if necessary, using
+	something like workqueues to execute them concurrently.
+
+	See rcubarrier.txt for more information.
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt
index 745f429fda79..7d4ae110c2c9 100644
--- a/Documentation/RCU/rcu.txt
+++ b/Documentation/RCU/rcu.txt
@@ -76,15 +76,12 @@ o	I hear that RCU is patented?  What is with that?
 	Of these, one was allowed to lapse by the assignee, and the
 	others have been contributed to the Linux kernel under GPL.
 	There are now also LGPL implementations of user-level RCU
-	available (http://lttng.org/?q=node/18).
+	available (http://liburcu.org/).
 
 o	I hear that RCU needs work in order to support realtime kernels?
 
-	This work is largely completed.  Realtime-friendly RCU can be
-	enabled via the CONFIG_PREEMPT_RCU kernel configuration
-	parameter.  However, work is in progress for enabling priority
-	boosting of preempted RCU read-side critical sections.	This is
-	needed if you have CPU-bound realtime threads.
+	Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
+	kernel configuration parameter.
 
 o	Where can I find more information on RCU?
 
diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
index b2a613f16d74..1acb26b09b48 100644
--- a/Documentation/RCU/rcu_dereference.txt
+++ b/Documentation/RCU/rcu_dereference.txt
@@ -25,35 +25,35 @@ o	You must use one of the rcu_dereference() family of primitives
 	for an example where the compiler can in fact deduce the exact
 	value of the pointer, and thus cause misordering.
 
+o	You are only permitted to use rcu_dereference() on pointer values.
+	The compiler simply knows too much about integral values to
+	trust it to carry dependencies through integer operations.
+	There are a very few exceptions, namely that you can temporarily
+	cast the pointer to uintptr_t in order to:
+
+	o	Set bits and clear bits down in the must-be-zero low-order
+		bits of that pointer.  This clearly means that the pointer
+		must have alignment constraints, for example, this does
+		-not- work in general for char* pointers.
+
+	o	XOR bits to translate pointers, as is done in some
+		classic buddy-allocator algorithms.
+
+	It is important to cast the value back to pointer before
+	doing much of anything else with it.
+
 o	Avoid cancellation when using the "+" and "-" infix arithmetic
 	operators.  For example, for a given variable "x", avoid
-	"(x-x)".  There are similar arithmetic pitfalls from other
-	arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)".
-	The compiler is within its rights to substitute zero for all of
-	these expressions, so that subsequent accesses no longer depend
-	on the rcu_dereference(), again possibly resulting in bugs due
-	to misordering.
+	"(x-(uintptr_t)x)" for char* pointers.	The compiler is within its
+	rights to substitute zero for this sort of expression, so that
+	subsequent accesses no longer depend on the rcu_dereference(),
+	again possibly resulting in bugs due to misordering.
 
 	Of course, if "p" is a pointer from rcu_dereference(), and "a"
 	and "b" are integers that happen to be equal, the expression
 	"p+a-b" is safe because its value still necessarily depends on
 	the rcu_dereference(), thus maintaining proper ordering.
 
-o	Avoid all-zero operands to the bitwise "&" operator, and
-	similarly avoid all-ones operands to the bitwise "|" operator.
-	If the compiler is able to deduce the value of such operands,
-	it is within its rights to substitute the corresponding constant
-	for the bitwise operation.  Once again, this causes subsequent
-	accesses to no longer depend on the rcu_dereference(), causing
-	bugs due to misordering.
-
-	Please note that single-bit operands to bitwise "&" can also
-	be dangerous.  At this point, the compiler knows that the
-	resulting value can only take on one of two possible values.
-	Therefore, a very small amount of additional information will
-	allow the compiler to deduce the exact value, which again can
-	result in misordering.
-
 o	If you are using RCU to protect JITed functions, so that the
 	"()" function-invocation operator is applied to a value obtained
 	(directly or indirectly) from rcu_dereference(), you may need to
@@ -61,25 +61,6 @@ o	If you are using RCU to protect JITed functions, so that the
 	This issue arises on some systems when a newly JITed function is
 	using the same memory that was used by an earlier JITed function.
 
-o	Do not use the results from the boolean "&&" and "||" when
-	dereferencing.	For example, the following (rather improbable)
-	code is buggy:
-
-		int *p;
-		int *q;
-
-		...
-
-		p = rcu_dereference(gp)
-		q = &global_q;
-		q += p != &oom_p1 && p != &oom_p2;
-		r1 = *q;  /* BUGGY!!! */
-
-	The reason this is buggy is that "&&" and "||" are often compiled
-	using branches.  While weak-memory machines such as ARM or PowerPC
-	do order stores after such branches, they can speculate loads,
-	which can result in misordering bugs.
-
 o	Do not use the results from relational operators ("==", "!=",
 	">", ">=", "<", or "<=") when dereferencing.  For example,
 	the following (quite strange) code is buggy:
diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt
index b10cfe711e68..5d7759071a3e 100644
--- a/Documentation/RCU/rcubarrier.txt
+++ b/Documentation/RCU/rcubarrier.txt
@@ -263,6 +263,11 @@ Quick Quiz #2: What happens if CPU 0's rcu_barrier_func() executes
 	are delayed for a full grace period? Couldn't this result in
 	rcu_barrier() returning prematurely?
 
+The current rcu_barrier() implementation is more complex, due to the need
+to avoid disturbing idle CPUs (especially on battery-powered systems)
+and the need to minimally disturb non-idle CPUs in real-time systems.
+However, the code above illustrates the concepts.
+
 
 rcu_barrier() Summary
 
diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt
index 278f6a9383b6..55918b54808b 100644
--- a/Documentation/RCU/torture.txt
+++ b/Documentation/RCU/torture.txt
@@ -276,15 +276,17 @@ o	"Free-Block Circulation": Shows the number of torture structures
 	somehow gets incremented farther than it should.
 
 Different implementations of RCU can provide implementation-specific
-additional information.  For example, SRCU provides the following
+additional information.  For example, Tree SRCU provides the following
 additional line:
 
-	srcu-torture: per-CPU(idx=1): 0(0,1) 1(0,1) 2(0,0) 3(0,1)
+	srcud-torture: Tree SRCU per-CPU(idx=0): 0(35,-21) 1(-4,24) 2(1,1) 3(-26,20) 4(28,-47) 5(-9,4) 6(-10,14) 7(-14,11) T(1,6)
 
-This line shows the per-CPU counter state.  The numbers in parentheses are
-the values of the "old" and "current" counters for the corresponding CPU.
-The "idx" value maps the "old" and "current" values to the underlying
-array, and is useful for debugging.
+This line shows the per-CPU counter state, in this case for Tree SRCU
+using a dynamically allocated srcu_struct (hence "srcud-" rather than
+"srcu-").  The numbers in parentheses are the values of the "old" and
+"current" counters for the corresponding CPU.  The "idx" value maps the
+"old" and "current" values to the underlying array, and is useful for
+debugging.  The final "T" entry contains the totals of the counters.
 
 
 USAGE
@@ -304,3 +306,9 @@ checked for such errors.  The "rmmod" command forces a "SUCCESS",
 "FAILURE", or "RCU_HOTPLUG" indication to be printk()ed.  The first
 two are self-explanatory, while the last indicates that while there
 were no RCU failures, CPU-hotplug problems were detected.
+
+However, the tools/testing/selftests/rcutorture/bin/kvm.sh script
+provides better automation, including automatic failure analysis.
+It assumes a qemu/kvm-enabled platform, and runs guest OSes out of initrd.
+See tools/testing/selftests/rcutorture/doc/initrd.txt for instructions
+on setting up such an initrd.
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 8ed6c9f6133c..df62466da4e0 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -890,6 +890,8 @@ SRCU:	Critical sections	Grace period		Barrier
 	srcu_read_lock_held
 
 SRCU:	Initialization/cleanup
+	DEFINE_SRCU
+	DEFINE_STATIC_SRCU
 	init_srcu_struct
 	cleanup_srcu_struct
 
@@ -913,7 +915,8 @@ a.	Will readers need to block?  If so, you need SRCU.
 b.	What about the -rt patchset?  If readers would need to block
 	in an non-rt kernel, you need SRCU.  If readers would block
 	in a -rt kernel, but not in a non-rt kernel, SRCU is not
-	necessary.
+	necessary.  (The -rt patchset turns spinlocks into sleeplocks,
+	hence this distinction.)
 
 c.	Do you need to treat NMI handlers, hardirq handlers,
 	and code segments with preemption disabled (whether
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d9c171ce4190..3a99cc96b6b1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2633,9 +2633,10 @@
 			In kernels built with CONFIG_NO_HZ_FULL=y, set
 			the specified list of CPUs whose tick will be stopped
 			whenever possible. The boot CPU will be forced outside
-			the range to maintain the timekeeping.
-			The CPUs in this range must also be included in the
-			rcu_nocbs= set.
+			the range to maintain the timekeeping.  Any CPUs
+			in this list will have their RCU callbacks offloaded,
+			just as if they had also been called out in the
+			rcu_nocbs= boot parameter.
 
 	noiotrap	[SH] Disables trapped I/O port accesses.
 
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 17b00914c6ab..8282099e0cbf 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -344,3 +344,52 @@ codecs, and devices with strict requirements for interface clocking.
 
 .. kernel-doc:: include/linux/clk.h
    :internal:
+
+Synchronization Primitives
+==========================
+
+Read-Copy Update (RCU)
+----------------------
+
+.. kernel-doc:: include/linux/rcupdate.h
+   :external:
+
+.. kernel-doc:: include/linux/rcupdate_wait.h
+   :external:
+
+.. kernel-doc:: include/linux/rcutree.h
+   :external:
+
+.. kernel-doc:: kernel/rcu/tree.c
+   :external:
+
+.. kernel-doc:: kernel/rcu/tree_plugin.h
+   :external:
+
+.. kernel-doc:: kernel/rcu/tree_exp.h
+   :external:
+
+.. kernel-doc:: kernel/rcu/update.c
+   :external:
+
+.. kernel-doc:: include/linux/srcu.h
+   :external:
+
+.. kernel-doc:: kernel/rcu/srcutree.c
+   :external:
+
+.. kernel-doc:: include/linux/rculist_bl.h
+   :external:
+
+.. kernel-doc:: include/linux/rculist.h
+   :external:
+
+.. kernel-doc:: include/linux/rculist_nulls.h
+   :external:
+
+.. kernel-doc:: include/linux/rcu_sync.h
+   :external:
+
+.. kernel-doc:: kernel/rcu/sync.c
+   :external:
+
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index d1d1716f904b..b759a60624fd 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -594,7 +594,24 @@ between the address load and the data load:
 This enforces the occurrence of one of the two implications, and prevents the
 third possibility from arising.
 
-A data-dependency barrier must also order against dependent writes:
+
+[!] Note that this extremely counterintuitive situation arises most easily on
+machines with split caches, so that, for example, one cache bank processes
+even-numbered cache lines and the other bank processes odd-numbered cache
+lines.  The pointer P might be stored in an odd-numbered cache line, and the
+variable B might be stored in an even-numbered cache line.  Then, if the
+even-numbered bank of the reading CPU's cache is extremely busy while the
+odd-numbered bank is idle, one can see the new value of the pointer P (&B),
+but the old value of the variable B (2).
+
+
+A data-dependency barrier is not required to order dependent writes
+because the CPUs that the Linux kernel supports don't do writes
+until they are certain (1) that the write will actually happen, (2)
+of the location of the write, and (3) of the value to be written.
+But please carefully read the "CONTROL DEPENDENCIES" section and the
+Documentation/RCU/rcu_dereference.txt file:  The compiler can and does
+break dependencies in a great many highly creative ways.
 
 	CPU 1		      CPU 2
 	===============	      ===============
@@ -603,29 +620,19 @@ A data-dependency barrier must also order against dependent writes:
 	<write barrier>
 	WRITE_ONCE(P, &B);
 			      Q = READ_ONCE(P);
-			      <data dependency barrier>
-			      *Q = 5;
+			      WRITE_ONCE(*Q, 5);
 
-The data-dependency barrier must order the read into Q with the store
-into *Q.  This prohibits this outcome:
+Therefore, no data-dependency barrier is required to order the read into
+Q with the store into *Q.  In other words, this outcome is prohibited,
+even without a data-dependency barrier:
 
 	(Q == &B) && (B == 4)
 
 Please note that this pattern should be rare.  After all, the whole point
 of dependency ordering is to -prevent- writes to the data structure, along
 with the expensive cache misses associated with those writes.  This pattern
-can be used to record rare error conditions and the like, and the ordering
-prevents such records from being lost.
-
-
-[!] Note that this extremely counterintuitive situation arises most easily on
-machines with split caches, so that, for example, one cache bank processes
-even-numbered cache lines and the other bank processes odd-numbered cache
-lines.  The pointer P might be stored in an odd-numbered cache line, and the
-variable B might be stored in an even-numbered cache line.  Then, if the
-even-numbered bank of the reading CPU's cache is extremely busy while the
-odd-numbered bank is idle, one can see the new value of the pointer P (&B),
-but the old value of the variable B (2).
+can be used to record rare error conditions and the like, and the CPUs'
+naturally occurring ordering prevents such records from being lost.
 
 
 The data dependency barrier is very important to the RCU system,
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c198360..ce61d1fe08ca 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -61,6 +61,7 @@ show up in /proc/sys/kernel:
 - perf_cpu_time_max_percent
 - perf_event_paranoid
 - perf_event_max_stack
+- perf_event_mlock_kb
 - perf_event_max_contexts_per_stack
 - pid_max
 - powersave-nap               [ PPC only ]
@@ -654,7 +655,9 @@ Controls use of the performance events system by unprivileged
 users (without CAP_SYS_ADMIN).  The default value is 2.
 
  -1: Allow use of (almost) all events by all users
->=0: Disallow raw tracepoint access by users without CAP_IOC_LOCK
+     Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
+>=0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
+     Disallow raw tracepoint access by users without CAP_SYS_ADMIN
 >=1: Disallow CPU event access by users without CAP_SYS_ADMIN
 >=2: Disallow kernel profiling by users without CAP_SYS_ADMIN
 
@@ -673,6 +676,14 @@ The default value is 127.
 
 ==============================================================
 
+perf_event_mlock_kb:
+
+Controls the size of the per-cpu ring buffer not counted against the mlock limit.
+
+The default value is 512 kB + 1 page.
+
+==============================================================
+
 perf_event_max_contexts_per_stack:
 
 Controls maximum number of stack frame context entries for
diff --git a/Documentation/x86/early-microcode.txt b/Documentation/x86/early-microcode.txt
deleted file mode 100644
index 07749e7f3d50..000000000000
--- a/Documentation/x86/early-microcode.txt
+++ /dev/null
@@ -1,70 +0,0 @@
-Early load microcode
-====================
-By Fenghua Yu <fenghua.yu@intel.com>
-
-Kernel can update microcode in early phase of boot time. Loading microcode early
-can fix CPU issues before they are observed during kernel boot time.
-
-Microcode is stored in an initrd file. The microcode is read from the initrd
-file and loaded to CPUs during boot time.
-
-The format of the combined initrd image is microcode in cpio format followed by
-the initrd image (maybe compressed). Kernel parses the combined initrd image
-during boot time. The microcode file in cpio name space is:
-on Intel: kernel/x86/microcode/GenuineIntel.bin
-on AMD  : kernel/x86/microcode/AuthenticAMD.bin
-
-During BSP boot (before SMP starts), if the kernel finds the microcode file in
-the initrd file, it parses the microcode and saves matching microcode in memory.
-If matching microcode is found, it will be uploaded in BSP and later on in all
-APs.
-
-The cached microcode patch is applied when CPUs resume from a sleep state.
-
-There are two legacy user space interfaces to load microcode, either through
-/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file
-in sysfs.
-
-In addition to these two legacy methods, the early loading method described
-here is the third method with which microcode can be uploaded to a system's
-CPUs.
-
-The following example script shows how to generate a new combined initrd file in
-/boot/initrd-3.5.0.ucode.img with original microcode microcode.bin and
-original initrd image /boot/initrd-3.5.0.img.
-
-mkdir initrd
-cd initrd
-mkdir -p kernel/x86/microcode
-cp ../microcode.bin kernel/x86/microcode/GenuineIntel.bin (or AuthenticAMD.bin)
-find . | cpio -o -H newc >../ucode.cpio
-cd ..
-cat ucode.cpio /boot/initrd-3.5.0.img >/boot/initrd-3.5.0.ucode.img
-
-Builtin microcode
-=================
-
-We can also load builtin microcode supplied through the regular firmware
-builtin method CONFIG_FIRMWARE_IN_KERNEL. Only 64-bit is currently
-supported.
-
-Here's an example:
-
-CONFIG_FIRMWARE_IN_KERNEL=y
-CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin"
-CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
-
-This basically means, you have the following tree structure locally:
-
-/lib/firmware/
-|-- amd-ucode
-...
-|   |-- microcode_amd_fam15h.bin
-...
-|-- intel-ucode
-...
-|   |-- 06-3a-09
-...
-
-so that the build system can find those files and integrate them into
-the final kernel image. The early loader finds them and applies them.
diff --git a/Documentation/x86/microcode.txt b/Documentation/x86/microcode.txt
new file mode 100644
index 000000000000..f57e1b45e628
--- /dev/null
+++ b/Documentation/x86/microcode.txt
@@ -0,0 +1,137 @@
+	The Linux Microcode Loader
+
+Authors: Fenghua Yu <fenghua.yu@intel.com>
+	 Borislav Petkov <bp@suse.de>
+
+The kernel has an x86 microcode loading facility which is supposed to
+provide microcode loading methods in the OS. Potential use cases are
+updating the microcode on platforms beyond the OEM End-Of-Life support,
+and updating the microcode on long-running systems without rebooting.
+
+The loader supports three loading methods:
+
+1. Early load microcode
+=======================
+
+The kernel can update microcode very early during boot. Loading
+microcode early can fix CPU issues before they are observed during
+kernel boot time.
+
+The microcode is stored in an initrd file. During boot, it is read from
+that file and loaded into the CPU cores.
+
+The format of the combined initrd image is microcode in (uncompressed)
+cpio format followed by the (possibly compressed) initrd image. The
+loader parses the combined initrd image during boot.
+
+The microcode files in cpio name space are:
+
+on Intel: kernel/x86/microcode/GenuineIntel.bin
+on AMD  : kernel/x86/microcode/AuthenticAMD.bin
+
+During BSP (BootStrap Processor) boot (pre-SMP), the kernel
+scans the microcode file in the initrd. If microcode matching the
+CPU is found, it will be applied in the BSP and later on in all APs
+(Application Processors).
+
+The loader also saves the matching microcode for the CPU in memory.
+Thus, the cached microcode patch is applied when CPUs resume from a
+sleep state.
+
+Here's a crude example of how to prepare an initrd with microcode. This
+is normally done automatically by the distribution when recreating the
+initrd, so you don't really have to do it yourself; it is documented
+here for future reference only.
+
+---
+  #!/bin/bash
+
+  if [ -z "$1" ]; then
+      echo "You need to supply an initrd file"
+      exit 1
+  fi
+
+  INITRD="$(readlink -f "$1")"	# absolute path, since we cd below
+
+  DSTDIR=kernel/x86/microcode
+  TMPDIR=/tmp/initrd
+
+  rm -rf $TMPDIR
+
+  mkdir $TMPDIR
+  cd $TMPDIR
+  mkdir -p $DSTDIR
+
+  if [ -d /lib/firmware/amd-ucode ]; then
+          cat /lib/firmware/amd-ucode/microcode_amd*.bin > $DSTDIR/AuthenticAMD.bin
+  fi
+
+  if [ -d /lib/firmware/intel-ucode ]; then
+          cat /lib/firmware/intel-ucode/* > $DSTDIR/GenuineIntel.bin
+  fi
+
+  find . | cpio -o -H newc >../ucode.cpio
+  cd ..
+  mv $INITRD $INITRD.orig
+  cat ucode.cpio $INITRD.orig > $INITRD
+
+  rm -rf $TMPDIR
+---
+
+The system needs to have the microcode packages installed into
+/lib/firmware, or you need to fix up the paths above if yours are
+somewhere else and/or you've downloaded them directly from the processor
+vendor's site.
+
+2. Late loading
+===============
+
+There are two legacy user space interfaces to load microcode, either through
+/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file
+in sysfs.
+
+The /dev/cpu/microcode method is deprecated because it needs a special
+userspace tool.
+
+The easier method is simply installing the microcode packages your distro
+supplies and running:
+
+# echo 1 > /sys/devices/system/cpu/microcode/reload
+
+as root.
+
+The loading mechanism looks for microcode blobs in
+/lib/firmware/{intel-ucode,amd-ucode}. The default distro installation
+packages already put them there.
+
+3. Builtin microcode
+====================
+
+The loader also supports loading of builtin microcode supplied through
+the regular firmware builtin method CONFIG_FIRMWARE_IN_KERNEL. Only
+64-bit is currently supported.
+
+Here's an example:
+
+CONFIG_FIRMWARE_IN_KERNEL=y
+CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin"
+CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
+
+This basically means that you have the following tree structure locally:
+
+/lib/firmware/
+|-- amd-ucode
+...
+|   |-- microcode_amd_fam15h.bin
+...
+|-- intel-ucode
+...
+|   |-- 06-3a-09
+...
+
+so that the build system can find those files and integrate them into
+the final kernel image. The early loader finds them and applies them.
+
+Needless to say, this method is not the most flexible one because it
+requires rebuilding the kernel each time updated microcode from the CPU
+vendor is available.
diff --git a/Documentation/x86/orc-unwinder.txt b/Documentation/x86/orc-unwinder.txt
new file mode 100644
index 000000000000..af0c9a4c65a6
--- /dev/null
+++ b/Documentation/x86/orc-unwinder.txt
@@ -0,0 +1,179 @@
+ORC unwinder
+============
+
+Overview
+--------
+
+The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
+similar in concept to a DWARF unwinder.  The difference is that the
+format of the ORC data is much simpler than DWARF, which in turn allows
+the ORC unwinder to be much simpler and faster.
+
+The ORC data consists of unwind tables which are generated by objtool.
+They contain out-of-band data which is used by the in-kernel ORC
+unwinder.  Objtool generates the ORC data by first doing compile-time
+stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
+all the code paths of a .o file, it determines information about the
+stack state at each instruction address in the file and outputs that
+information to the .orc_unwind and .orc_unwind_ip sections.
+
+The per-object ORC sections are combined at link time and are sorted and
+post-processed at boot time.  The unwinder uses the resulting data to
+correlate instruction addresses with their stack states at run time.
+
+
+ORC vs frame pointers
+---------------------
+
+With frame pointers enabled, GCC adds instrumentation code to every
+function in the kernel.  The kernel's .text size increases by about
+3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
+Gorman [1] have shown a slowdown of 5-10% for some workloads.
+
+In contrast, the ORC unwinder has no effect on text size or runtime
+performance, because the debuginfo is out of band.  So if you disable
+frame pointers and enable the ORC unwinder, you get a nice performance
+improvement across the board, and still have reliable stack traces.
+
+Ingo Molnar says:
+
+  "Note that it's not just a performance improvement, but also an
+  instruction cache locality improvement: 3.2% .text savings almost
+  directly transform into a similarly sized reduction in cache
+  footprint. That can transform to even higher speedups for workloads
+  whose cache locality is borderline."
+
+Another benefit of ORC compared to frame pointers is that it can
+reliably unwind across interrupts and exceptions.  Frame pointer based
+unwinds can sometimes skip the caller of the interrupted function, if it
+was a leaf function or if the interrupt hit before the frame pointer was
+saved.
+
+The main disadvantage of the ORC unwinder compared to frame pointers is
+that it needs more memory to store the ORC unwind tables: roughly 2-4MB
+depending on the kernel config.
+
+
+ORC vs DWARF
+------------
+
+ORC debuginfo's advantage over DWARF itself is that it's much simpler.
+It gets rid of the complex DWARF CFI state machine and also gets rid of
+the tracking of unnecessary registers.  This allows the unwinder to be
+much simpler, meaning fewer bugs, which is especially important for
+mission critical oops code.
+
+The simpler debuginfo format also enables the unwinder to be much faster
+than DWARF, which is important for perf and lockdep.  In a basic
+performance test by Jiri Slaby [2], the ORC unwinder was about 20x
+faster than an out-of-tree DWARF unwinder.  (Note: That measurement was
+taken before some performance tweaks were added, which doubled
+performance, so the speedup over DWARF may be closer to 40x.)
+
+The ORC data format does have a few downsides compared to DWARF.  ORC
+unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
+than DWARF-based eh_frame tables.
+
+Another potential downside is that, as GCC evolves, it's conceivable
+that the ORC data may end up being *too* simple to describe the state of
+the stack for certain optimizations.  But IMO this is unlikely because
+GCC saves the frame pointer for any unusual stack adjustments it does,
+so I suspect we'll really only ever need to keep track of the stack
+pointer and the frame pointer between call frames.  But even if we do
+end up having to track all the registers DWARF tracks, at least we will
+still be able to control the format, e.g. no complex state machines.
+
+
+ORC unwind table generation
+---------------------------
+
+The ORC data is generated by objtool.  With the existing compile-time
+stack metadata validation feature, objtool already follows all code
+paths, and so it already has all the information it needs to be able to
+generate ORC data from scratch.  So it's an easy step to go from stack
+validation to ORC data generation.
+
+It should be possible to instead generate the ORC data with a simple
+tool which converts DWARF to ORC data.  However, such a solution would
+be incomplete due to the kernel's extensive use of asm, inline asm, and
+special sections like exception tables.
+
+That could be rectified by manually annotating those special code paths
+using GNU assembler .cfi annotations in .S files, and homegrown
+annotations for inline asm in .c files.  But asm annotations were tried
+in the past and were found to be unmaintainable.  They were often
+incorrect/incomplete and made the code harder to read and keep updated.
+And based on looking at glibc code, annotating inline asm in .c files
+might be even worse.
+
+Objtool still needs a few annotations, but only in code which does
+unusual things to the stack like entry code.  And even then, far fewer
+annotations are needed than what DWARF would need, so they're much more
+maintainable than DWARF CFI annotations.
+
+So the advantages of using objtool to generate ORC data are that it
+gives more accurate debuginfo, with very few annotations.  It also
+insulates the kernel from toolchain bugs which can be very painful to
+deal with in the kernel since we often have to work around issues in
+older versions of the toolchain for years.
+
+The downside is that the unwinder now becomes dependent on objtool's
+ability to reverse engineer GCC code flow.  If GCC optimizations become
+too complicated for objtool to follow, the ORC data generation might
+stop working or become incomplete.  (It's worth noting that livepatch
+already has such a dependency on objtool's ability to follow GCC code
+flow.)
+
+If newer versions of GCC come up with some optimizations which break
+objtool, we may need to revisit the current implementation.  Some
+possible solutions would be asking GCC to make the optimizations more
+palatable, or having objtool use DWARF as an additional input, or
+creating a GCC plugin to assist objtool with its analysis.  But for now,
+objtool follows GCC code quite well.
+
+
+Unwinder implementation details
+-------------------------------
+
+Objtool generates the ORC data by integrating with the compile-time
+stack metadata validation feature, which is described in detail in
+tools/objtool/Documentation/stack-validation.txt.  After analyzing all
+the code paths of a .o file, it creates an array of orc_entry structs,
+and a parallel array of instruction addresses associated with those
+structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
+respectively.
+
+The ORC data is split into the two arrays for performance reasons, to
+make the searchable part of the data (.orc_unwind_ip) more compact.  The
+arrays are sorted in parallel at boot time.
+
+Performance is further improved by the use of a fast lookup table which
+is created at runtime.  The fast lookup table associates a given address
+with a range of indices for the .orc_unwind table, so that only a small
+subset of the table needs to be searched.
+
+
+Etymology
+---------
+
+Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
+enemies.  Similarly, the ORC unwinder was created in opposition to the
+complexity and slowness of DWARF.
+
+"Although Orcs rarely consider multiple solutions to a problem, they do
+excel at getting things done because they are creatures of action, not
+thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
+veracious ORC unwinder wastes no time or siloconic effort decoding
+variable-length zero-extended unsigned-integer byte-coded
+state-machine-based debug information entries.
+
+Similar to how Orcs frequently unravel the well-intentioned plans of
+their adversaries, the ORC unwinder frequently unravels stacks with
+brutal, unyielding efficiency.
+
+ORC stands for Oops Rewind Capability.
+
+
+[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
+[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
+[3] http://dustin.wikidot.com/half-orcs-and-orcs