Merge tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking and atomics updates from Thomas Gleixner: "The regular pile: - A few improvements to the mutex code - Documentation updates for atomics to clarify the difference between cmpxchg() and try_cmpxchg() and to explain the forward progress expectations. - Simplification of the atomics fallback generator - The addition of arch_atomic_long*() variants and generic arch_*() bitops based on them. - Add the missing might_sleep() invocations to the down*() operations of semaphores. The PREEMPT_RT locking core: - Scheduler updates to support the state preserving mechanism for 'sleeping' spin- and rwlocks on RT. This mechanism is carefully preserving the state of the task when blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups targeted at the same task into account. The preserved or updated (via a regular wakeup) state is restored when the lock has been acquired. - Restructuring of the rtmutex code so it can be utilized and extended for the RT specific lock variants. - Restructuring of the ww_mutex code to allow sharing of the ww_mutex specific functionality for rtmutex based ww_mutexes. - Header file disentangling to allow substitution of the regular lock implementations with the PREEMPT_RT variants without creating an unmaintainable #ifdef mess. - Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock implementations. Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT implementation is writer unfair because it is infeasible to do priority inheritance on multiple readers. Experience over the years has shown that real-time workloads are not the typical workloads which are sensitive to writer starvation. The alternative solution would be to allow only a single reader which has been tried and discarded as it is a major bottleneck especially for mmap_sem. Aside of that many of the writer starvation critical usage sites have been converted to a writer side mutex/spinlock and RCU read side protections in the past decade so that the issue is less prominent than it used to be. - The actual rtmutex based lock substitutions for PREEMPT_RT enabled kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and rwlock_t. The spin/rw_lock*() functions disable migration across the critical section to preserve the existing semantics vs per-CPU variables. - Rework of the futex REQUEUE_PI mechanism to handle the case of early wake-ups which interleave with a re-queue operation to prevent the situation that a task would be blocked on both the rtmutex associated to the outer futex and the rtmutex based hash bucket spinlock. While this situation cannot happen on !RT enabled kernels the changes make the underlying concurrency problems easier to understand in general. As a result the difference between !RT and RT kernels is reduced to the handling of waiting for the critical section. !RT kernels simply spin-wait as before and RT kernels utilize rcu_wait(). - The substitution of local_lock for PREEMPT_RT with a spinlock which protects the critical section while staying preemptible. The CPU locality is established by disabling migration. The underlying concepts of this code have been in use in PREEMPT_RT for way more than a decade. The code has been refactored several times over the years and this final incarnation has been optimized once again to be as non-intrusive as possible, i.e. the RT specific parts are mostly isolated. It has been extensively tested in the 5.14-rt patch series and it has been verified that !RT kernels are not affected by these changes" * tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits) locking/rtmutex: Return success on deadlock for ww_mutex waiters locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes locking/rtmutex: Dequeue waiter on ww_mutex deadlock locking/rtmutex: Dont dereference waiter lockless locking/semaphore: Add might_sleep() to down_*() family locking/ww_mutex: Initialize waiter.ww_ctx properly static_call: Update API documentation locking/local_lock: Add PREEMPT_RT support locking/spinlock/rt: Prepare for RT local_lock locking/rtmutex: Add adaptive spinwait mechanism locking/rtmutex: Implement equal priority lock stealing preempt: Adjust PREEMPT_LOCK_OFFSET for RT locking/rtmutex: Prevent lockdep false positive with PI futexes futex: Prevent requeue_pi() lock nesting issue on RT futex: Simplify handle_early_requeue_pi_wakeup() futex: Reorder sanity checks in futex_requeue() futex: Clarify comment in futex_requeue() futex: Restructure futex_requeue() futex: Correct the number of requeued waiters for PI futex: Remove bogus condition for requeue PI ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2021-08-30 14:26:36 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2021-08-30 14:26:36 -0700
commit: e5e726f7bb9f711102edea7e5bd511835640e3b4 (patch)
tree: e9f2d1696cd7a9664a04735568b2adbd2527e2e0 /include/linux/sched.h
parent: 08403e2174c4ac8b23922b5b7abe670129f8acb5 (diff)
parent: a055fcc132d4c25b96d1115aea514258810dc6fc (diff)
1 files changed, 94 insertions, 25 deletions
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7c6a77da8b10..af7179f8572c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -95,7 +95,9 @@ struct task_group;
 #define TASK_WAKING			0x0200
 #define TASK_NOLOAD			0x0400
 #define TASK_NEW			0x0800
-#define TASK_STATE_MAX			0x1000
+/* RT specific auxilliary flag to mark RT lock waiters */
+#define TASK_RTLOCK_WAIT		0x1000
+#define TASK_STATE_MAX			0x2000
 
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
@@ -121,8 +123,6 @@ struct task_group;
 
 #define task_is_stopped_or_traced(task)	((READ_ONCE(task->__state) & (__TASK_STOPPED | __TASK_TRACED)) != 0)
 
-#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
-
 /*
  * Special states are those that do not use the normal wait-loop pattern. See
  * the comment with set_special_state().
@@ -130,30 +130,37 @@ struct task_group;
 #define is_special_task_state(state)				\
 	((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | TASK_DEAD))
 
-#define __set_current_state(state_value)			\
-	do {							\
-		WARN_ON_ONCE(is_special_task_state(state_value));\
-		current->task_state_change = _THIS_IP_;		\
-		WRITE_ONCE(current->__state, (state_value));	\
-	} while (0)
-
-#define set_current_state(state_value)				\
-	do {							\
-		WARN_ON_ONCE(is_special_task_state(state_value));\
-		current->task_state_change = _THIS_IP_;		\
-		smp_store_mb(current->__state, (state_value));	\
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+# define debug_normal_state_change(state_value)				\
+	do {								\
+		WARN_ON_ONCE(is_special_task_state(state_value));	\
+		current->task_state_change = _THIS_IP_;			\
 	} while (0)
 
-#define set_special_state(state_value)					\
+# define debug_special_state_change(state_value)			\
 	do {								\
-		unsigned long flags; /* may shadow */			\
 		WARN_ON_ONCE(!is_special_task_state(state_value));	\
-		raw_spin_lock_irqsave(&current->pi_lock, flags);	\
 		current->task_state_change = _THIS_IP_;			\
-		WRITE_ONCE(current->__state, (state_value));		\
-		raw_spin_unlock_irqrestore(&current->pi_lock, flags);	\
 	} while (0)
+
+# define debug_rtlock_wait_set_state()					\
+	do {								 \
+		current->saved_state_change = current->task_state_change;\
+		current->task_state_change = _THIS_IP_;			 \
+	} while (0)
+
+# define debug_rtlock_wait_restore_state()				\
+	do {								 \
+		current->task_state_change = current->saved_state_change;\
+	} while (0)
+
 #else
+# define debug_normal_state_change(cond)	do { } while (0)
+# define debug_special_state_change(cond)	do { } while (0)
+# define debug_rtlock_wait_set_state()		do { } while (0)
+# define debug_rtlock_wait_restore_state()	do { } while (0)
+#endif
+
 /*
  * set_current_state() includes a barrier so that the write of current->state
  * is correctly serialised wrt the caller's subsequent test of whether to
@@ -192,26 +199,77 @@ struct task_group;
  * Also see the comments of try_to_wake_up().
  */
 #define __set_current_state(state_value)				\
-	WRITE_ONCE(current->__state, (state_value))
+	do {								\
+		debug_normal_state_change((state_value));		\
+		WRITE_ONCE(current->__state, (state_value));		\
+	} while (0)
 
 #define set_current_state(state_value)					\
-	smp_store_mb(current->__state, (state_value))
+	do {								\
+		debug_normal_state_change((state_value));		\
+		smp_store_mb(current->__state, (state_value));		\
+	} while (0)
 
 /*
  * set_special_state() should be used for those states when the blocking task
  * can not use the regular condition based wait-loop. In that case we must
- * serialize against wakeups such that any possible in-flight TASK_RUNNING stores
- * will not collide with our state change.
+ * serialize against wakeups such that any possible in-flight TASK_RUNNING
+ * stores will not collide with our state change.
  */
 #define set_special_state(state_value)					\
 	do {								\
 		unsigned long flags; /* may shadow */			\
+									\
 		raw_spin_lock_irqsave(&current->pi_lock, flags);	\
+		debug_special_state_change((state_value));		\
 		WRITE_ONCE(current->__state, (state_value));		\
 		raw_spin_unlock_irqrestore(&current->pi_lock, flags);	\
 	} while (0)
 
-#endif
+/*
+ * PREEMPT_RT specific variants for "sleeping" spin/rwlocks
+ *
+ * RT's spin/rwlock substitutions are state preserving. The state of the
+ * task when blocking on the lock is saved in task_struct::saved_state and
+ * restored after the lock has been acquired.  These operations are
+ * serialized by task_struct::pi_lock against try_to_wake_up(). Any non RT
+ * lock related wakeups while the task is blocked on the lock are
+ * redirected to operate on task_struct::saved_state to ensure that these
+ * are not dropped. On restore task_struct::saved_state is set to
+ * TASK_RUNNING so any wakeup attempt redirected to saved_state will fail.
+ *
+ * The lock operation looks like this:
+ *
+ *	current_save_and_set_rtlock_wait_state();
+ *	for (;;) {
+ *		if (try_lock())
+ *			break;
+ *		raw_spin_unlock_irq(&lock->wait_lock);
+ *		schedule_rtlock();
+ *		raw_spin_lock_irq(&lock->wait_lock);
+ *		set_current_state(TASK_RTLOCK_WAIT);
+ *	}
+ *	current_restore_rtlock_saved_state();
+ */
+#define current_save_and_set_rtlock_wait_state()			\
+	do {								\
+		lockdep_assert_irqs_disabled();				\
+		raw_spin_lock(&current->pi_lock);			\
+		current->saved_state = current->__state;		\
+		debug_rtlock_wait_set_state();				\
+		WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT);		\
+		raw_spin_unlock(&current->pi_lock);			\
+	} while (0);
+
+#define current_restore_rtlock_saved_state()				\
+	do {								\
+		lockdep_assert_irqs_disabled();				\
+		raw_spin_lock(&current->pi_lock);			\
+		debug_rtlock_wait_restore_state();			\
+		WRITE_ONCE(current->__state, current->saved_state);	\
+		current->saved_state = TASK_RUNNING;			\
+		raw_spin_unlock(&current->pi_lock);			\
+	} while (0);
 
 #define get_current_state()	READ_ONCE(current->__state)
 
@@ -230,6 +288,9 @@ extern long schedule_timeout_idle(long timeout);
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 asmlinkage void preempt_schedule_irq(void);
+#ifdef CONFIG_PREEMPT_RT
+ extern void schedule_rtlock(void);
+#endif
 
 extern int __must_check io_schedule_prepare(void);
 extern void io_schedule_finish(int token);
@@ -668,6 +729,11 @@ struct task_struct {
 #endif
 	unsigned int			__state;
 
+#ifdef CONFIG_PREEMPT_RT
+	/* saved state for "spinlock sleepers" */
+	unsigned int			saved_state;
+#endif
+
 	/*
 	 * This begins the randomizable portion of task_struct. Only
 	 * scheduling-critical items should be added above here.
@@ -1362,6 +1428,9 @@ struct task_struct {
 	struct kmap_ctrl		kmap_ctrl;
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long			task_state_change;
+# ifdef CONFIG_PREEMPT_RT
+	unsigned long			saved_state_change;
+# endif
 #endif
 	int				pagefault_disabled;
 #ifdef CONFIG_MMU
author	Linus Torvalds <torvalds@linux-foundation.org>	2021-08-30 14:26:36 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2021-08-30 14:26:36 -0700
commit	e5e726f7bb9f711102edea7e5bd511835640e3b4 (patch)
tree	e9f2d1696cd7a9664a04735568b2adbd2527e2e0 /include/linux/sched.h
parent	08403e2174c4ac8b23922b5b7abe670129f8acb5 (diff)
parent	a055fcc132d4c25b96d1115aea514258810dc6fc (diff)