Diffstat (limited to 'Documentation/core-api/real-time/differences.rst')
-rw-r--r--  Documentation/core-api/real-time/differences.rst  242
1 file changed, 242 insertions, 0 deletions
diff --git a/Documentation/core-api/real-time/differences.rst b/Documentation/core-api/real-time/differences.rst
new file mode 100644
index 000000000000..83ec9aa1c61a
--- /dev/null
+++ b/Documentation/core-api/real-time/differences.rst
@@ -0,0 +1,242 @@
.. SPDX-License-Identifier: GPL-2.0

===========================
How realtime kernels differ
===========================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface
=======

With forced-threaded interrupts and sleeping spin locks, code paths that
previously caused long scheduling latencies have been made preemptible and
moved into process context. This allows the scheduler to manage them more
effectively and respond to higher-priority tasks with reduced latency.

The following chapters provide an overview of key differences between a
PREEMPT_RT kernel and a standard, non-PREEMPT_RT kernel.

Locking
=======

Spinning locks such as spinlock_t are used to provide synchronization for data
structures accessed from both interrupt context and process context. For this
reason, locking functions are also available with the _irq() or _irqsave()
suffixes, which disable interrupts before acquiring the lock. This ensures that
the lock can be safely acquired in process context while interrupts are
enabled.

However, on a PREEMPT_RT system, interrupts are forced-threaded and no longer
run in hard IRQ context. As a result, there is no need to disable interrupts as
part of the locking procedure when using spinlock_t.

For low-level core components such as interrupt handling, the scheduler, or the
timer subsystem, the kernel uses raw_spinlock_t. This lock type preserves
traditional semantics: it disables preemption and, when used with _irq() or
_irqsave(), also disables interrupts. This ensures proper synchronization in
critical sections that must remain non-preemptible or run with interrupts
disabled.

Execution context
=================

Interrupt handling in a PREEMPT_RT system is invoked in process context through
the use of threaded interrupts. Other parts of the kernel also shift their
execution into threaded context by different mechanisms. The goal is to keep
execution paths preemptible, allowing the scheduler to interrupt them when a
higher-priority task needs to run.

Below is an overview of the kernel subsystems involved in this transition to
threaded, preemptible execution.

Interrupt handling
------------------

All interrupts are forced-threaded in a PREEMPT_RT system. The exceptions are
interrupts that are requested with the IRQF_NO_THREAD, IRQF_PERCPU, or
IRQF_ONESHOT flags.

The IRQF_ONESHOT flag is used together with threaded interrupts, meaning those
registered using request_threaded_irq() and providing only a threaded handler.
Its purpose is to keep the interrupt line masked until the threaded handler has
completed.

If a primary handler is also provided in this case, it is essential that the
handler does not acquire any sleeping locks, as it will not be threaded. The
handler should be minimal and must avoid introducing delays, such as
busy-waiting on hardware registers.

Soft interrupts, bottom half handling
-------------------------------------

Soft interrupts are raised by the interrupt handler and are executed after the
handler returns. Since they run in thread context, they can be preempted by
other threads. Do not assume that softirq context runs with preemption
disabled. This means you must not rely on mechanisms like local_bh_disable() in
process context to protect per-CPU variables. Because softirq handlers are
preemptible under PREEMPT_RT, this approach does not provide reliable
synchronization.

If this kind of protection is required for performance reasons, consider using
local_lock_nested_bh(). On non-PREEMPT_RT kernels, this allows lockdep to
verify that bottom halves are disabled. On PREEMPT_RT systems, it adds the
necessary locking to ensure proper protection.

Using local_lock_nested_bh() also makes the locking scope explicit and easier
for readers and maintainers to understand.
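A minimal sketch of this pattern is shown below. The foo_stats structure and
its fields are hypothetical and exist only for illustration; the caller is
assumed to already run with bottom halves disabled:

.. code-block:: c

    #include <linux/local_lock.h>
    #include <linux/percpu.h>
    #include <linux/types.h>

    /* Hypothetical per-CPU statistics protected by a nested-BH lock. */
    struct foo_stats {
            local_lock_t lock;
            u64 packets;
    };

    static DEFINE_PER_CPU(struct foo_stats, foo_stats) = {
            .lock = INIT_LOCAL_LOCK(lock),
    };

    /* Caller runs with bottom halves disabled, e.g. in softirq context. */
    static void foo_stats_inc(void)
    {
            local_lock_nested_bh(&foo_stats.lock);
            this_cpu_inc(foo_stats.packets);
            local_unlock_nested_bh(&foo_stats.lock);
    }

On non-PREEMPT_RT kernels the lock operations reduce to lockdep annotations;
on PREEMPT_RT they acquire the per-CPU lock that makes the update safe against
preemption.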
per-CPU variables
-----------------

Protecting access to per-CPU variables solely by using preempt_disable() should
be avoided, especially if the critical section has unbounded runtime or may
call APIs that can sleep.

If using a spinlock_t is considered too costly for performance reasons,
consider using local_lock_t. On non-PREEMPT_RT configurations, it introduces no
runtime overhead when lockdep is disabled. With lockdep enabled, it verifies
that the lock is only acquired in process context and never from softirq or
hard IRQ context.

On a PREEMPT_RT kernel, local_lock_t is implemented using a per-CPU spinlock_t,
which provides safe local protection for per-CPU data while keeping the system
preemptible.

Because spinlock_t on PREEMPT_RT does not disable preemption, it cannot be used
to protect per-CPU data by relying on implicit preemption disabling. If this
inherited preemption disabling is essential, and if local_lock_t cannot be used
due to performance constraints, brevity of the code, or abstraction boundaries
within an API, then preempt_disable_nested() may be a suitable alternative. On
non-PREEMPT_RT kernels, it verifies with lockdep that preemption is already
disabled. On PREEMPT_RT, it explicitly disables preemption.
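As a minimal sketch, a local_lock_t protecting a hypothetical per-CPU object
cache (the foo_cache structure and foo_cache_get() are made-up names) could
look like this:

.. code-block:: c

    #include <linux/local_lock.h>
    #include <linux/percpu.h>

    struct foo;

    /* Hypothetical per-CPU object cache. */
    struct foo_cache {
            local_lock_t lock;
            struct foo *obj;
    };

    static DEFINE_PER_CPU(struct foo_cache, foo_cache) = {
            .lock = INIT_LOCAL_LOCK(lock),
    };

    static struct foo *foo_cache_get(void)
    {
            struct foo *obj;

            /*
             * On non-PREEMPT_RT this disables preemption; on PREEMPT_RT it
             * acquires a per-CPU spinlock_t and remains preemptible.
             */
            local_lock(&foo_cache.lock);
            obj = this_cpu_read(foo_cache.obj);
            this_cpu_write(foo_cache.obj, NULL);
            local_unlock(&foo_cache.lock);

            return obj;
    }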
Timers
------

By default, an hrtimer is executed in hard interrupt context. The exception is
timers initialized with the HRTIMER_MODE_SOFT flag, which are executed in
softirq context.

On a PREEMPT_RT kernel, this behavior is reversed: hrtimers are executed in
softirq context by default, typically within the ktimersd thread. This thread
runs at the lowest real-time priority, ensuring it executes before any
SCHED_OTHER tasks but does not interfere with higher-priority real-time
threads. To explicitly request execution in hard interrupt context on
PREEMPT_RT, the timer must be marked with the HRTIMER_MODE_HARD flag.

Memory allocation
-----------------

The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
necessary to use GFP_ATOMIC when allocating memory from interrupt context or
from sections where preemption is disabled. This is because the allocator must
not sleep in these contexts while waiting for memory to become available.

However, this approach does not work on PREEMPT_RT kernels. The memory
allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
acquired when preemption is disabled. Fortunately, this is generally not a
problem, because PREEMPT_RT moves most contexts that would traditionally run
with preemption or interrupts disabled into threaded context, where sleeping is
allowed.

What remains problematic is code that explicitly disables preemption or
interrupts. In such cases, memory allocation must be performed outside the
critical section.

This restriction also applies to memory deallocation routines such as kfree()
and free_pages(), which may also involve internal locking and must not be
called from non-preemptible contexts.
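The following sketch shows the pattern of allocating before entering a
non-preemptible critical section; foo_head, foo_entry, and foo_add_entry() are
hypothetical names used only for illustration:

.. code-block:: c

    #include <linux/errno.h>
    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct foo_entry {
            struct list_head node;
            int value;
    };

    /*
     * The raw_spinlock_t keeps the critical section non-preemptible even on
     * PREEMPT_RT, so no memory may be allocated or freed while it is held.
     */
    struct foo_head {
            raw_spinlock_t lock;
            struct list_head list;
    };

    static int foo_add_entry(struct foo_head *head, int value)
    {
            struct foo_entry *entry;
            unsigned long flags;

            /*
             * Allocate before taking the lock; on PREEMPT_RT kmalloc() may
             * sleep and must not be called with preemption disabled.
             */
            entry = kmalloc(sizeof(*entry), GFP_KERNEL);
            if (!entry)
                    return -ENOMEM;
            entry->value = value;

            raw_spin_lock_irqsave(&head->lock, flags);
            list_add(&entry->node, &head->list);
            raw_spin_unlock_irqrestore(&head->lock, flags);

            return 0;
    }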
IRQ work
--------

The irq_work API provides a mechanism to schedule a callback in interrupt
context. It is designed for use in contexts where traditional scheduling is not
possible, such as from within NMI handlers or from inside the scheduler, where
using a workqueue would be unsafe.

On non-PREEMPT_RT systems, all irq_work items are executed immediately in
interrupt context. Items marked with IRQ_WORK_LAZY are deferred until the next
timer tick but are still executed in interrupt context.

On PREEMPT_RT systems, the execution model changes. Because irq_work callbacks
may acquire sleeping locks or have unbounded execution time, they are handled
in thread context by a per-CPU irq_work kernel thread. This thread runs at the
lowest real-time priority, ensuring it executes before any SCHED_OTHER tasks
but does not interfere with higher-priority real-time threads.

The exceptions are work items marked with IRQ_WORK_HARD_IRQ, which are still
executed in hard interrupt context. Lazy items (IRQ_WORK_LAZY) continue to be
deferred until the next timer tick and are also executed by the per-CPU
irq_work/ kernel thread.

RCU callbacks
-------------

RCU callbacks are invoked by default in softirq context. Their execution is
important because, depending on the use case, they either free memory or ensure
progress in state transitions. Running these callbacks as part of the softirq
chain can lead to undesired situations, such as contention for CPU resources
with other SCHED_OTHER tasks when executed within ksoftirqd.

To avoid running callbacks in softirq context, the RCU subsystem provides a
mechanism to execute them in process context instead. This behavior can be
enabled by setting the boot command-line parameter rcutree.use_softirq=0. This
setting is enforced in kernels configured with PREEMPT_RT.

Spin until ready
================

The "spin until ready" pattern involves repeatedly checking (spinning on) the
state of a data structure until it becomes available. This pattern assumes that
preemption, soft interrupts, or interrupts are disabled. If the data structure
is marked busy, it is presumed to be in use by another CPU, and spinning should
eventually succeed as that CPU makes progress.

Some examples are hrtimer_cancel() and timer_delete_sync(). These functions
cancel timers that execute with interrupts or soft interrupts disabled. If a
thread attempts to cancel a timer and finds it active, spinning until the
callback completes is safe because the callback can only run on another CPU and
will eventually finish.

On PREEMPT_RT kernels, however, timer callbacks run in thread context. This
introduces a challenge: a higher-priority thread attempting to cancel the timer
may preempt the timer callback thread. Since the scheduler cannot migrate the
callback thread to another CPU due to affinity constraints, spinning can result
in a livelock even on multiprocessor systems.

To avoid this, both the canceling and callback sides must use a handshake
mechanism that supports priority inheritance. This allows the canceling thread
to suspend until the callback completes, ensuring forward progress without
risking livelock.

To solve this problem at the API level, the sequence locks were extended to
allow a proper handover between the spinning reader and the possibly blocked
writer.

Sequence locks
--------------

Sequence counters and sequential locks are documented in
Documentation/locking/seqlock.rst.

The interface has been extended to ensure proper preemption states for the
writer and spinning reader contexts. This is achieved by embedding the writer
serialization lock directly into the sequence counter type, resulting in
composite types such as seqcount_spinlock_t or seqcount_mutex_t.

These composite types allow readers to detect an ongoing write and actively
boost the writer’s priority to help it complete its update instead of spinning
and waiting for its completion.

If the plain seqcount_t is used, extra care must be taken to synchronize the
reader with the writer during updates. The writer must ensure its update is
serialized and non-preemptible relative to the reader. This cannot be achieved
using a regular spinlock_t because spinlock_t on PREEMPT_RT does not disable
preemption. In such cases, using seqcount_spinlock_t is the preferred solution.

However, if there is no spinning involved, i.e., if the reader only needs to
detect whether a write has started and does not need to serialize against it,
then using seqcount_t is reasonable.
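As a minimal sketch, a seqcount_spinlock_t protecting a two-field state could
look like the following; the foo_state structure and functions are
hypothetical:

.. code-block:: c

    #include <linux/seqlock.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    /* Hypothetical state updated by a writer and summed by readers. */
    struct foo_state {
            spinlock_t lock;
            seqcount_spinlock_t seq;
            u64 a, b;
    };

    static void foo_state_init(struct foo_state *s)
    {
            spin_lock_init(&s->lock);
            seqcount_spinlock_init(&s->seq, &s->lock);
            s->a = 0;
            s->b = 0;
    }

    static void foo_write(struct foo_state *s, u64 a, u64 b)
    {
            /*
             * The embedded lock serializes writers; on PREEMPT_RT it also
             * allows a reader to boost a preempted writer instead of
             * spinning indefinitely.
             */
            spin_lock(&s->lock);
            write_seqcount_begin(&s->seq);
            s->a = a;
            s->b = b;
            write_seqcount_end(&s->seq);
            spin_unlock(&s->lock);
    }

    static u64 foo_read_sum(struct foo_state *s)
    {
            unsigned int seq;
            u64 sum;

            do {
                    seq = read_seqcount_begin(&s->seq);
                    sum = s->a + s->b;
            } while (read_seqcount_retry(&s->seq, seq));

            return sum;
    }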